The commoditization of technology has reached its pinnacle with the advent of the recent paradigm of Cloud Computing. Infosys Cloud Computing blog is a platform to exchange thoughts, ideas and opinions with Infosys experts on Cloud Computing


AWS Datalake - Let's Dive Deep

With the surge in information technology, organizations must deal with huge amounts of data. Storing data at such a large scale, and deriving insights of real business value from it, is important. There are different ways to approach this, such as data warehouse technologies and data lakes. This article explores how AWS offerings support data lake solutions.



Pentaho CTO James Dixon is credited with coining the term "data lake". He describes a data mart (a subset of a data warehouse) as analogous to a bottle of water, "cleansed, packaged and structured for easy consumption", while "a data lake is more like a water body in its natural state. Data flows from the streams (the source systems) to the lake. Users have access to the lake to examine, take samples or dive in".


As we move into the world of IoT and machine learning to make better-informed business decisions, data has become the most valuable asset for organizations. From clickstreams to IoT, mobile apps to social media, and data generated by business applications, it is all data. The volume is massive, and organizations are looking for ways to deal with it. Consequently, data lakes are becoming more popular by the day.

Data warehouses have long been used mainly for operational reporting and analysis. Generically, a data warehouse is a relational database suited to a pre-defined schema and data structure, optimized for fast SQL queries. Data received from transactional systems and business applications is cleaned, transformed and enriched to serve as a "single source of truth".

Data lakes, in contrast, can capture relational as well as non-relational data, and the data structure or schema need not be defined up front. This means that everything from structured data out of business applications to non-relational data such as clickstreams, social media feeds and IoT device output can be captured in a data lake.

Business analysts and data scientists can then use this diverse data to run SQL queries, real-time analytics, big data analytics and machine learning to derive trends and conclusions of real business value, for example using machine learning to predict future outcomes and prescribe actions for rapid response.

Think of a data lake as a centralized repository that can store all data in real time, irrespective of its source, structure, size and type. Data is kept in its raw form and is transformed only when it is ready to be used.
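This "transform only on read" idea can be sketched in a few lines of Python: records of mixed shapes are stored in the lake exactly as they arrive, and a schema is applied only when an analyst reads them. A minimal illustration; the field names and event shapes are invented for the example.

```python
import json

# Raw events land in the lake as-is, whatever their shape (schema-on-read).
raw_events = [
    '{"user": "alice", "action": "click", "page": "/home"}',   # clickstream
    '{"device_id": 42, "temp_c": 21.5}',                       # IoT reading
    '{"user": "bob", "action": "purchase", "amount": 19.99}',  # app event
]

def read_clickstream(lines):
    """Apply a schema only at read time: keep just the clickstream events."""
    for line in lines:
        record = json.loads(line)
        if "action" in record:
            yield {"user": record["user"], "action": record["action"]}

clicks = list(read_clickstream(raw_events))
```

The IoT reading stays untouched in storage; a different consumer could later project it through its own schema without any upfront transformation.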

Analysts can process this raw data with a range of analytic tools and frameworks, e.g. open-source frameworks such as Apache Hadoop, Apache Spark and Presto, as well as commercial offerings from many business intelligence vendors. All of this analysis can be done without moving the data to a separate analytics system.

The biggest challenge with data lakes today, however, is that deploying and managing them requires many complex and laborious manual tasks, such as:

  • Load data from diverse sources and monitor those data flows
  • Match linked records
  • Turn on encryption and manage keys
  • Provide access to data sets
  • Set up partitions
  • Define transformation jobs and monitor their operation
  • Deduplicate redundant data
  • Re-organize data into columnar formats
  • Configure access control settings and audit them periodically
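Two of the tasks above, deduplicating records and laying data out in partitions, can be sketched in plain Python. The Hive-style `year=/month=/day=` S3 key layout shown is a common convention for partitioned lakes, not a requirement, and the table and file names are illustrative.

```python
from datetime import date

def dedupe(records, key="id"):
    """Keep only the first occurrence of each record id."""
    seen, unique = set(), []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique

def partition_key(table: str, d: date, filename: str) -> str:
    """Build a Hive-style partition path under an S3 prefix."""
    return f"{table}/year={d.year}/month={d.month:02d}/day={d.day:02d}/{filename}"

records = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 1, "v": "a-dup"}]
unique = dedupe(records)
key = partition_key("clickstream", date(2024, 1, 5), "events.parquet")
```

Query engines such as Athena and Presto can prune partitions laid out this way, which is exactly why "setting up partitions" is worth the manual effort.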

Deploying a data lake on AWS is easy, and there are two ways to do it: an automated data lake built from a combination of services, deployed through AWS's infrastructure-as-code service, AWS CloudFormation; or a managed service called AWS Lake Formation.

Data Lake Solution on AWS

This solution describes the AWS services required to build a data lake in a JSON or YAML template. Executing the template deploys the data lake using AWS native services such as Amazon S3, Amazon Athena, AWS Glue, Amazon DynamoDB, Amazon CloudWatch and Amazon Elasticsearch Service.
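As a rough sketch of what such a template looks like, the skeleton below is assembled as a Python dict and serialized to JSON. The resource names and properties are illustrative only; the actual solution template defines many more resources (Lambda functions, the CloudFront console, Cognito pools, and so on).

```python
import json

# Minimal illustrative skeleton: an encrypted data lake bucket plus a
# Glue catalog database. CloudFormation resource types are real; the
# logical names and database name are made up for this example.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Illustrative data lake skeleton",
    "Resources": {
        "DataLakeBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {
                "BucketEncryption": {
                    "ServerSideEncryptionConfiguration": [
                        {"ServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}
                    ]
                }
            },
        },
        "CatalogDatabase": {
            "Type": "AWS::Glue::Database",
            "Properties": {
                "CatalogId": {"Ref": "AWS::AccountId"},
                "DatabaseInput": {"Name": "datalake_catalog"},
            },
        },
    },
}

template_json = json.dumps(template, indent=2)
```

A template like this would then be deployed with the CloudFormation console or the `aws cloudformation deploy` CLI command.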

Features of the data lake solution on AWS:

  • Flexible and scalable - Flexible enough to ingest all types of data (as-is) at scale, with design components that support data encryption, search, analysis and querying at scale.
  • Access control and data security - Granular access-control policies and data security mechanisms protect all data stored in the data lake.
  • Leverages AWS managed services - e.g. Amazon Kinesis, AWS Direct Connect or AWS Snowball/Snowmobile to transfer large amounts of data, and AWS Data Pipeline, Amazon EMR and Amazon Elasticsearch Service for data processing and analysis.


Image source: Amazon Web Services

The AWS data lake solution has a serverless architecture (no EC2 instances to deploy or manage). It uses Amazon S3 for storage, and processing is done by a microservices layer written with AWS Lambda. The solution deploys a data lake console into an Amazon S3 bucket configured for static website hosting, and configures an Amazon CloudFront distribution as the console's entry point.
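Each microservice in that layer is essentially a Lambda handler invoked through API Gateway. A minimal sketch of what one such handler could look like with the Lambda proxy integration; the `/datasets` route and the response body are hypothetical, not the solution's actual API.

```python
import json

def handler(event, context):
    """Hypothetical API Gateway (Lambda proxy) handler that lists datasets.
    A real data lake microservice would query DynamoDB or S3 here."""
    if event.get("httpMethod") == "GET" and event.get("path") == "/datasets":
        body = {"datasets": ["clickstream", "iot-telemetry"]}
        return {"statusCode": 200, "body": json.dumps(body)}
    return {"statusCode": 404, "body": json.dumps({"error": "not found"})}

# Local invocation with a fake API Gateway event:
response = handler({"httpMethod": "GET", "path": "/datasets"}, None)
```

Because handlers take a plain event dict, they can be unit-tested locally like this before being wired to API Gateway.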

Below are the AWS offerings used in the data lake solution:

  • Amazon API Gateway - Provides access to the data lake microservices. These microservices interact with Amazon S3, Amazon Athena, AWS Glue, Amazon Elasticsearch Service, Amazon DynamoDB and Amazon CloudWatch Logs to provide data storage, management and audit functions.
  • AWS Lambda - Runs the microservices.
  • Amazon Elasticsearch Service - Provides robust search capabilities.
  • Amazon Cognito - Handles user authentication.
  • AWS Glue - Performs data transformation.
  • Amazon Athena - Runs analysis queries.
  • Amazon S3 - Provides storage, leveraging the security, durability and scalability of S3.
  • Amazon DynamoDB - Manages metadata.
  • AWS KMS - Provides key management for security.
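To make DynamoDB's metadata role concrete, the item a data lake "package" might carry can be sketched as a plain dict. The attribute names here are invented for illustration; the solution defines its own metadata schema.

```python
from datetime import datetime, timezone

def make_package_item(package_id: str, owner: str, s3_prefix: str) -> dict:
    """Build a hypothetical metadata item describing a data lake package."""
    return {
        "package_id": package_id,  # partition key
        "owner": owner,
        "s3_prefix": s3_prefix,    # where the package's objects live
        "created_at": datetime.now(timezone.utc).isoformat(),
    }

item = make_package_item("pkg-001", "analytics-team", "s3://lake/clickstream/")
# With boto3 and a real table, this would be written as:
#   boto3.resource("dynamodb").Table("data-lake-packages").put_item(Item=item)
```

Keeping the metadata (owner, location, timestamps) separate from the raw objects in S3 is what lets the console search and audit the lake without scanning the data itself.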

AWS Lake Formation

At re:Invent last year, AWS announced a new managed service for data lake deployment called AWS Lake Formation, which simplifies deployment even further. Until recently, AWS Lake Formation was available only in limited preview; since September 2019 it has been generally available.


Image source: Amazon Web Services

With Lake Formation, users just need to define where the data should reside and which data access and security policies should apply. Lake Formation then automatically takes care of the following, using machine learning for tasks such as matching and deduplication:

  • Imports data into the Amazon S3 data lake from databases running in AWS, from external data sources and from other AWS sources.
  • Catalogs, labels and transforms the data.
  • Cleans and deduplicates the data.
  • Takes care of security by enforcing encryption, managing access control and implementing audit logging.
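The access-control part of the list above is expressed as grants per principal and resource. As a sketch, the request a single grant translates to with boto3's `lakeformation` client looks like this; the role ARN, database and table names are placeholders.

```python
# Hypothetical grant: allow an analyst IAM role to run SELECT queries
# against one table in the Lake Formation data catalog.
grant_request = {
    "Principal": {
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/AnalystRole"
    },
    "Resource": {
        "Table": {"DatabaseName": "datalake_catalog", "Name": "clickstream"}
    },
    "Permissions": ["SELECT"],
}

# With AWS credentials configured, the actual call would be:
#   boto3.client("lakeformation").grant_permissions(**grant_request)
```

Centralizing grants like this in Lake Formation means query services such as Athena and Redshift Spectrum enforce the same table-level permissions without per-service policy work.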

Once this processed and transformed data is available, any analytics or machine learning service, such as Amazon EMR for Apache Spark, Amazon Redshift, Amazon Athena, Amazon QuickSight or Amazon SageMaker, can be used to draw meaningful insights from it.

-- This article was written under Atul's guidance.


