Infosys experts share their views on how digital is significantly impacting enterprises and consumers by redefining experiences, simplifying processes and pushing collaborative innovation to new levels

Privacy Next for AWS Data lake Security

An AWS data lake has a serverless architecture, with no EC2 instances to deploy or manage. It uses S3 for storage, and processing is done by a microservices layer written using AWS Lambda. However, building a data lake and making it the centralized repository for all assets raises the question of how to secure it.

Why should you protect your AWS Data Lake?

An AWS data lake is a great option for warehousing data from different sources for analytics or other purposes, but protecting it can be a big challenge. As the number of data stores grows, the challenge in data lake management is ensuring not just comprehensive management across all stores, but also authorized access and data governance.

An AWS data lake contains organization-wide data from multiple sources, including personally identifiable information (PII) and highly sensitive business data, so any breach can result in privacy intrusion or the compromise of crucial corporate intelligence. To prevent such damage, a data lake should meet high standards of security.

What are the key challenges faced by our customers?

Using a data lake as a centralized repository helps users access data from across the organization, but it creates a new challenge: protecting and isolating different classes of data from unauthorized users.

This leads to new security concerns, such as:

  • How do we ensure that a user is authorized for each data set accessed?
  • Are we sure that every computational environment accessing the data lake is secure and compliant with enterprise governance regulations?
  • How do we ensure consistent auditing of data usage?
  • Instead of relying on users to follow best practices, how can we create a policy-governed environment for the company's data?

How can we secure the AWS data lake?

  • Using authorization: You can manage access to your S3 resources using access policies (AWS Identity and Access Management). By default, all Amazon S3 resources are private: only the resource owner can access them. The resource owner can then grant access permissions to others by writing an access policy.
  • Using Server-Side Encryption (SSE): Amazon S3 server-side encryption uses the 256-bit Advanced Encryption Standard (AES-256) to encrypt your data at rest in the data lake. Each object is encrypted with a unique key, and that key is itself encrypted with a master key.
  • Using AWS KMS: KMS is a managed service for creating and controlling the encryption keys used to encrypt your data.
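The authorization and encryption options above can be combined in a single bucket policy. The following is a minimal sketch, not a definitive implementation: the bucket name, role ARN, and key alias are hypothetical placeholders, and the helper names `build_bucket_policy` and `sse_kms_upload_args` are introduced here only for illustration. It builds a policy that grants read access to one role and rejects uploads that do not request SSE-KMS, plus the keyword arguments an encrypted upload would need.

```python
import json


def build_bucket_policy(bucket: str, reader_role_arn: str) -> dict:
    """Build a bucket policy that lets one role read objects and
    denies any upload that does not request SSE-KMS encryption."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowAnalystRead",
                "Effect": "Allow",
                "Principal": {"AWS": reader_role_arn},
                "Action": ["s3:GetObject"],
                "Resource": f"{bucket_arn}/*",
            },
            {
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"{bucket_arn}/*",
                # Reject requests whose encryption header is not aws:kms.
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": "aws:kms"
                    }
                },
            },
        ],
    }


def sse_kms_upload_args(bucket: str, key: str, body: bytes, kms_key_id: str) -> dict:
    """Keyword arguments for an S3 put_object call that requests SSE-KMS."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ServerSideEncryption": "aws:kms",
        "SSEKMSKeyId": kms_key_id,
    }


if __name__ == "__main__":
    # Hypothetical bucket and role, for illustration only.
    policy = build_bucket_policy(
        "example-data-lake", "arn:aws:iam::123456789012:role/analyst"
    )
    print(json.dumps(policy, indent=2))
```

With boto3 installed and credentials configured, the policy could be applied with `boto3.client("s3").put_bucket_policy(Bucket=..., Policy=json.dumps(policy))`, and the upload arguments passed as `put_object(**args)`.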

How can iEDPS complement data lake security on AWS?

iEDPS can be used as a data protection solution. Data kept in a data lake comes from multiple sources in an organization. This data also contains PII, so data at rest needs to be safeguarded from unauthorized access. iEDPS offers 180+ algorithms that can be used to protect data at rest.

An AWS data lake architecture has three zones:

1. Data Ingestion Zone:

This is the area where all the raw data comes into the data lake from the different sources within the enterprise. No modelling or extraction should be done at this stage.

iEDPS can discover sensitive data before it is loaded into the data lake. Running iEDPS discovery on all data sources that feed the ingestion zone helps the customer learn what sensitive data those sources hold. Once it is identified, the customer can use static masking to mask the sensitive data before loading it into the S3 data lake.
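The discover-then-mask flow can be illustrated with a small generic sketch. This is not the iEDPS discovery engine or one of its algorithms: the regular expressions, the `discover`, `mask_value`, and `mask_rows` helpers, and the "keep the last four characters" rule are all assumptions chosen for illustration.

```python
import re

# Illustrative patterns only; a real discovery tool covers far more PII types.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def discover(rows):
    """Return {column: set of PII types found} for a list of dict rows."""
    findings = {}
    for row in rows:
        for column, value in row.items():
            for pii_type, pattern in PII_PATTERNS.items():
                if pattern.search(str(value)):
                    findings.setdefault(column, set()).add(pii_type)
    return findings


def mask_value(value):
    """Replace letters and digits with '*', keeping the last 4 characters.
    Values of 4 characters or fewer are masked entirely."""
    text = str(value)
    if len(text) <= 4:
        return "*" * len(text)
    head, tail = text[:-4], text[-4:]
    return re.sub(r"[A-Za-z0-9]", "*", head) + tail


def mask_rows(rows, sensitive_columns):
    """Return copies of the rows with the sensitive columns masked."""
    return [
        {c: mask_value(v) if c in sensitive_columns else v for c, v in row.items()}
        for row in rows
    ]


if __name__ == "__main__":
    source_rows = [
        {"name": "Ann", "contact": "ann@example.com", "ssn": "123-45-6789"},
    ]
    found = discover(source_rows)            # which columns hold PII
    masked = mask_rows(source_rows, found)   # mask them before loading to S3
    print(found)
    print(masked)
```

Because masking happens before the load, the raw values never reach the S3 ingestion zone; only the masked rows would be written.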

2. Processing Zone -- Catalogue and Search:

In the processing zone, we need to extract metadata and catalogue it to make the data usable for different applications. A Lambda function written to extract metadata eases this process.
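A metadata-extraction Lambda of this kind might look like the sketch below. It assumes the function is triggered by S3 object-created notifications and reads only fields that appear in the standard S3 event record; the `zone` and `format` fields, and the convention that the first path segment names the zone, are hypothetical. Persisting the entries to a catalogue store is left out.

```python
from urllib.parse import unquote_plus


def extract_metadata(record: dict) -> dict:
    """Pull basic catalogue metadata out of one S3 event record."""
    s3 = record["s3"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = unquote_plus(s3["object"]["key"])
    filename = key.rsplit("/", 1)[-1]
    return {
        "bucket": s3["bucket"]["name"],
        "key": key,
        "size_bytes": s3["object"].get("size", 0),
        "event_time": record.get("eventTime"),
        # Assumed convention: the first path segment names the zone.
        "zone": key.split("/", 1)[0],
        "format": filename.rsplit(".", 1)[-1].lower() if "." in filename else "unknown",
    }


def lambda_handler(event, context):
    """Lambda entry point: build one catalogue entry per S3 record."""
    entries = [extract_metadata(r) for r in event.get("Records", [])]
    # A real function would write the entries to a catalogue store here.
    return {"catalogued": len(entries), "entries": entries}


if __name__ == "__main__":
    sample_event = {
        "Records": [
            {
                "eventTime": "2020-01-01T00:00:00.000Z",
                "s3": {
                    "bucket": {"name": "example-data-lake"},
                    "object": {"key": "ingestion/customers/2020+report.csv", "size": 2048},
                },
            }
        ]
    }
    print(lambda_handler(sample_event, None))
```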

Once cataloguing is done, we can look at data processing solutions, which differ based on what each stakeholder wants from the data. Amazon EMR is a managed Hadoop service that can process large amounts of data at low cost.

iEDPS can provide a data privacy solution to stakeholders in this phase. The data in the data lake can be de-identified or obfuscated at high speed using the iEDPS core engine on Amazon EMR or any Hadoop cluster running in the cloud or on premises.
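As one generic illustration of de-identification, the sketch below replaces sensitive fields with keyed pseudonyms using HMAC-SHA256. This is a swapped-in textbook technique, not one of the iEDPS algorithms; the `pseudonymize` and `deidentify` helpers, the 16-character token length, and the sample secret are assumptions. The same input always maps to the same token, so de-identified data sets can still be joined on the masked field.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret: bytes) -> str:
    """Map a value to a stable 16-hex-character token using a keyed hash."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def deidentify(records, fields, secret):
    """Return copies of the records with the named fields pseudonymized."""
    return [
        {k: pseudonymize(str(v), secret) if k in fields else v for k, v in rec.items()}
        for rec in records
    ]


if __name__ == "__main__":
    # Hypothetical secret; in practice it would come from a key management service.
    secret = b"replace-with-a-managed-secret"
    records = [
        {"customer_id": "C-1001", "email": "ann@example.com", "spend": 120.5},
        {"customer_id": "C-1002", "email": "bob@example.com", "spend": 80.0},
    ]
    for row in deidentify(records, {"customer_id", "email"}, secret):
        print(row)
```

Each record is processed independently, so a function like this could be applied per record in a map step on a Hadoop or EMR cluster.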

3. Production Zone -- Serve Processed Data:

Once the data has been de-identified, the data lake is ready to push data out to all necessary applications and stakeholders. Data can go out to legacy applications, data warehouses, BI applications, and dashboards, where it can be accessed by analysts, data scientists, business users, and other automation and engagement platforms.


Data in a data lake is always at risk of exposure to unauthorized users or applications. Standard encryption techniques can fail if keys are compromised. If we mask or obfuscate the data residing in the data lake, the sensitive values stay protected even when encryption alone does not hold.

iEDPS provides the capability to de-identify data of different formats on the data lake. De-identification can be invoked on data objects through web services, through the standalone product, or in batch mode on Hadoop, depending on the data to be processed. iEDPS can also run on an AWS EMR/Hadoop cluster, which makes it more flexible for stakeholders to run iEDPS over data when required.

With many years of experience in the data protection segment and with data governance laws behind it, iEDPS is a solution you can count on for securing your data lakes.

Written By : Nashaa Taj
