

Data Discovery in Large Data Volumes


What is Large Data Volume, and how does data discovery in large data volumes help?

With the growth of social media and most businesses moving online, the amount of data being produced has risen significantly. For example, when we search for a product on an online shopping website, we later see ads and suggestions for that product. This happens because the search itself generates data, which is later used to serve relevant ads and suggestions. Data that is already huge in volume and still grows exponentially is referred to as Large Data Volume (LDV). By some estimates, the total amount of data will double every two years. The size and complexity of such data are so high that it is very difficult to process using traditional approaches.

Processed properly, a large volume of data can provide breakthroughs for companies in various fields. For example, when a customer enters a bank, a data analyst could use the available data to check the customer's profile and understand the customer's preferences. This would help the bank present relevant offers and products that the customer is likely to choose. Applied across all customers, this could raise the bank's revenue significantly.

Problems related to growing volumes of data and how privacy is compromised:

However, there is a tradeoff between data security and the growing volume of data. If sensitive information such as a customer's personal details, bank account number, or credit card details can be accessed by others and is shared, it is a data privacy breach. This can happen when an employee deliberately sells sensitive data to another company, when data is leaked on the internet unintentionally, or when data is shared with third-party companies for developing and testing applications. In each case the privacy of the data and of the customer is seriously compromised, all the more so because the customer is unaware of the breach.

What is data discovery in LDV?

To avoid this, data discovery can be performed on the data. Data discovery scans the data present in a system and identifies the sensitive information in it. Files containing sensitive information can then be protected by passwords or restricted access. If data needs to be given to a third-party company for testing purposes, the identified sensitive data can be masked so that testing can go on without any privacy violation. What counts as personal or sensitive data depends on the environment: in a bank, for example, discovery would identify credit card numbers, account numbers, and similar details in the files. Discovery based on artificial intelligence and machine learning goes beyond scanning just the metadata and examines the content itself.
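As a rough illustration of discovery followed by masking, here is a minimal Python sketch. The two regular expressions, the `discover` and `mask` helpers, and the sample record are all hypothetical and chosen only for this example; a real rule set would be tuned to the environment, and a pattern match alone does not prove that a value is sensitive.

```python
import re

# Hypothetical patterns for illustration only; real rules depend on the environment.
PATTERNS = {
    "credit_card": re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def discover(text):
    """Return a list of (label, matched_text) pairs found in text."""
    findings = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((label, match.group()))
    return findings

def mask(text, keep_last=4):
    """Replace every match with asterisks, keeping the last few characters."""
    def _mask(match):
        value = match.group()
        return "*" * (len(value) - keep_last) + value[-keep_last:]
    for pattern in PATTERNS.values():
        text = pattern.sub(_mask, text)
    return text

record = "Card 4111 1111 1111 1111 belongs to jane@example.com"
print(discover(record))  # what was found, and under which label
print(mask(record))      # same text, safe to hand over for testing
```

Keeping the last few characters visible is one common masking choice because testers can still tell records apart; other techniques replace the whole value.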

Data discovery in small data sets, and why is it difficult for large data sets?

Most data scientists and data analysts use Python for data preprocessing and model building. Commonly used libraries include pandas, NumPy, and scikit-learn. These libraries mostly run on a single machine, often on a single CPU core, and expect the data to fit in memory, so they do not scale well. They can fail on large datasets that do not fit into the available RAM, and the machine heats up and slows down in the process. To handle such large data volumes, Python offers libraries such as Vaex, Koalas, and Dask, of which Dask is, in our assessment, the most efficient.

Data discovery in large datasets using Dask:

Dask can efficiently perform parallel computations on a single machine using multi-core CPUs. It keeps the data on disk and loads it in chunks for processing, so less memory is used for computations. Intermediate values generated during processing are released once processing is done with them.
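The chunking idea can be shown without Dask at all. The sketch below uses only the Python standard library and is not how Dask is implemented; it just shows why reading a fixed number of lines at a time keeps memory use flat. The 10-digit account number format and the function names are assumptions made for this example.

```python
import io
import re
from itertools import islice

# Hypothetical format: an account number is exactly 10 digits.
ACCOUNT = re.compile(r"\b\d{10}\b")

def read_chunks(lines, chunk_size):
    """Yield lists of at most chunk_size items from any iterable of lines."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk

def count_sensitive(lines, chunk_size=2):
    """Count account-number matches while holding only one chunk at a time."""
    total = 0
    for chunk in read_chunks(lines, chunk_size):
        # Only the running total survives; the chunk is replaced on the next pass.
        total += sum(len(ACCOUNT.findall(line)) for line in chunk)
    return total

# io.StringIO stands in for a large file object returned by open(path).
sample = io.StringIO("acct 1234567890\nno data\nacct 0987654321 and 1111111111\n")
print(count_sensitive(sample))
```

Because a file object yields lines lazily, the same function works on a file far larger than RAM; only `chunk_size` lines are in memory at any moment.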

As an analogy, suppose there are 4 cards of different colors on a table, and the task is to separate them by color, with only one person allowed to work on them. If the number of cards is raised to 100 and then to 1000, the same task becomes difficult for a single person. If it is split among multiple people, it is completed easily. Discovery in large data volumes is similar: the individual working alone is pandas, NumPy, etc., whereas the people working together resemble Dask. Dask can also run on a cluster of machines to process the data efficiently, and if the machines in the cluster have different numbers of cores, Dask handles these variations internally. Since it supports the pandas DataFrame and NumPy array data structures, large data sets can be processed with negligible changes to the code.
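The card analogy maps onto a split, work, and merge pattern, sketched below with the standard library's `concurrent.futures`. This is not Dask code, and the function names are invented for the example. Note also that Python threads do not speed up CPU-bound work the way separate processes or a cluster would, so the sketch shows only the shape of the approach, not its performance.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_colors(pile):
    """One worker's share: tally the colors in its pile of cards."""
    return Counter(pile)

def sort_cards(cards, workers=4):
    """Split the cards into piles, tally each pile, then merge the tallies."""
    size = max(1, -(-len(cards) // workers))  # ceiling division, at least 1
    piles = [cards[i:i + size] for i in range(0, len(cards), size)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each pile is handed to a worker; results come back in pile order.
        for tally in pool.map(count_colors, piles):
            total += tally
    return total

cards = ["red", "blue", "green", "yellow"] * 250  # 1000 cards
print(sort_cards(cards))
```

A scheduler such as Dask plays the role of the person handing out piles: it decides how the work is split and where each piece runs, so the calling code stays close to the single-worker version.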

How is it handled in iEDPS?

Infosys Enterprise Data Privacy Suite (iEDPS) offers a solution to avoid data breaches by discovering sensitive information in both structured and unstructured data. Data discovery combines deterministic and probabilistic techniques with machine learning and a multi-threaded pipeline approach.
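The post does not say which deterministic checks iEDPS applies. As one widely known example of a deterministic rule, the Luhn checksum can confirm whether a string of digits is a plausible card number, which helps separate real candidates from random 16-digit values. The function name below is our own, and the sketch makes no claim about the iEDPS implementation.

```python
def luhn_valid(number):
    """Return True if the digits in number pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    checksum = 0
    # Walk from the rightmost digit, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        checksum += digit
    return checksum % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # a commonly used test card number
print(luhn_valid("4111 1111 1111 1112"))  # last digit changed
```

A deterministic rule like this gives the same answer every time for the same input, whereas a probabilistic technique would instead assign a likelihood that a value is sensitive.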

Python processes can hang or fail to complete while handling very large files. To avoid this, iEDPS leverages the Dask library so that data from large files is processed in less time and without failures. After discovery, the user can either view the sensitive information in the report or mask the original sensitive information in the files using the masking techniques provided by iEDPS.


About the Authors:

• Visakh Padmanabhan is a Systems Engineer - Python Developer on the iEDPS Data Discovery team.

• Gayathri Nadella is a Technology Lead - AI Professional on the iEDPS Data Discovery team.

 

 

 

 



