Infosys experts share their views on how digital is significantly impacting enterprises and consumers by redefining experiences, simplifying processes and pushing collaborative innovation to new levels


How to design data privacy controls for your legacy data?


Data privacy is now a necessity for every organization. Over time, an organization creates and then sits on a growing pile of data. For new technology, privacy by design is the better fit: privacy controls are built in while the product is being developed for the market. The problem arises when privacy controls must be applied to legacy data, which is still a nightmare for many organizations. To understand why this is so hard to digest, we first need to understand what legacy data is.


Legacy data is data that has been created over time and now lies unmanaged in old flat files, spread across different formats. It is often stored by legacy applications that have long passed their end-of-life (EOL), which means the vendor no longer supports them. It can be anything: images, charts, audit reports and so on. If it contains sensitive elements, the organization is bound to protect it.


Data Privacy Controls

Controlling data in accordance with privacy governance rules means implementing effective methods that protect it. These methods fall into the following categories:

  1. Identity and Access Management: Determine who can access what, and work on a least-privilege model.
  2. Data Loss Prevention: Guarding against any breach or leakage.
  3. Encryption & Pseudonymization: Encrypt or mask sensitive data when it is accessed.
  4. Incident Response Plan: An operative plan to recover or minimize the risk from any leakage or breach.
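To make the third control more concrete, here is a minimal sketch of pseudonymization using a keyed hash (HMAC-SHA256) from the Python standard library. This is only one possible technique and not something prescribed above; the function names, the field names and the way the key is handled are assumptions made for illustration.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Return a stable keyed pseudonym (hex string) for a sensitive value.

    The same value and key always give the same pseudonym, so records can
    still be joined, but the original value is not visible in the output.
    """
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def pseudonymize_record(record: dict, sensitive_fields: set, secret_key: bytes) -> dict:
    """Return a copy of the record with only the sensitive fields replaced."""
    return {
        field: pseudonymize(str(value), secret_key) if field in sensitive_fields else value
        for field, value in record.items()
    }


if __name__ == "__main__":
    # Hypothetical key and record, for illustration only. A real key would
    # come from a secrets store, never from source code.
    key = b"replace-with-a-managed-secret"
    employee = {"name": "Jane Doe", "department": "Finance"}
    print(pseudonymize_record(employee, {"name"}, key))
```

A keyed hash is used here instead of a plain hash so that someone without the key cannot rebuild the mapping by hashing guessed values.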

Data classification and protection are still a major challenge for all organizations, and the complexity rises many times over with legacy data. Because it comes in such a wide range of formats and storage types, no single method can handle all of it effectively.

One could argue that since legacy data is old and stored in redundant fragments across many places in the organization, we should simply get rid of it: purge it or shred it. It is not that simple, because many organizations still need this data today. One might also ask why, if the data is required, we cannot just migrate all of it to modern, manageable data stores where adhering to privacy governance laws would be much easier. That sounds appealing to the business, but it means spending significant resources and money with essentially no return, since this data is not sought on a day-to-day basis. Organizations are reluctant to spend that much time, money and effort.

So how should legacy data be dealt with, and how can an organization still comply with all of its data governance policies? The exact answer is still debatable; in other words, no single solution will fit every type of legacy data. We can, however, generalize an approach that eases the work of dealing with it.

How to deal with legacy data?

So far, we have understood the problem and what legacy data is. What approach should an organization follow to protect it? Data privacy control for legacy data is never straightforward. An organization's priority is more likely to be protecting the data than efficiently rearranging or re-managing it. Classifying legacy data into categories helps in managing it. The classification cannot be based on the organization's internal data ownership, such as employee data, survey data or finance data. For effective privacy control, we need to classify the data in terms of what it really is.


First, understand what legacy data is. We have already been through the definition; now we look at real-world examples, grouped by type, so that we can picture it clearly. Legacy data includes, but is not limited to, the following:

  • Flat Files: Like logs, text, xml, json etc.
  • Well-known document types: Like Word documents, Excel workbooks, presentations, PDFs etc.
  • Images & Charts: Any scanned documents and pictures.
  • Other Binary types: Data generated by third-party software that has long passed its end-of-life (EOL), which means the vendor no longer supports it.

All of the above data is unstructured, and applying data privacy controls to it is a real nightmare.
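As a rough illustration of this classification step, the sketch below groups files into the four categories listed above based on their file extension. The extension-to-category mapping and the category names are assumptions made for this example; a real inventory would likely also inspect file contents, since extensions on old files are not always reliable.

```python
from pathlib import Path

# Assumed mapping from file extension to the categories listed above.
CATEGORY_BY_EXTENSION = {
    ".log": "flat_file", ".txt": "flat_file", ".xml": "flat_file", ".json": "flat_file",
    ".doc": "known_document", ".docx": "known_document",
    ".xls": "known_document", ".xlsx": "known_document",
    ".ppt": "known_document", ".pptx": "known_document",
    ".pdf": "known_document",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".tif": "image", ".tiff": "image", ".bmp": "image", ".gif": "image",
}


def classify(path) -> str:
    """Return the category for a path; anything unrecognized is 'other_binary'."""
    suffix = Path(path).suffix.lower()
    return CATEGORY_BY_EXTENSION.get(suffix, "other_binary")


def group_by_category(paths) -> dict:
    """Group paths into {category: [path, ...]}."""
    groups = {}
    for path in paths:
        groups.setdefault(classify(path), []).append(str(path))
    return groups


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    sample = ["app.log", "audit/Report.PDF", "scan_001.tiff", "ledger.dat"]
    for category, files in group_by_category(sample).items():
        print(category, files)
```

Treating every unknown extension as "other_binary" is a deliberately cautious default: such files are kept in view for manual review instead of being silently skipped.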

Clearly, this cannot be achieved manually. We need to look in the market for an available utility or program that can help. One such utility is IEDPS, which stands for "Infosys Enterprise Data Privacy Suite". The suite offers much more in terms of data privacy; here we limit its use to the needs discussed so far.

It offers a capability called "Unstructured Data Discovery" (UDD), which scans the legacy data, finds PII and HRCI (highly restricted confidential information), and generates a report that lets you separate out the files and data containing sensitive information.

The scope of this program is limited to flat files and well-known document types. Other binary data generated by custom vendor software cannot be handled, because the binary format is specific to the client's requirements and cannot be read by any other tool.
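To give a feel for what discovery over flat files involves, here is a very small sketch that scans text for two kinds of sensitive values using regular expressions. It is not the IEDPS API and does not reflect how IEDPS works internally; the patterns, names and report shape are assumptions for illustration, and real discovery tools use far richer rules.

```python
import re

# Assumed, deliberately simple patterns. Real PII detection needs many more
# rules, validation, and context checks to keep false positives down.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan_text(text: str) -> dict:
    """Return {pattern_name: [matches]} for each pattern found in the text."""
    findings = {}
    for name, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            findings[name] = matches
    return findings


def scan_files(paths) -> dict:
    """Return {path: findings} for only those files that contain matches."""
    report = {}
    for path in paths:
        # errors="replace" keeps the scan going on files with odd encodings.
        with open(path, "r", encoding="utf-8", errors="replace") as handle:
            findings = scan_text(handle.read())
        if findings:
            report[str(path)] = findings
    return report


if __name__ == "__main__":
    sample = "Contact bob@example.com, SSN 123-45-6789, ticket 42."
    print(scan_text(sample))
```

The report only lists files with findings, which mirrors the idea of separating files that contain sensitive information from those that do not.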



Many other competing tools are available in the market, and the choice depends entirely on an organization's requirements. Nowadays every organization creates all sorts of data, both structured and unstructured, and handling every format is a hectic job. IEDPS provides a wide range of database connectors, which makes this easier. Please note that this is subjective: no tool can support all of an organization's data types, because many organizations develop their own database engines to meet custom requirements, and their API interfaces are limited. IEDPS has been in the market for over a decade and keeps improving, adding emerging technologies to its data privacy tools.



So, we have seen how to define privacy controls on legacy data, and the challenges that come with it. To summarize: first, classify the data by file type category. Then run a privacy assessment tool on these data sets to find which ones contain sensitive data, and handle those accordingly. We also discussed a tool, IEDPS, which assists in discovering and marking sensitive fields.

Author: Amit Sinha
