Infosys experts share their views on how digital is significantly impacting enterprises and consumers by redefining experiences, simplifying processes and pushing collaborative innovation to new levels

March 15, 2020

Understanding the Art and Science of Data Lineage - Key ingredient of Privacy by Design

Imagine entering a museum and looking at one of the exhibits - an object of historic importance which could be a few centuries old. The history of ownership and identity of the object determines its value and authenticity.

This example of a museum exhibit highlights the importance of history and authenticity, and even of context and accuracy. Can we apply the same to data in an enterprise-wide business landscape? Yes, and the converse holds as well: a lack of visibility into the source or origin of data can escalate into expensive data privacy breaches and compliance issues, causing lasting damage to the brand image of the organization.

What is Data Lineage?

Data lineage is the process of understanding, documenting and visualizing data from its origin to its consumption. In today's highly regulated privacy landscape, tracking Data Lineage is critical: many companies are required to understand how data flows through different systems inside and outside their organization, and to comply with strict regulatory frameworks such as CCPA, GDPR and other privacy regulations.

When we talk about tracking lineage, the immediate thought is metadata of tables, columns and reports. But lineage is more about re-imagining the business traceability of your data as a living organism: it is created in one of your applications, grows and transforms in one or more processes, thrives in multiple data sources and is finally deleted or archived in your data store. Tracking this lifecycle together with its business context is what constitutes data lineage.

Any application tracking the lineage should focus on answering the 5W of data lineage:

W1: Who is using the data?

W2: What does the data mean (to a user, to an auditor, to a BI analyst, to an enterprise architect)?

W3: Where does it exist and where does it come from?

W4: When was it captured and how did it change over time?

W5: How is it being used now (or with respect to a specific time frame)?
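
To make the five questions concrete, here is a minimal Python sketch of how a lineage record could carry one answer per question. It is only an illustration: the `LineageEvent` schema, the `five_w` helper and the sample systems (`crm_db`, `warehouse`, `dashboard`) are assumptions made for this example, not part of any particular lineage tool.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass(frozen=True)
class LineageEvent:
    """One recorded touch of a data asset (hypothetical schema)."""
    dataset: str             # logical name of the data asset
    actor: str               # W1: who used the data
    meaning: str             # W2: what the data means to that actor
    location: str            # W3: where the data exists
    source: Optional[str]    # W3: where it came from (None at the origin)
    occurred_at: datetime    # W4: when it was captured or changed
    usage: str               # W5: how it was used


def five_w(events: List[LineageEvent], dataset: str) -> Dict[str, list]:
    """Summarize the five questions for one dataset, in time order."""
    rows = sorted(
        (e for e in events if e.dataset == dataset),
        key=lambda e: e.occurred_at,
    )
    return {
        "who": sorted({e.actor for e in rows}),
        "what": sorted({e.meaning for e in rows}),
        "where": [e.location for e in rows],
        "origin": [e.location for e in rows if e.source is None],
        "when": [e.occurred_at.isoformat() for e in rows],
        "how": [e.usage for e in rows],
    }


# Illustrative sample data: an email address captured in a CRM,
# copied to a warehouse and later read in a report.
events = [
    LineageEvent("customer_email", "crm_app", "contact address",
                 "crm_db", None, datetime(2020, 1, 5), "captured at signup"),
    LineageEvent("customer_email", "etl_job", "contact address",
                 "warehouse", "crm_db", datetime(2020, 1, 6),
                 "copied for reporting"),
    LineageEvent("customer_email", "bi_analyst", "campaign target",
                 "dashboard", "warehouse", datetime(2020, 2, 1),
                 "read in campaign report"),
]

summary = five_w(events, "customer_email")
for question, answer in summary.items():
    print(question, "->", answer)
```

The design choice here is to record lineage as a stream of events rather than a single static description, so that the "when" and "how did it change over time" questions can be answered by sorting and filtering the same records.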

Why now? - Significance of Data Lineage

Data Lineage empowers organizations with a clear understanding of where the data comes from, who uses it and what it is being used for. A recent O'Reilly survey on the state of data quality in 2020 indicates that only 20% of organizations publish clear data lineage policies in their internal guidelines. As more regulations such as GDPR and CCPA come into force and the amount of data collected continues to rise, a world without data lineage is going to be chaotic and a compliance nightmare.

Another common problem for organizations is that there are too many data sources. This forces teams to manually define, catalog and integrate data sources, and it might take months to document the key data flows. This can delay business decision-making and leave less time for strategic initiatives, creating inefficiency across the complete value chain.

Key Practical Applications of Data Lineage

Data lineage not only improves efficiency and accelerates time to insight, but also helps an organization bolster its regulatory compliance. In today's world, there are three key areas where we see the need for Data Lineage evolving:

  1. Privacy by Design - From GDPR to CCPA, governments across the world are passing data regulations to safeguard both enterprises and their customers in an increasingly data-driven world. But we feel companies are only catching up - in 2019, around 5 billion dollars were paid in fines due to improper handling of customer data. This is one of the key areas where Data Lineage will evolve in the next few years, as a key component of Privacy by Design that enables an organization to meet compliance and regulatory standards.
  2. Cost Optimization - With so much focus on migrating from legacy to modern data sources, there is a need for cost optimization through lineage. Attribute-level lineage will be further enhanced to capture transformation details of the processes. This will give a cross-section of the cost incurred in the target data source when scripts, processes, and triggers are moved from the source.
  3. Data Ops - Optimizing code is not enough; there will be a strong focus on optimizing data pipelines to reduce the cycle time of data governance. Agile development practices will be brought into data governance so that data teams and users can collaborate and work more effectively.
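
The attribute-level lineage mentioned above can be pictured as a small graph whose edges carry the transformation applied between two attributes. The following Python sketch shows one possible shape for it, under stated assumptions: the `AttributeLineage` class, the attribute names and the SQL-like transformation strings are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque
from typing import Dict, List, Set, Tuple

# One edge: (source attribute, target attribute, transformation applied)
Edge = Tuple[str, str, str]
Links = Dict[str, List[Tuple[str, str]]]


class AttributeLineage:
    """Attribute-level lineage graph with transformation details (sketch)."""

    def __init__(self, edges: List[Edge]) -> None:
        self._down: Links = defaultdict(list)
        self._up: Links = defaultdict(list)
        for src, dst, transform in edges:
            self._down[src].append((dst, transform))
            self._up[dst].append((src, transform))

    @staticmethod
    def _walk(start: str, graph: Links) -> Set[str]:
        # Breadth-first traversal; the seen set also guards against cycles.
        seen = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt, _ in graph.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen - {start}

    def downstream(self, attribute: str) -> Set[str]:
        """Every attribute derived, directly or indirectly, from this one."""
        return self._walk(attribute, self._down)

    def upstream(self, attribute: str) -> Set[str]:
        """Every attribute this one was derived from."""
        return self._walk(attribute, self._up)

    def transformations_into(self, attribute: str) -> List[Tuple[str, str]]:
        """Direct (source, transformation) pairs feeding this attribute."""
        return list(self._up.get(attribute, []))


# Illustrative edges only; names and transformations are made up.
lineage = AttributeLineage([
    ("crm.customers.email", "staging.customers.email_lower", "LOWER(email)"),
    ("staging.customers.email_lower", "warehouse.dim_customer.email_hash",
     "SHA256(email_lower)"),
    ("warehouse.dim_customer.email_hash", "reports.campaign.audience_key",
     "copy"),
    ("crm.customers.signup_date", "warehouse.dim_customer.tenure_days",
     "DATEDIFF(today, signup_date)"),
])

# Privacy question: where does a customer's email end up?
print(sorted(lineage.downstream("crm.customers.email")))
```

A downstream walk of this kind supports the privacy use case (finding every place a piece of personal data has travelled), while the transformation details on each edge are what a migration team would inspect when estimating the cost of moving scripts and processes to a new target.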

Through strong data lineage, enterprises can take control of their data, protect their end users' privacy, meet compliance requirements and justify the value of their data.

