Pragmatic Data Quality Approach for a Data Lake
On 26 October 2016, we presented our thought paper at the PPDM conference (Calgary Data Management Symposium, Tradeshow & AGM), hosted at the Telus Spark Science Centre in Calgary.
With the increase in the amount of data produced from sensors, devices, interactions, and transactions, ensuring ongoing data quality is a significant task and concern for most E&P companies. As a result, most of the systems that are sources of data have deferred the task of data clean-up and quality improvement to the point of usage. Within the Big Data world, the concept of the Data Lake, which allows ingesting all types of data from source systems without worrying about the type or quality of the data, further complicates data quality because data structure and usage are left to the consumer. Without a consistent governance framework and a set of common rules for data quality, a Data Lake may quickly turn into a Data Swamp. This paper examines the important aspects of data quality within the Upstream Big Data context and proposes a balanced approach to data quality assurance across data ingestion and data usage, to improve data confidence and readiness for downstream analytical efforts.
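The balance between quality assurance at ingestion and at usage can be sketched as follows: rather than rejecting or cleansing records on the way into the lake, each record is tagged with quality flags at the point of usage, and each consumer decides what "fit for purpose" means. This is a minimal illustrative sketch; the field names, rule names, and thresholds are assumptions, not taken from the paper.

```python
# Sketch: assess quality at the point of usage instead of dropping data at ingestion.
# Field names and rules below are illustrative assumptions.

RAW_RECORDS = [
    {"well_id": "W-001", "total_depth_m": 3200.0},
    {"well_id": "W-002", "total_depth_m": -5.0},   # suspect sensor reading
    {"well_id": None,    "total_depth_m": 2100.0}, # missing identifier
]

def assess(record):
    """Return the record plus a list of quality issues; never drop the data itself."""
    issues = []
    if record["well_id"] is None:
        issues.append("missing_well_id")
    if record["total_depth_m"] is not None and record["total_depth_m"] <= 0:
        issues.append("nonpositive_depth")
    return {**record, "quality_issues": issues}

assessed = [assess(r) for r in RAW_RECORDS]

# A given consumer applies its own notion of "clean"; the lake keeps everything.
clean = [r for r in assessed if not r["quality_issues"]]
print(len(clean))
```

Because the raw records are never mutated or discarded, a different consumer with looser rules can still see all three records.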
The key points we presented were:
1. Data quality is NOT about transforming or cleansing the data to fit a perspective; instead, it is about applying the right perspective to the data.
2. Data by itself is neither good nor bad; it is just data, pure in its most granular form.
3. Quality is determined by the perspective through which we look at the same data
4. An architectural approach that abstracts the data from perspectives or standards and builds a layer of semantics to view the same data from different points of view. We do not need to populate data into models (PPDM, PODS, etc.); instead, we put models on top of the existing data, promoting the paradigm of "ME and WE", where each consumer has their own view of the same data. The concept of the WELL can be viewed in reference to Completion, Production, Exploration, etc., without duplicating the data in the data lake.
5. Deliver quick value to the business and build their trust in the data within the data lake.
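The "models on top of the data" idea in point 4 can be sketched with SQL views: one raw well table in the lake, and a separate view per consumer perspective (Completion, Production), each reading the same rows without copying them. This is a minimal sketch using an in-memory SQLite database; the table, view, and column names are hypothetical and are not drawn from the PPDM or PODS models.

```python
import sqlite3

# Sketch: one raw well dataset, multiple perspective views, no duplication.
# All names below are illustrative assumptions, not PPDM/PODS structures.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_well (
    well_id TEXT, spud_date TEXT, total_depth_m REAL,
    completion_type TEXT, daily_oil_bbl REAL
);
INSERT INTO raw_well VALUES
    ('W-001', '2015-03-10', 3200.0, 'hydraulic_fracture', 850.0),
    ('W-002', '2016-01-22', 2750.0, 'open_hole',          410.0);

-- Completion perspective: only the columns a completions engineer needs.
CREATE VIEW well_completion AS
    SELECT well_id, completion_type, total_depth_m FROM raw_well;

-- Production perspective: the same rows through a different lens.
CREATE VIEW well_production AS
    SELECT well_id, daily_oil_bbl FROM raw_well;
""")

completion = conn.execute("SELECT * FROM well_completion").fetchall()
production = conn.execute("SELECT * FROM well_production").fetchall()
print(completion)
print(production)
```

Each view is just a stored query, so adding an Exploration perspective later means adding one more view, not another copy of the data.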
Please refer to the link below for details.