Ensure the Quality of Data Ingested for True Insights
Author: Naju D. Mohan, Delivery Manager
I sometimes wonder whether it is the human craze for collecting things that is driving organizations to pile up huge volumes of diverse data at unimaginable speeds. Amid this rush to accumulate data, the inability to derive value from the heap is causing a fair amount of pain and a lot of stress for business and IT.
Quality of incoming data and traditional data storages
In the world of the traditional data warehouse, we mainly dealt with structured data. Most of the time, the data was extracted from operational systems and transformed to adhere to business rules. For analytical purposes or further downstream processing, this data was loaded into a data warehouse. Terabytes of data were considered a huge volume in the data warehouse world. In enterprise data warehouses, operational data stores, and other traditional data storage mechanisms, data quality checks started at the point of data entry. This was justified because the business impact of bad data was far more severe than the cost of cleansing the data during acquisition.
Today, we are talking about zettabytes of data, and this doubles almost every 1.2 years. With this massive increase in the volume and velocity of data generation, the practice of transforming and cleansing the data at the point of import has been compromised. We have various business situations where we acquire data through batch processes, through real-time data acquisition mechanisms, or even by streaming data into the big data ecosystem.
Is incoming data quality still sacrosanct for the modern big data systems?
Big data ecosystems do not necessarily require record-by-record data quality validation during data ingestion. Let us take a sneak peek at the three primary ways to ingest data into a big data ecosystem and the data quality checks to be executed for safeguarding the business value derived from data at rest as well as from data in motion.
Batch data ingestion
The data from various source systems is typically available as files. These files can vary in type: text, binary, image, etc. Once these source files are available, they can be ingested into the big data system with or without transformation at the point of ingest. This is an efficient way of processing huge volumes of data accumulated over a period of time, which is typically the case in big data implementations.
Real-time data ingestion
Real-time data ingestion is not about storing and accumulating data and later batch processing it to move it into the big data system. Rather, it deals with moving data into the big data system as and when it arrives. Just as there is an argument that there is no such thing as totally unstructured data, there is a similar saying that there is no such thing as pure real-time data ingestion. Real-time data ingestion stresses the fact that the data is ingested in the present and not in the future. What counts as real-time varies from an online retailer to a Wall Street broker to aircraft controls.
Streaming data ingestion
This form of data ingestion is very similar to real-time data ingestion, but here the data is processed as a continuous incoming flow. The data keeps flowing in, and insights are generated as it flows. This is often necessitated by businesses that want to move away from the paradigm of just reporting an incident, to predicting events, and ultimately to changing the outcomes.
Quality validation to avoid data ingestion becoming data indigestion
Each batch carries a huge volume of data, and hence any failure to ensure the quality of the incoming batch data would wreak havoc as volumes accumulate in future batches. Listed below are a few recurring data quality issues that I have observed in batch data ingestion.
- Failure of a few jobs in the entire incoming data flow could impact data quality. It has to be checked whether the entire batch data has to be discarded or if there are selective approaches to process the data.
- Validate the methods adopted for data acquisition and storage, since moving the same data from a relational database to a file and finally into a NoSQL data store could cause data corruption.
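The two batch checks above can be sketched roughly as follows. This is a minimal illustration and not a prescribed implementation: `JobResult`, `triage_batch`, `fingerprint`, and the 20% failure threshold are all hypothetical names and values chosen for the example. The fingerprint compares records read back from two storage layers without depending on row order.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class JobResult:
    name: str        # hypothetical job identifier
    succeeded: bool  # whether the ingestion job completed


def triage_batch(jobs, max_failed_ratio=0.2):
    """Decide whether to discard the whole batch or process it selectively.

    The threshold is an assumed policy value; a real pipeline would take it
    from business rules.
    """
    if not jobs:
        return "discard", []
    failed = [j for j in jobs if not j.succeeded]
    if len(failed) / len(jobs) > max_failed_ratio:
        return "discard", []
    # Selective approach: keep only the output of jobs that succeeded.
    return "process", [j.name for j in jobs if j.succeeded]


def fingerprint(rows):
    """Order-independent digest of a collection of records.

    Comparing the digest of the source rows with the digest of the rows read
    back from the target store flags value or type changes introduced while
    the data moved between storage formats.
    """
    row_digests = sorted(
        hashlib.sha256(repr(tuple(row)).encode("utf-8")).hexdigest()
        for row in rows
    )
    return hashlib.sha256("".join(row_digests).encode("utf-8")).hexdigest()
```

Because the digest is built from `repr`, a value that arrives as the string `"1"` instead of the integer `1` produces a different fingerprint, which is usually the kind of silent change worth catching.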
Business often demands quick and smart data analytics. This necessitates real-time data ingestion, which often deals with massive volumes of data that degrades in value if not consumed quickly. Most of these situations demand data transformation and enrichment before the data is loaded into the big data systems. To avoid data quality degradation during real-time data ingestion, the following common data quality validations have to be performed:
- Detailed traceability tests back to source systems become cumbersome and cost-ineffective because of data transformations. This calls for statistical validations to ensure error-free data movement into the big data landscape.
- Data duplication can happen when data accumulates from various sources. It is necessary to ensure that data de-duplication validations have been performed.
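As a rough illustration of the two validations above, the sketch below reconciles aggregate statistics between source and loaded data instead of tracing every record, and drops duplicates by a business key. The function names, the choice of count and sum as the statistics, and the `key_fields` parameter are assumptions made for this example.

```python
import math


def reconcile(source_values, loaded_values, rel_tol=1e-9):
    """Compare simple aggregates of a numeric column before and after loading.

    Count and sum are only two possible statistics; which ones are meaningful
    depends on the transformations applied in between.
    """
    return {
        "count_match": len(source_values) == len(loaded_values),
        "sum_match": math.isclose(
            math.fsum(source_values), math.fsum(loaded_values), rel_tol=rel_tol
        ),
    }


def deduplicate(records, key_fields):
    """Keep the first record seen for each key and drop later duplicates."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record[field] for field in key_fields)
        if key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique
```

Keeping the first occurrence is a design choice for this sketch; some pipelines would rather keep the most recent record or merge fields from all duplicates.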
Sometimes, business decisions are made based on log data from various sources or event-based data from various systems, which necessitates streaming data ingestion. The presence of even a minute error in the incoming stream of data would impact the real-time dashboards, related analytics, and operations. Commonly used data validation approaches to address the frequent data quality issues encountered in streaming data ingestion are listed below.
- The format of incoming streaming data should be easily comparable with existing historical data for meaningful insights. Validate the data format of various streams to avoid data misrepresentation.
- Business rules like calculation of running averages in the incoming data stream have to be validated.
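The streaming checks above might look something like the sketch below. The event schema (`sensor_id`, `reading`, `ts`) is invented for illustration and stands in for whatever format the historical data uses; `conforms`, `RunningAverage`, and `validate_running_average` are likewise hypothetical names.

```python
import math
from datetime import datetime

# Assumed event layout, standing in for the format of the historical data.
EXPECTED_SCHEMA = {"sensor_id": str, "reading": float, "ts": str}


def conforms(event):
    """Check that an event has exactly the expected fields and types."""
    if set(event) != set(EXPECTED_SCHEMA):
        return False
    if not all(isinstance(event[k], t) for k, t in EXPECTED_SCHEMA.items()):
        return False
    try:
        # The timestamp is assumed to be an ISO 8601 string.
        datetime.fromisoformat(event["ts"])
    except ValueError:
        return False
    return True


class RunningAverage:
    """Incrementally maintained mean of the values seen so far."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


def validate_running_average(values, reported, abs_tol=1e-9):
    """Recompute the running average and compare it with the reported one."""
    if len(values) != len(reported):
        return False
    tracker = RunningAverage()
    return all(
        math.isclose(tracker.update(v), r, abs_tol=abs_tol)
        for v, r in zip(values, reported)
    )
```

The type check is deliberately strict, so an integer reading such as `21` is rejected; a real pipeline may prefer to coerce such values before comparing them with historical data.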
We reap what we sow and hence, the acceptability of big data insights would primarily depend on the quality of data used for deriving those insights. We have to walk through the customer journey and decide the strategy to validate the incoming data quality along with the velocity to process the incoming data. In summary, we should collect and process only that data which drives action and leads us in the right direction.