What will you do about a Data Virus?
Data Virus - data that has been purposefully manipulated to render operations on an entire data set flawed, and that perpetuates the error it induces.
Large-scale information pollution will become a massive problem in the next five years. As practitioners in the space, we must provide tools and capabilities not only to know when our data is polluted or infected, but also to roll the pollution back out. Data quality certifications are great, but they don't fix the in situ data that your enterprise has already ingested; they just provide a level of confidence about the integrity of the data when you get it.
Information exchange and usage will become easier, thanks in part to the budgets allocated to Big Data initiatives. This lower barrier to entry will promote promiscuous data exchange, exposing organizations to the risk of information pollution. Business users will be especially susceptible, since they want to play around with data and may endanger their enterprise in the process.
Scenarios in which there is an explicit attempt to corrupt the quality of data (i.e., a data virus), especially through minute adjustments, will be particularly damaging. These small adjustments may slip through the standard statistical techniques that we use to help guard our systems' integrity.
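To make the point about minute adjustments concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the readings are randomly generated, the 1% nudge and the three-standard-deviation threshold are arbitrary, and zscore_outliers is a hypothetical helper rather than any particular product's check. It shows a per-record outlier test flagging almost nothing while the average of the whole data set quietly shifts.

```python
import random
import statistics


def zscore_outliers(values, baseline_mean, baseline_std, threshold=3.0):
    """Return the values lying more than `threshold` standard deviations
    from the baseline mean (a common per-record sanity check)."""
    return [v for v in values if abs(v - baseline_mean) / baseline_std > threshold]


# Hypothetical "clean" readings; the seed only keeps the run repeatable.
rng = random.Random(42)
clean = [rng.gauss(100.0, 5.0) for _ in range(1000)]

# Assumed attack: every reading is nudged upward by 1%.
tampered = [v * 1.01 for v in clean]

baseline_mean = statistics.mean(clean)
baseline_std = statistics.stdev(clean)

flagged = zscore_outliers(tampered, baseline_mean, baseline_std)
drift = statistics.mean(tampered) - baseline_mean

print(f"records flagged as outliers: {len(flagged)} of {len(tampered)}")
print(f"shift in the data set mean:  {drift:.3f}")
```

Only a handful of records, if any, cross the threshold, yet every downstream calculation that relies on the mean is now biased by about 1%. Catching this kind of manipulation would need checks on the distribution as a whole, not just on individual records.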
Imagine a scenario in which a supply chain is being fed synthetic data. This synthetic data happens to be composed of weather information, part failure rates, and weather-biased structural integrity models. At some point, it is learned that the weather information has errors in it. How does one go about containing the information and data? How does one go to the downstream MRP-II and workforce management systems that have consumed the synthetic data and undo its effects? How do all the revenue forecasting models get adjusted?
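One prerequisite for answering these questions is knowing which systems consumed the bad data in the first place. The sketch below assumes the enterprise keeps some record of data lineage, modeled here as a plain Python dictionary mapping each data set to its direct consumers. The names (weather_feed, mrp_ii, and so on) and the downstream_of helper are hypothetical, chosen only to mirror the scenario above.

```python
from collections import deque

# Hypothetical lineage map: each data set -> systems that consume it directly.
LINEAGE = {
    "weather_feed": ["synthetic_supply_data", "structural_integrity_models"],
    "part_failure_rates": ["synthetic_supply_data"],
    "structural_integrity_models": ["synthetic_supply_data"],
    "synthetic_supply_data": ["mrp_ii", "workforce_mgmt"],
    "mrp_ii": ["revenue_forecast"],
    "workforce_mgmt": ["revenue_forecast"],
    "revenue_forecast": [],
}


def downstream_of(source, lineage):
    """Breadth-first walk returning every data set or system that consumed
    `source`, directly or indirectly, in the order it was reached."""
    visited = {source}
    affected = []
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for consumer in lineage.get(node, []):
            if consumer not in visited:
                visited.add(consumer)
                affected.append(consumer)
                queue.append(consumer)
    return affected


for name in downstream_of("weather_feed", LINEAGE):
    print("needs review:", name)
```

This only identifies the blast radius. Actually undoing the effects in each affected system is the harder problem, and it depends on whether those systems kept enough history to be recomputed from corrected inputs.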
Techniques for detecting a data virus and for containing polluted data will be explored in subsequent posts. This is an exciting and important topic for all enterprises in this Big Data world.