The Infosys Labs research blog tracks trends in technology with a focus on applied research in Information and Communication Technology (ICT)


What will you do about a Data Virus?

Data Virus - data that has been purposefully manipulated to render operations on an entire data set flawed, and that perpetuates the error it induces

Large-scale information pollution will become a massive problem within the next five years. As practitioners in the space, we must provide tools and capabilities not only to detect when our data is polluted or infected, but also to roll the pollution back out.  Data quality certifications are great, but they don't fix the in situ data that your enterprise has already ingested; they only provide a level of confidence about the integrity of the data at the time you receive it.

Information exchange and usage will become easier, thanks in part to the budgets allocated to Big Data initiatives.  This lower barrier to entry will promote promiscuous data exchange, exposing organizations to the risk of information pollution.  Business users will be especially susceptible, because they want to experiment freely with data and may endanger their enterprise in the process.

Scenarios in which there is an explicit attempt to corrupt the quality of data (e.g. a data virus), particularly through minute adjustments, will be especially damaging.  These small adjustments may slip past the standard statistical techniques that we use to guard the integrity of our systems.
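To illustrate how a minute adjustment can evade a standard check, here is a minimal sketch in Python. It assumes a simple per-record z-score screen with a threshold of 3, which is only one common choice and not necessarily what any given enterprise uses. The failure-rate figures and the function name `zscore_outliers` are invented for illustration.

```python
from statistics import mean, stdev

def zscore_outliers(values, baseline, threshold=3.0):
    """Return indices of values whose z-score against the baseline exceeds threshold."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Hypothetical trusted part failure rates (percent); illustrative numbers only.
baseline = [2.0, 2.1, 1.9, 2.2, 1.8, 2.0, 2.1, 1.9, 2.0, 2.0]

# A "data virus" nudges every reading up by a small, uniform amount.
tampered = [v + 0.1 for v in baseline]

flagged = zscore_outliers(tampered, baseline)   # records the screen would reject
shift = mean(tampered) - mean(baseline)         # bias introduced into the data set

print("flagged records:", flagged)
print("shift in mean:", round(shift, 3))
```

In this sketch no individual record is flagged, yet the mean of the whole data set has moved. Any model that consumes the aggregate inherits the bias, which is why record-level screening alone is a weak defense against this kind of manipulation.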

Imagine a scenario in which a supply chain is being fed synthetic data.  This synthetic data is composed of weather information, part failure rates, and weather-biased structural integrity models.  At some point it is learned that the weather information contains errors.  How does one go about containing the affected information and data?  How does one go to the downstream MRP-II systems and workforce management systems that have consumed the synthetic data and undo its effects? How do all the revenue forecasting models get adjusted?
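One possible starting point for the containment question is to record which systems consume which data sets, and then walk that record outward from the polluted source. The sketch below assumes such a lineage map exists, which is a significant assumption in itself. The node names and the function `downstream_of` are hypothetical, chosen to mirror the scenario above, and the traversal is an ordinary breadth-first search rather than any established containment product or method.

```python
from collections import deque

def downstream_of(lineage, source):
    """Breadth-first walk returning every consumer reachable from source."""
    seen = set()
    order = []
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for consumer in lineage.get(node, []):
            if consumer not in seen:
                seen.add(consumer)
                order.append(consumer)
                queue.append(consumer)
    return order

# Hypothetical lineage map: each key feeds the systems listed as its value.
lineage = {
    "weather_feed": ["synthetic_supply_data", "structural_integrity_model"],
    "structural_integrity_model": ["synthetic_supply_data"],
    "part_failure_rates": ["synthetic_supply_data"],
    "synthetic_supply_data": ["mrp2_system", "workforce_management"],
    "mrp2_system": ["revenue_forecast"],
    "workforce_management": ["revenue_forecast"],
}

affected = downstream_of(lineage, "weather_feed")
print(affected)
```

The result lists everything that would need review or rollback if the weather feed were found to be polluted. It identifies the blast radius only; actually undoing the consumed data in each system is a separate and harder problem.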

Techniques for detecting a data virus and for containing polluted data will be explored in subsequent posts.  This is an exciting and important topic for all enterprises in this Big Data world.
