Clients are, or soon will be, ingesting all sorts of data thanks to information brokerages and the Internet of Things (IoT), and processing that data in novel ways thanks to the Big Data movement and Advanced Analytics. Decisions made through business intelligence systems require that the data being used is trusted and of good quality. How will companies ensure that the data being ingested and acted upon is untainted? This has been an interest of mine as I work to protect the integrity of my clients' decision-making processes and systems.
Last year I shared a forward-looking concern about the concept of a data virus: data that has been purposefully manipulated so that operations on an entire data set become flawed, and the induced error perpetuates itself. As noted in the original "What will you do about a Data Virus?" blog, a tricky situation arises when data fed into the enterprise is determined to be corrupted. How do you unwind all the downstream systems that have made decisions based on the bad data? Containing this data contamination is tricky. Many legacy enterprise systems simply don't have the ability to "roll back" or "undo" decisions and/or persisted synthetic information. So, the first and obvious line of defense is blocking, or sequestering, suspect data before it enters the enterprise. Much as a network firewall blocks suspect requests to ports or machines in your network, a similar concept (a Data Virus Guard, if you will) can be employed in many situations as a first line of defense.
Please keep in mind that my focus has been on streaming sources of data. These are typically sensor based (maybe a velocity reading, or temperature, or humidity, or ambient light, ...), are associated with a thing (a car, train, or airplane, for example), and arrive for processing in a streaming manner. What I'm sharing in this blog could also be applied to other kinds of "streaming" things, such as feeds from Social Web systems.
What is a Data Virus Guard?
A Data Virus Guard is a logical unit that has the responsibility of identifying, annotating, and dealing with suspicious data.
Where should a Data Virus Guard be deployed?
A Data Virus Guard should be deployed at the initial ingestion edge of your data processing system, within the data capture construct. The data capture sub-system normally has the responsibility of filtering for missing data, tagging, and/or annotating anyway, so it is the perfect location to deploy the Data Virus Guard capability. If you identify and contain suspect data at the "edge", then you run less risk of it contaminating your enterprise.
How do you Identify a Data Virus?
This area of the Data Virus Guard is what drew my research interest: how do you go about discerning between normal data and data that has been manipulated in some way? The approach that I've been taking focuses on steady state data flows because I'm interested in a generalized solution, one that can work in most cases. If one can discern what constitutes steady state, then deviations from steady state can be used as a trigger for action. More elaborate, case-specific identification approaches can be created and placed easily within the framework I'm proposing.
What kind of Annotation do you do?
As data enters an enterprise, ideally there is metadata that helps with maintaining data lineage. That is, what was the source system that produced the data, what is the "quality" of the data, when was the data generated, when did the data enter the enterprise, is it synthetic (computed versus a sensor reading), and so on. Added to this could be an annotation that indicates which Data Virus Guard algorithm was applied (model, version), and the resulting suspicion score.
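To make the annotation idea concrete, here is a minimal sketch of what such a record might look like. Every name in it (`Reading`, `GuardAnnotation`, `annotate`, the field names) is hypothetical, and the 0.0 to 1.0 suspicion scale is my assumption rather than a standard.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, List


@dataclass(frozen=True)
class GuardAnnotation:
    """Which guard algorithm scored the reading, and how suspicious it looked."""
    model: str
    version: str
    suspicion: float  # assumed scale: 0.0 (looks normal) to 1.0 (highly suspect)


@dataclass
class Reading:
    """One incoming value plus the lineage metadata it carries."""
    value: Any
    source_system: str
    generated_at: datetime
    ingested_at: datetime
    synthetic: bool = False   # True if computed rather than read from a sensor
    quality: str = "unknown"
    annotations: List[GuardAnnotation] = field(default_factory=list)


def annotate(reading: Reading, model: str, version: str, suspicion: float) -> Reading:
    """Attach a guard annotation, clamping the score to the assumed 0..1 scale."""
    clamped = max(0.0, min(1.0, suspicion))
    reading.annotations.append(GuardAnnotation(model, version, clamped))
    return reading
```

Keeping annotations as a list lets more than one guard algorithm leave its mark on the same reading, which matters if you later combine several detectors.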
How would the Data Virus Guard deal with suspect data?
Based on the rules of your data policies, the data judged as suspect may be set free to flow into your enterprise, discarded as if it never existed, or kept in containment ponds for further inspection and handling. In the first case, if you let the data into the enterprise annotated as suspect, data scientists who work with it will see that it is suspect. If you have automated algorithms that make decisions, they could use the suspicion score to bias their decision thresholds.
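The three outcomes, and the idea of biasing a downstream threshold, could be sketched roughly as follows. The cut-off values, the `Action` names, and the linear bias are all assumptions for illustration; real values would come from your own data policies.

```python
from enum import Enum


class Action(Enum):
    ADMIT = "admit"            # flows into the enterprise, still annotated
    QUARANTINE = "quarantine"  # held in a containment pond for inspection
    DISCARD = "discard"        # dropped as if it never existed


def route(suspicion: float, quarantine_at: float = 0.5, discard_at: float = 0.95) -> Action:
    """Map a 0..1 suspicion score to a policy action (thresholds are illustrative)."""
    if suspicion >= discard_at:
        return Action.DISCARD
    if suspicion >= quarantine_at:
        return Action.QUARANTINE
    return Action.ADMIT


def biased_threshold(base: float, suspicion: float, max_shift: float = 0.2) -> float:
    """Raise a decision threshold in proportion to how suspect the input is."""
    return min(1.0, base + max_shift * suspicion)
```

For example, an automated process that normally acts at a confidence of 0.7 would demand roughly 0.8 when its input carries a suspicion score of 0.5.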
What are characteristics of a Data Virus Guard?
In the search for "the best ways" to guard against a data virus, a few criteria have emerged that make the system practical. Firstly, it has to work on all common types of data. To be truly useful in an enterprise setting, the Data Virus Guard can't work with only strings or only integers. Secondly, its determination of whether data is suspicious must be very fast. How fast? As fast as practically possible, because the half-life of data value is short. This is a classic "risk vs reward" trade-off, however, and can be decided on a scenario by scenario basis. Thirdly, it must be able to learn, and adjust on its own, what constitutes normal (not-suspicious) data. Without this last capability, I suspect enterprises would start strong with a Data Virus Guard, but then find it out of date as other pressing matters trump updating it with the latest Data Virus identification models. In summary, it must work with all types of data, it must be fast, and it must learn on its own.
How would you implement a Data Virus Guard?
Putting together a Data Virus Guard can be a straightforward endeavor. By blending a stream processing framework with a self-tuning "normal" state algorithm, it would be possible to identify, and annotate, data flows that deviate from some norm (be it values, range of values, patterns of values, times of arrival, etc.). One could envision a solution coming to life by using, for example, Storm, the open source streaming technology that powers Twitter, and a frequency histogram implemented as a Storm "bolt" (the processing unit of a Storm network) to discern out-of-norm conditions.
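Here is a rough sketch of the frequency histogram idea in plain Python. It is deliberately not Storm code: the class would sit inside whatever processing unit your streaming framework provides. The class name, the warm-up period, and the scoring formula (one minus a value's frequency relative to the most common bin) are my assumptions, not an established algorithm.

```python
import math
from collections import Counter
from typing import Hashable, Optional


class FrequencyHistogramGuard:
    """Self-tuning histogram: values seen rarely (or never) score as suspicious."""

    def __init__(self, warmup: int = 100, bin_width: Optional[float] = None):
        self.counts: Counter = Counter()
        self.seen = 0
        self.warmup = warmup        # observations to learn from before scoring
        self.bin_width = bin_width  # if set, numeric values are bucketed

    def _key(self, value: Hashable) -> Hashable:
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if self.bin_width is not None and is_number:
            return math.floor(value / self.bin_width)
        return value  # strings and other hashable types are counted as-is

    def score(self, value: Hashable) -> float:
        """Return 0.0 for the most common bin, 1.0 for a never-seen bin."""
        if self.seen < self.warmup or not self.counts:
            return 0.0  # still learning what "normal" looks like
        most_common = max(self.counts.values())
        return 1.0 - self.counts[self._key(value)] / most_common

    def observe(self, value: Hashable) -> float:
        """Score the value, then fold it into the histogram."""
        suspicion = self.score(value)
        self.counts[self._key(value)] += 1
        self.seen += 1
        return suspicion
```

Because it only needs hashable keys, the same guard handles strings, integers, and bucketed floats, and each lookup is a dictionary access, so it is fast. One design caveat: `observe` learns from everything it sees, so a patient attacker could slowly shift what counts as normal. Learning only from values that score below some cut-off would be one way to resist that.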
Admittedly, using a frequency histogram would create a weak Data Virus Guard, but it would get the Data Virus Guard framework off the ground and be easy to put in place. And by using Storm as the underlying stream processing framework, swapping in a more powerful "out of norm" algorithm would be relatively easy. Do you go with a Markov chain, a Boltzmann machine, or even the very interesting Hierarchical Temporal Memory approach of Numenta? This would all depend upon your system, the characteristics of the data you're ingesting, and the number of false positives (and false negatives) your enterprise can withstand. Of course, you could even go further and apply all three of the approaches and come up with some weighted average for discerning whether a piece of data is suspicious.
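The weighted average could be as simple as the sketch below. The function name and the `Detector` signature are hypothetical, and each detector is assumed to return a suspicion score between 0.0 and 1.0.

```python
from typing import Any, Callable, Sequence, Tuple

# A detector is any callable that maps a value to a 0..1 suspicion score.
Detector = Callable[[Any], float]


def combined_suspicion(value: Any, detectors: Sequence[Tuple[Detector, float]]) -> float:
    """Weighted average of the scores from several (detector, weight) pairs."""
    total_weight = sum(weight for _, weight in detectors)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(detector(value) * weight for detector, weight in detectors)
    return weighted / total_weight
```

The weights are where your tolerance for false positives and false negatives shows up: a detector that cries wolf too often gets a smaller weight, rather than being removed outright.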
This is a forward-looking post about an issue we can expect in enterprises as companies embrace the concepts of Big Data, Advanced Analytics, the Internet of Things, and true Business Intelligence: a Data Virus. It is also about what we can do about it: a Data Virus Guard. My work in this area is still evolving, and is intended to keep our clients a few steps ahead of what's coming. Bad data plagues all enterprises. It can be incomplete, malformed, incorrect, unknown, or all of these. Unfortunately, we now also have to watch for malicious data. Putting in safeguards for this condition now, before the malicious data issue becomes rampant, is a much cheaper proposition than re-hydrating your enterprise data stores once a contamination occurs. If nothing else, even if you don't implement a Data Virus Guard, be sure you have your data policies in place for addressing this coming issue.