The Infosys Labs research blog tracks trends in technology with a focus on applied research in Information and Communication Technology (ICT)


December 16, 2013

Data Virus Guard

Clients are, or soon will be, ingesting all sorts of data thanks to information brokerages and the Internet of Things (IoT), and processing that data in novel ways thanks to the Big Data movement and Advanced Analytics.  Decisions made through business intelligence systems require that the data being used is trusted and of good quality.  How will companies ensure that the data being ingested and acted upon is untainted?  This has been an interest of mine as I work to protect the integrity of my clients' decision-making processes and systems.

Last year I shared a forward-looking concern about the concept of a data virus: data that has been purposefully manipulated to render operations on an entire data set flawed, and that perpetuates its induced error. As noted in the original What will you do about a Data Virus? blog, a tricky situation arises when data fed into the enterprise is determined to be corrupted.  How do you unroll all the downstream systems that have made decisions based on the bad data?  Containing this data contamination is tricky.  Many legacy enterprise systems simply don't have the ability to "roll back" or "undo" decisions and/or persisted synthetic information.  So, the first and obvious line of defense is blocking, or sequestering, suspect data before it enters the enterprise.  Much as a network firewall blocks suspect requests to ports or machines in your network, a similar concept can be employed in many situations as a first line of defense: a Data Virus Guard, if you will.

Please keep in mind that my focus has been on streaming sources of data, which are typically sensor based (maybe a velocity reading, or temperature, or humidity, or ambient light, ...), associated with a thing (a car, train, or airplane, for example), and arriving for processing in a streaming manner.  What I'm sharing in this blog could be applied to other kinds of "streaming" things as well, such as feeds from Social Web systems.

What is a Data Virus Guard? 
A Data Virus Guard is a logical unit that has the responsibility of identifying, annotating, and dealing with suspicious data. 

Where should a Data Virus Guard be deployed?
A Data Virus Guard should be deployed at the initial ingestion edge of your data processing system, within the data capture construct.  The data capture sub-system normally has the responsibility of filtering for missing data, tagging, and/or annotating anyway, so it is the perfect location to deploy the Data Virus Guard capability.  If you identify and contain suspect data at the "edge", then you run less risk of it contaminating your enterprise.

How do you Identify a Data Virus?
This area of the Data Virus Guard is what drew my research interest: how do you go about discerning between normal data and data that has been manipulated in some way?  The approach I've been taking focuses on steady state data flows because I'm interested in a generalized solution, one that can work in most cases.  If one can discern what constitutes steady state, then deviations from steady state can be used as a trigger for action.  More elaborate, and case specific, identification approaches can be created and placed easily within the framework I'm proposing.
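
To make the steady-state idea concrete, here is a minimal sketch of a detector that learns a rolling notion of "normal" and scores deviations from it. The class name, window size, and z-score threshold are illustrative assumptions on my part, not anything prescribed above.

```python
from collections import deque
from statistics import mean, stdev

class SteadyStateMonitor:
    """Sketch: score how far a reading deviates from a rolling 'steady state'."""

    def __init__(self, window_size=500, z_threshold=4.0):
        self.window = deque(maxlen=window_size)   # recent history defines "normal"
        self.z_threshold = z_threshold

    def score(self, value):
        """Return a suspicion score in [0, 1]; 0.0 means the value looks like steady state."""
        if len(self.window) < 30:                 # not enough history to judge yet
            self.window.append(value)
            return 0.0
        mu, sigma = mean(self.window), stdev(self.window)
        self.window.append(value)                 # keep learning the norm as data flows
        if sigma == 0:
            return 0.0 if value == mu else 1.0
        z = abs(value - mu) / sigma
        return min(z / self.z_threshold, 1.0)     # saturate at 1.0 for large deviations
```

Because the window keeps sliding, the notion of "normal" adjusts on its own, which matters for the self-learning characteristic discussed later in this post.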

What kind of Annotation do you do?
As data enters an enterprise, ideally there is meta-data that helps with maintaining data lineage.  That is, what was the source system that produced the data, what is the "quality" of the data, when was the data generated, when did the data enter the enterprise, is it synthetic (computed versus a sensor reading), and so on.  Added to this could be an annotation that indicates which Data Virus Guard algorithm was applied (model, version) and the resulting score of likely suspicion.
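
As a sketch of what such an annotation envelope might look like, the structure below carries the lineage fields mentioned above plus the guard's verdict. The field names are assumptions made for illustration, not an established schema.

```python
from dataclasses import dataclass, field
import time

@dataclass
class AnnotatedReading:
    """Illustrative data-lineage envelope; field names are assumptions, not a standard."""
    value: float
    source_system: str                          # which sensor or system produced the reading
    generated_at: float                         # when the reading was produced
    ingested_at: float = field(default_factory=time.time)
    synthetic: bool = False                     # computed value versus a raw sensor reading
    guard_model: str = "steady-state-zscore"    # which Data Virus Guard algorithm was applied
    guard_model_version: str = "0.1"
    suspicion_score: float = 0.0                # 0.0 = looks normal, 1.0 = highly suspect
```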

How would the Data Virus Guard deal with suspect data?
Based on the rules of your data policies, data judged as suspect may be set free to flow into your enterprise, discarded as if it never existed, or kept in containment ponds for further inspection and handling.  In the first case, if you let it into the enterprise annotated as suspect, data scientists who work with the data will see that it is suspect.  If you have automated algorithms that make decisions, they could use the suspicion score to bias the thresholds for making a choice.
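
A minimal sketch of such a policy, assuming an annotated reading like the one sketched earlier and two made-up thresholds that would, in practice, come from your data policies:

```python
def route(reading, pass_below=0.3, quarantine_below=0.8):
    """Illustrative policy routing; the thresholds are assumptions, not recommendations."""
    if reading.suspicion_score < pass_below:
        return "pass"            # flows into the enterprise, annotation intact
    if reading.suspicion_score < quarantine_below:
        return "quarantine"      # held in a containment pond for further inspection
    return "discard"             # treated as if it never existed
```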

What are characteristics of a Data Virus Guard?
In the search for "the best ways" to guard against a data virus, a few criteria have emerged that make the system practical.  Firstly, it has to work on all common types of data.  To be truly useful in an enterprise setting, the Data Virus Guard can't work with only strings or only integers; it must work on all common types to provide true utility.  Secondly, its determination of whether data is suspicious or not must be very fast.  How fast?  As fast as practically possible, as the half-life of data value is short.  This is a classic "risk vs. reward" trade-off, however, and can be evaluated on a scenario-by-scenario basis.  Thirdly, it must have the ability to learn and adjust on its own what constitutes normal, or not-suspicious, data.  Without this last capability, I suspect enterprises would start strong with a Data Virus Guard, but it would then find itself out of date as other pressing matters trump updating it with the latest Data Virus identification models.  In summary, it must work with all types of data, it must be fast, and it must learn on its own.

How would you implement a Data Virus Guard?
Putting together a Data Virus Guard can be a straightforward endeavor.  By blending a stream processing framework with a self-tuning "normal" state algorithm, it would be possible to identify, and annotate, data flows that deviate from some norm (be it values, ranges of values, patterns of values, times of arrival, etc.).  One could envision a solution coming to life by using, for example, Storm, the open source streaming technology that powers Twitter, and a frequency histogram implemented as a Storm "bolt" (the processing unit of a Storm topology) to discern out-of-norm conditions.
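
An actual bolt would typically be written against Storm's Java API; the framework-agnostic Python sketch below only illustrates the frequency-histogram logic such a bolt could wrap. The bucketing scheme, warm-up count, and rarity cutoff are all illustrative assumptions.

```python
from collections import Counter

class FrequencyHistogramCheck:
    """Sketch of the frequency-histogram idea a stream-processing bolt could implement.

    Values are bucketed; a bucket that has rarely (or never) been seen before
    yields a high suspicion score.
    """

    def __init__(self, bucket_width=1.0, rare_fraction=0.001):
        self.bucket_width = bucket_width
        self.rare_fraction = rare_fraction
        self.histogram = Counter()
        self.total = 0

    def score(self, value):
        bucket = round(value / self.bucket_width)
        seen = self.histogram[bucket]          # how often this bucket has appeared so far
        self.histogram[bucket] += 1
        self.total += 1
        if self.total < 1000:                  # still learning what "normal" looks like
            return 0.0
        return 1.0 if seen / self.total < self.rare_fraction else 0.0
```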

Admittedly, the use of a frequency histogram would make for a weak Data Virus Guard, but it would get the Data Virus Guard framework off the ground and be easy to put in place.  However, by using Storm as the underlying stream processing framework, swapping in a more powerful "out of norm" algorithm would be relatively easy.  Do you go with a Markov chain, a Boltzmann machine, or even the very interesting Hierarchical Temporal Memory approach of Numenta? That would all depend upon your system, the characteristics of the data you're ingesting, and the number of false positives (and false negatives) your enterprise can withstand.  Of course, you could even go further and apply all three approaches and come up with some weighted average for discerning whether a piece of data is suspicious.
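
Combining detectors could be as simple as the weighted average below, assuming each detector exposes a score(value) method returning a value in [0, 1]; the weights are, of course, something you would tune per scenario.

```python
def ensemble_score(value, detectors_with_weights):
    """Blend several 'out of norm' detectors into one suspicion score.

    detectors_with_weights: list of (detector, weight) pairs, where each
    detector exposes score(value) -> float in [0, 1].
    """
    total_weight = sum(weight for _, weight in detectors_with_weights)
    return sum(det.score(value) * weight
               for det, weight in detectors_with_weights) / total_weight

# e.g. ensemble_score(42.0, [(SteadyStateMonitor(), 0.6), (FrequencyHistogramCheck(), 0.4)])
```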

Summary
This is a forward-looking post about an issue we can expect in enterprises as companies embrace the concepts of Big Data, Advanced Analytics, the Internet of Things, and true Business Intelligence: a Data Virus, and about what we can do about it: a Data Virus Guard.  My work in this area is still evolving, and is intended to keep our clients a few steps ahead of what's coming.  Bad data plagues all enterprises.  It can be incomplete, malformed, incorrect, unknown, or all of these.  Unfortunately, we now also have to watch for malicious data.  Putting safeguards in place now, before the malicious data issue becomes rampant, is a much cheaper proposition than re-hydrating your enterprise data stores once a contamination occurs. If nothing else, if you don't implement a Data Virus Guard, be sure you have data policies in place for addressing this coming issue.

January 23, 2013

Why don't Bees Teleconference while Building a HIVE?

Self Organization in Teams-Learnings from Nature

J. Srinivas, Shilpi Jain, Sitangshu Supakar

What do a pack of wolves, a pride of lionesses, bees, and ants have in common? What can we learn from them? What is self-organization (SO), and how does it form?

We are exploring different ways to induce this behavioral skill in team members for greater commitment, motivation, and accountability to their work. Many of us may wonder what is so great about it; after all, we are self-organized and perform our daily routine without fail. But the question is: can we perform equally well in a project, during a crisis, or with reduced resources?

NATURE has fine-tuned the self-organized system. Be it the conduct of animals, insects, or eco-systems, nature organizes optimally. What are the attributes of self-organization derived from nature?  Can project teams organize themselves the way nature does? Is it meaningful to compare the dynamism of NATURE with the dynamism that organizational teams face?

Before finding answers, let's understand with a few examples how self-organization is an adaptive attribute in animals and insects. How do packs of animals like wolves and lionesses hunt? How do honey bees organize their affairs so well in their hive and devote themselves to the welfare and survival of their colony?

Wolves are known for their intelligence and social behavior. They organize themselves for the hunt and for the care of their group. The motive of the pack is to be as successful as possible, even if its members are not individually the strongest. The whole objective is to make the hunt a success so that every member can get sufficient food. Each wolf in the pack plays a role. There is always a leader in the pack, but while hunting it rarely interferes with or directs its fellow animals (Michael, Wolf., 1995-2005). Another interesting thing about wolves is their sense of communication; they follow communication protocols and communicate in many ways (body language, gesture, and expression). The choice of communication means is highly dependent on the distance between two wolves. If they are close to each other the communication is non-vocal; similarly, when they are in a large group, they do 'mob greetings'.

They share a common objective - food for the pack. They have communication protocols and established patterns for hunting, and individuals know how to respond to change in order to meet the objective. Even their play mirrors the hunt patterns.

Let's see how bees organize themselves and find flower nectar. Bees are deaf, so they perform a series of movements called the 'waggle dance'. These dancing steps help identify the source of nectar and teach other workers the location of a food source 150 meters or more away from the hive. The bees have orchestrated movements for communication. When hunting for flower nectar, an experienced bee walks straight ahead, vigorously shaking its abdomen and producing a buzzing sound with the beat of its wings (Debbie, 2011). The distance and speed of this movement communicate the distance of the food site to the other bees. Another exciting aspect is the group size: a bee colony varies from 20,000 to 80,000 worker bees, and they all work in coordination with each other without much direction or guidance.

These examples of honey bees display the same elements of self-organization seen in the wolf pack: adherence to a shared objective, a set of practices, patterns of behavior, and communication. They show the benefits of self-organization, i.e. commitment, efficiency, and self-sufficiency for the community.  Members of the community organize themselves repeatedly and continuously to meet changing requirements.

Direct communication with partners and iterative processes help control conflicting interests and help teams adapt quickly to unpredictable and rapidly changing environments (Monteiro et al., 2011).

  • In research conducted by Hoda et al. (2011), they found that "balancing freedom and responsibility, balancing cross-functionality and specialization, balancing continuous learning and iteration pressure uphold the fundamental conditions of self-organization at certain level."

The Agile manifesto stresses self-organizing teams, and we explored which techniques help teams achieve a sense of 'teamness' and spontaneous adaptability: what makes it work in short sprints, and what will make it work in the long run. In subsequent blogs we will look at how the concepts of self-organization can be brought in a structured manner and help teams adapt in a changing environment. The resulting framework would help us recognize when SO can form, or create the right environment for it.

Our goal is to deconstruct the key concepts of the above examples and apply them to real teams, making it spontaneous and easy to transform into a self-organizing team. Support for the concepts comes from a couple of papers we looked at.

REFERENCES

Cao, L., & Ramesh, B. (2007). Agile software development: ad hoc practices or sound principles? IEEE Computer Society.

Debbie, H. (2011). Honey Bees - Communication Within the Honey Bee Colony. Retrieved September 13, 2012, from About.com: http://insects.about.com/od/antsbeeswasps/p/honeybeecommun.htm

Hamdan, K., & Apeldoorn. (1989). How Do Bees Make Honey? Retrieved September 4, 2012, from A. Countryrubes Web site: http://www.countryrubes.com/D07529EF-066D-494F-A481-AB6EF6A257E9/FinalDownload/DownloadId-886680BD5CAAD9474A1D646219C0FAE6/D07529EF-066D-494F-A481-AB6EF6A257E9/images/How_do_bees_make_honey_update_9_09.pdf

Hoda, R., Noble, J., & Marshal, S. (2011). Developing a grounded theory to explain the practices. (S. S. Media, Ed.) Empirical Software Engineering.

Karhatsu, H., Ikonen, M., Kettunen, P., Fagerholm, F., & Abrahamsson, P. (2010). Building blocks for self-organizing software development teams a framework model and empirical pilot study. International Conference on Software Technology and Engineering (ICSTE), (pp. 297-304). Helsinki, Finland.

Michael, Wolf. (1995-2005). What are Wolves. Retrieved September 4, 2012, from Wolf Ranch Foundation: http://www.wolveswolveswolves.org/WhatAreWolves.htm

Monteiro, C. V., da Silva, F. Q., dos Santos, I. R., Felipe, F., Cardozo, E. S., Andre, R. G., et al. (2011). A qualitative study of the determinants of self-managing team effectiveness in a scrum team. Proceedings of the 4th International Workshop on Cooperative and Human Aspects of Software Engineering (pp. 16-23). ACM.

[1] The image of wolves hunting is taken from the source: http://qpanimals.pbworks.com/w/page/5925166/Grey%20Wolf

[2] The image 'bees at work' is taken from the source: http://openlearn.open.ac.uk/mod/resource/view.php?id=387640


Continue reading "Why don't Bees Teleconference while Building a HIVE?" »

August 3, 2012

Importance of Being Human

Unlike most living species on the earth, human beings are probably the most complicated of the lot. Not only are they the most intelligent species around, they are also arguably the most social and civilized beings.

As human beings grow from children into adults, they also become rigid in their thoughts. The guiding principles of an individual are the net product of that individual's immediate family, his or her neighborhood (society / friends), the education system, and access to other cultures (through technology or travel). Depending on the strength of each of these influences, the thoughts of an individual take shape.

Organizations are similar to human beings to a large extent. When an organization is small, it is like a child whose mannerisms are extremely chaotic. Like a child, the organization assimilates all the inputs coming from its environment without setting any priorities; it lacks clear direction and goals, and its policies and procedures remain unclear. As the organization matures, like a teenager, it develops clarity of goals and direction, and there is consistency in priorities as well-defined policies and procedures emerge. The organization gains stability at this stage. Finally, like a full-grown adult with unique belief systems and thought processes, the organization develops its own well-defined values, which result in a distinctive culture. At this stage the organization delivers high performance (outstanding, sustainable results) with a clear statement of mission that creates a sense of esprit de corps. However, along with this initial high performance comes rigidity of values as well.

The similarities don't end here. Just like human beings, organizations also face frequent survival issues, and many of them eventually die. A brief analysis of the Fortune 500 list, from 1955 when it was first published, shows that of the 1950+ companies that have made it onto the list, only 66 have been able to remain on it consistently. Only the fittest survive. Darwinian principles at WORK, one may say!

There are many reasons that can be attributed to the failure of organizations: managerial errors, ill-informed decisions, greed/risk, availability of funding, corporate culture, hubris, distance from reality, and creative destruction. One may argue that Schumpeterian principles remain a major cause of the death of many organizations, where new innovations create a new world order and destroy existing ones. However, a re-look at some of the other reasons mentioned above shows that most of the reasons for failure boil down to the failure of people.

A look at the 2008 recession shows that of the 20 biggest corporate bankruptcies ever filed, 8 were filed after 2008, and 6 of those 8 were financial services organizations. Financial services, unlike manufacturing, are among the most directly customer-facing businesses. Add to that the enormous amount of scrutiny that financial services organizations undergo on a regular basis, which should ensure that they remain lean and fit. The question, therefore, is why did this industry perform so badly during the 2008 recession? A larger question that we may also seek to answer is why do some organizations survive while others die?

Financial services firms by their very nature perform extremely complicated operations, nearly unintelligible to the layperson (even a highly qualified one). Crises such as the sub-prime lending crisis of 2008, the effects of which the world is still coping with, are so complicated that even the people who designed the mechanisms couldn't fully understand them. The Black-Scholes equation, the holy grail of investors, was at the core of financial markets. It opened up a new world of ever more complex investments, blossoming into a gigantic global industry. But when the sub-prime market turned sour, the darling of the financial markets became the Black Hole equation. The downside was the invention of ever-more complex financial instruments whose value and risk were increasingly opaque. Myriad organizations hired mathematically talented analysts to develop similar formulae, and in the process created an industry that remained extremely opaque to the outside world.

However, what the smart people disastrously forgot was to ask how RELIABLE the answers would be if market conditions changed, i.e., if the sentiments of the people who had invested in those financial products changed.

Was an equation to blame for the financial crash, then? Yes and no. Black-Scholes may have contributed to the crash, but the mathematical models grossly failed to represent reality adequately. The reality they failed to consider was PEOPLE, who never fit neatly into the complex mathematical equations.

What would Keynes, one of the greatest minds of the 20th century, do in today's inordinately poor economic scenario? He would be extremely unhappy, for he was strictly against the idolatry of market economics, which incidentally most organizations seem to follow even today. As he described the market: "the worm that had been gnawing at the insides of modern civilization... the over-valuation of the economic criterion". According to him, the market was made for human beings - not human beings to serve the market. He strongly believed that nothing had value except the experiences of individuals.

What do we learn, therefore, from the recent past as well as from the views of the past century? It becomes extremely evident that organizations, however smart they may be, must put PEOPLE at the core of their offering. Instead of making things extremely complicated and opaque, it makes sense for smart organizations to SIMPLIFY their operations. Instead of equations driving them to deliver value, it should be the experiences of individuals that drive them to success.

As elaborated in the initial paragraphs, organizations are the sum total of their own experiences, which at a later date lend them rigidity. Organizations become slaves to the same rules that they developed to generate high performance. They forget that it is not the rules they are supposed to serve. It is the people who are to be served, and that is where the rules must come from.

This is where organizations differ from human beings. While human beings have the unique capacity to express and read emotions and adapt themselves, most organizations are incapable of doing so. Thus, organizations must create FLEXIBLE systems, through which they can read the emotions of the people they are meant to serve, LEARN from those people, and ADAPT.

Flexible systems within organizations must also enable them to trash their own beliefs and challenge their own assumptions.

P.S.: Some scientific evidence suggests whales may be more intelligent than human beings. But is the self-described "social animal", the human being, ready to challenge its own beliefs?

 

http://www.preservearticles.com/2011102115886/are-we-the-most-intelligent-beings-on-earth.html

http://www.centerod.com/2012/02/3-stages-organizational-development/

http://www.dirjournal.com/business-journal/some-major-us-companies-that-went-bankrupt/

http://www.infosys.com/building-tomorrows-enterprise/Documents/smarter-organizations.pdf

http://www.forbes.com/sites/kenmakovsky/2012/05/31/why-do-companies-fail/

November 23, 2011

Big Data Analytics: The role of granularity of the information model

Gartner, McKinsey, and other analysts have together popularized the opportunity of 'Big Data', often referring to the massive amount of content and data available today. It presents an opportunity to develop techniques for sifting and processing data at massive scale such that it can be consumed by users or even by systems and processes (machine-readable and human-readable). This is the opportunity for Big Data Analytics.

What I'd like to describe is the importance of having an information model that can unify both unstructured content and structured data, and a few principles for designing or selecting IT infrastructure to support this new model.

Tim Berners-Lee's vision of Linked Open Data (abbreviated as LOD) espouses a vision for Information Management based on RDF. Now, RDF is a language for conceptual description, or information modeling. It is important to distinguish between the logical structure of a Subject-Predicate-Object (SPO) expression and the many serializations of RDF, like RDF/XML, N3, or Turtle.
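
To make that distinction concrete, here is a small sketch using the Python rdflib library: one SPO statement, emitted in two different serializations. The example namespace, URIs, and predicate are invented purely for illustration.

```python
from rdflib import Graph, Literal, Namespace

# Hypothetical namespace, used only for this illustration.
EX = Namespace("http://example.org/")

g = Graph()
# One Subject-Predicate-Object statement; the logical SPO structure stays the same
# no matter how it is serialized.
g.add((EX.InfosysLabs, EX.publishes, Literal("research on Big Data Analytics")))

print(g.serialize(format="turtle"))  # Turtle serialization
print(g.serialize(format="xml"))     # RDF/XML serialization of the same triple
```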

Why is LOD important in the context of 'Big Data Analytics'?
LOD espouses 3 primary things:
  1. Make Information Identifiable - use HTTP URIs as identifiers for all entities.
  2. Make Information Open and Shareable - store all information in RDF and expose it as a service on an HTTP endpoint. (Remember the whole idea of Services in SOA!)
  3. Make Information Linkable - re-use URIs when you refer to the same entity. Build interlinks to bridge conceptual worlds across networks and systems.
Why is a commitment to the RDF information model absolutely important?
3 reasons - Flexibility, Flexibility, and Flexibility. RDF is schemaless and offers the right model for dealing with dynamic information at massive scale. To design a schema around information, you always make two assumptions: a) that you know enough about a domain, and b) that the domain of knowledge itself remains fairly static. The second is hardly ever true, even if you are a 'domain expert' - a phrase, and a claim, that I have found most salesmen make.

Entity Oriented Models and Indexing of Information
Earlier I mentioned that the Big Data information model must be able to unify the structured world (schemas and relational databases) and the unstructured world of posts, web pages, and documents. RDF fills this requirement neatly.

Unstructured text today is primarily processed by Information Retrieval technology - understood by the masses as 'Search' - while relational and XML databases represent the structured data world. Search relies on an information model based on documents and words (or phrases). RDF provides a graph-based data model, without mandating a schema, that defines entities as the key elements of the information model. The difference between words and entities is that an entity may be known by many words or phrases.

The implications are many and useful. Today, search results provide a list of documents or web resources relevant to your query, and relevance is primarily 'word' driven. Using RDF, you can break out of the document and permit information use at the level of entities, which are more fine grained.
Some search users (especially the ones targeted by enterprise search technology) have precise information needs that are not well served by systems that rely on words in documents as their smallest unit of information. These users want answers, and these answers are typically about some aspect or feature of a collection of entities.
Therefore, what we need is the ability to uniquely identify (or simply tag) entities present in content and data in a consistent way, to store information using the SPO model espoused by RDF so as to capture the various connections between entities (or between entities and literals) and build a massive connected network or information graph, and to make those entities findable and reference-able on the web or within some restricted information space.
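
As a sketch of what entity-level access could look like, the snippet below tags the same entity consistently across a 'document' statement and a 'data' statement, then asks an entity-oriented question with SPARQL (again via rdflib). All URIs, predicates, and values are invented for illustration.

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")

g = Graph()
# The same entity is referenced by a document and by structured facts.
g.add((EX.doc42, EX.mentions, EX.AcmeCorp))
g.add((EX.AcmeCorp, EX.headquarteredIn, EX.Bangalore))
g.add((EX.AcmeCorp, EX.revenue, Literal(1200000)))

# An entity-level question, rather than a keyword search over documents.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?company ?revenue WHERE {
        ?doc ex:mentions ?company .
        ?company ex:headquarteredIn ex:Bangalore ;
                 ex:revenue ?revenue .
    }
""")
for company, revenue in results:
    print(company, revenue)
```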

This kind of information infrastructure will enable dynamic exploration of context and interaction between entities, and also enable punctualization (to borrow terminology from Actor-Network Theory), where entire networks of entities can be studied as a whole (as behaving like a single entity). This is what I would highlight as the flexibility of granularity in Information Analytics (or Big Data Analytics, if you like it that way).


May 27, 2011

Next Gen BI based on Semantic Technology

Here are my thoughts on how BI tools and technology can leverage semantic technology.

  • BI based on the relational model of data is no longer viable, with data complexity exceeding the limits afforded by RDBMS technology. Other than scale, there is also the issue that RDBMS queries are heavily coupled to the physical data organization, while semantic querying in SPARQL permits complete abstraction of the physical organization of data (a sketch follows this list). This is extremely useful when executing BI queries over distributed data sources.

  • Another important issue to consider is the opportunity to use inference capabilities provided by semantic repositories in BI analytics to draw conclusions from what is already stated  in the database.

  • The other is the ability to use semantic analysis and information extraction to populate knowledge bases based on semantic technology that can combine structured and unstructured data in a uniform, schema-less manner, providing the benefit of schema flexibility and information linking. This is very useful for information integration.

  • Scalable consistency checking within BI is now possible through the use of semantic technology. Rule-based consistency checking has been around for quite some time; however, semantic technology provides a standards-based stack of knowledge languages, data models, and inference engines to facilitate better adoption.
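
As a rough sketch of the abstraction point in the first bullet, the query below names concepts and relationships rather than tables, joins, or file layouts, and could be sent unchanged to any SPARQL endpoint that exposes the data. The endpoint URL, prefix, and properties are invented for illustration; the SPARQLWrapper library is just one convenient way to submit the query from Python.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical endpoint exposing sales data as RDF.
endpoint = SPARQLWrapper("http://example.org/sparql")
endpoint.setQuery("""
    PREFIX ex: <http://example.org/schema/>
    SELECT ?region (SUM(?amount) AS ?totalSales) WHERE {
        ?order a ex:Order ;
               ex:region ?region ;
               ex:amount ?amount .
    }
    GROUP BY ?region
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for row in results["results"]["bindings"]:
    print(row["region"]["value"], row["totalSales"]["value"])
```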

June 2, 2010

NextGen Data Warehousing Trends - Part I

"Necessity is the mother of all inventions" - this quote holds true today as well, except the fact that we are starting to realize the necessities based on the inventions that are shaping up. Data Warehousing is certainly no exception, and over the past years we have seen various avatars of Data Warehousing shaping up organizations, and driving their growth. To name a few - Enterprise Information Management, Operational Intelligence, Real-time/Near Real-time Data Warehousing, BI As a Service (BIaS), in-Memory analytics, Master Data Management etc.

Continue reading "NextGen Data Warehousing Trends - Part I" »

April 29, 2010

Location Intelligence - Part 2

 

Every organization leverages technical innovations to better understand its business markets, customers, and competitors, with an aim to improve productivity and business performance and generate higher revenue. Location Intelligence is one such technical innovation; it involves integrating and analyzing location data to build insights and take active decisions based on market and customer geography.

Continue reading "Location Intelligence - Part 2" »

March 26, 2010

Location Intelligence

I have recently been reading about interesting developments in Geospatial data usage patterns across the globe. Until now, it was perceived to be applicable only to specialized industries like Oil & Gas, Transportation/Logistics, and Mining & Exploration. Ever thought about Geospatial data driving intelligence for your business?

Continue reading "Location Intelligence" »

February 8, 2010

BI on ECM - Who says not possible?

BI aids business decision making by providing different sets of analyses on business data. The assumption until now was that BI is only possible on databases, as you need structured data to do the analysis, but the scenario has changed rapidly in the last few years. Companies are using BI on top of ECM to perform different kinds of analytics.

Continue reading "BI on ECM - Who says not possible?" »

February 2, 2010

BI Open Source Story - Are we there yet

The recession reminded us of Darwin's theory of 'Survival of the Fittest', and that's what we are seeing: the companies which have focused on optimizing costs, efficiency, and productivity are the ones flying their flags high today. Those are the companies which managed a balancing act of controlling costs while retaining their best talent. Open Source is one horizon such companies are starting to venture into.

Continue reading "BI Open Source Story - Are we there yet" »

January 19, 2010

Agile BI - Why it makes Business Sense

I, like most BI dreamers and practitioners who are challenged by the latest in the BI world, have been trying to make some sense of the mad rush towards a host of BI solutions which claim to reduce the latency between data creation and decision taken. Here is an effort to simplify the whole perspective of Agile BI.

Continue reading "Agile BI - Why it makes Business Sense" »