The Infosys Labs research blog tracks trends in technology with a focus on applied research in Information and Communication Technology (ICT)

Big Data Analytics: The role of granularity of the information model

Gartner, McKinsey and other analysts have popularized the opportunity of 'Big Data', a term that usually refers to the massive amount of content and data available today. It presents an opportunity to develop techniques for sifting and processing data at massive scale so that the results can be consumed by users or by systems and processes (that is, in both human-readable and machine-readable form). This is the opportunity for Big Data Analytics.

What I'd like to describe is the importance of having an information model that can unify both unstructured content and structured data, along with a few principles for designing or selecting IT infrastructure to support this new model.

Tim Berners-Lee's vision of Linked Open Data (abbreviated as LOD) espouses an approach to Information Management based on RDF. RDF is a language for conceptual description, or information modeling. It is important to distinguish between the logical structure of a Subject-Predicate-Object (SPO) expression and the many serializations of RDF, such as RDF/XML, N3 or Turtle.
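
To make that distinction concrete, here is a minimal Python sketch. It is not taken from any RDF library: the example.org URIs and the helper names (term, to_ntriples, to_turtle_style) are illustrative assumptions. The same list of SPO triples is written out in two different textual forms, and the logical content does not change.

```python
# A triple is just (subject, predicate, object). All URIs are made up for illustration.
EX = "http://example.org/"

triples = [
    (EX + "TimBL", EX + "proposed", EX + "LinkedOpenData"),
    (EX + "TimBL", EX + "name", '"Tim Berners-Lee"'),
]

def term(t):
    # Assumption: literals are stored pre-quoted; everything else is a URI.
    return t if t.startswith('"') else "<" + t + ">"

def to_ntriples(ts):
    # One complete statement per line.
    return "\n".join(f"{term(s)} {term(p)} {term(o)} ." for s, p, o in ts)

def to_turtle_style(ts):
    # Group statements by subject, so each subject is written only once.
    by_subject = {}
    for s, p, o in ts:
        by_subject.setdefault(s, []).append(f"{term(p)} {term(o)}")
    return "\n".join(
        f"{term(s)} " + " ;\n    ".join(pos) + " ."
        for s, pos in by_subject.items()
    )

print(to_ntriples(triples))
print(to_turtle_style(triples))
```

Both outputs describe the same two statements; only the surface syntax differs.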

Why is LOD important in the context of 'Big Data Analytics'?
LOD espouses three primary things:
  1. Make Information Identifiable - Use HTTP URIs as identifiers for all entities.
  2. Make Information Open and Shareable - Store all information in RDF and expose it as a service on an HTTP endpoint. (Remember the whole idea of Services in SOA!)
  3. Make Information Linkable - Re-use URIs when you refer to the same entity. Build interlinks to bridge conceptual worlds across networks and systems.
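
A small Python sketch of the third point, under stated assumptions: the two datasets, the URIs and the describe helper are all hypothetical. Because both sources re-use the same identifier for the same entity, combining them needs no mapping step.

```python
# Hypothetical identifier shared by two independently produced datasets.
ACME = "http://example.org/id/acme-corp"

crm_data = {
    (ACME, "http://example.org/prop/hasAccountManager", "http://example.org/id/priya"),
}
news_data = {
    (ACME, "http://example.org/prop/mentionedIn", "http://example.org/doc/article-17"),
}

# Both sources name the entity with the same URI, so merging is a plain set union.
merged = crm_data | news_data

def describe(uri, graph):
    """Return every (predicate, object) pair known about one entity."""
    return sorted((p, o) for s, p, o in graph if s == uri)

for predicate, obj in describe(ACME, merged):
    print(predicate, obj)
```

Had the two sources used different local keys for the same company, an explicit reconciliation step would have been needed before the union made sense.
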
Why is a commitment to the RDF information model absolutely important?
Three reasons - Flexibility, Flexibility, and Flexibility. RDF does not require a schema up front and offers the right model to deal with dynamic information at massive scale. To design a schema around information you always make two assumptions: a) that you know enough about the domain, and b) that the domain of knowledge itself remains fairly static. The second is hardly ever true, even if you are a 'domain expert' - a phrase and a claim that I have found most salesmen make.
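
As a rough illustration of that flexibility, the Python sketch below keeps facts as a set of triples. The ex: names and the add and predicates helpers are assumptions made for the example. A kind of fact that nobody anticipated is simply one more predicate, with no table definition to alter.

```python
# An in-memory bag of triples; the "ex:" names are hypothetical.
graph = set()

def add(s, p, o):
    graph.add((s, p, o))

add("ex:order-1", "ex:customer", "ex:acme")
add("ex:order-1", "ex:amount", 1200)

# Later the domain changes and a new kind of fact shows up.
# No schema migration is involved: it is stored like any other statement.
add("ex:order-1", "ex:carbonFootprintKg", 35.5)

def predicates(g):
    """List the distinct predicates currently in use."""
    return sorted({p for _, p, _ in g})

print(predicates(graph))
```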

Entity Oriented Models and Indexing of Information
Earlier I mentioned that the Big Data information model must be able to unify the structured world (schemas and relational databases) and the unstructured world of posts, web pages, and documents. RDF fills this requirement neatly.

Unstructured text today is primarily processed by Information Retrieval technology - understood by the masses as 'Search' - while relational and XML databases represent the structured data world. Search relies on an information model that is based on documents and words (or phrases). RDF provides a graph-based data model that makes entities, rather than words, the key elements of the information model, and it does so without mandating a schema. The difference between words and entities is that an entity may be known by many words or phrases.
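
The difference between words and entities can be sketched in a few lines of Python. The label table, the ex:IBM identifier and the naive substring matching are assumptions for illustration; real entity recognition is considerably harder.

```python
# One entity, many surface forms. The identifier and labels are illustrative.
labels = {
    "ex:IBM": ["IBM", "International Business Machines", "Big Blue"],
}

# Invert to a lower-cased phrase -> entity lookup.
index = {name.lower(): uri for uri, names in labels.items() for name in names}

def entities_in(text):
    """Return the entities whose known labels occur in the text (naive substring match)."""
    lowered = text.lower()
    return {uri for name, uri in index.items() if name in lowered}

print(entities_in("Big Blue reported earnings today"))
print(entities_in("IBM reported earnings today"))
```

A word-based index treats these two sentences as being about different things; an entity-based one resolves both to the same identifier.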

The implications are many and useful. Today, search results provide a list of documents or web resources relevant to your query, and relevance is primarily 'word' driven. Using RDF, you can break the documents open and permit information use at the level of entities, which are more fine-grained.
Some search users (especially the ones targeted by enterprise search technology) have precise information needs that are not well served by systems that rely on words in documents as their smallest unit of information. These users want answers, and these answers are typically about some aspect or feature of a collection of entities.
Therefore, what we need is the ability to uniquely identify (or simply tag) entities present in content and data in a consistent way; to store information using the SPO model espoused by RDF, capturing the connections between entities, or between entities and literals, so as to build a massive connected network or information graph; and to make these entities findable and referenceable on the web or within some restricted information space.
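
A toy version of such an information graph, as a hedged Python sketch: the data, the ex: identifiers and the match and products_supplied_from helpers are hypothetical, and a real triple store would use indexes and a query language such as SPARQL rather than a linear scan. It answers a question about an aspect of a collection of entities, instead of returning a list of documents.

```python
# Hypothetical facts stored as subject-predicate-object triples.
triples = {
    ("ex:widget", "ex:suppliedBy", "ex:acme"),
    ("ex:gadget", "ex:suppliedBy", "ex:globex"),
    ("ex:acme", "ex:locatedIn", "ex:India"),
    ("ex:globex", "ex:locatedIn", "ex:Germany"),
    ("ex:acme", "ex:label", "Acme Corp"),  # an entity-to-literal statement
}

def match(s=None, p=None, o=None):
    """Yield triples that fit the pattern; None acts as a wildcard."""
    for t in triples:
        if all(want is None or want == got for want, got in zip((s, p, o), t)):
            yield t

def products_supplied_from(country):
    """Which products have a supplier located in the given country?"""
    suppliers = {s for s, _, _ in match(p="ex:locatedIn", o=country)}
    return sorted(s for s, _, o in match(p="ex:suppliedBy") if o in suppliers)

print(products_supplied_from("ex:India"))
```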

This kind of information infrastructure will enable dynamic exploration of the context and interactions between entities. It will also enable punctualization (to borrow terminology from Actor-Network Theory), where an entire network of entities can be studied as a whole, as if it behaved like a single entity. This is what I would highlight as the flexibility of granularity in Information Analytics (or Big Data Analytics, if you like it that way).
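
One way to read punctualization in graph terms is to collapse a chosen set of entities into a single node. The Python sketch below is only one such interpretation; the punctualize function, the ex: names and the sample data are assumptions. Connections inside the group disappear from view, while connections to the outside are re-attached to the group node.

```python
def punctualize(triples, members, group):
    """Collapse every entity in `members` into the single node `group`."""
    def rename(node):
        return group if node in members else node

    collapsed = set()
    for s, p, o in triples:
        s2, o2 = rename(s), rename(o)
        if s2 == group and o2 == group:
            continue  # an internal connection, hidden at this granularity
        collapsed.add((s2, p, o2))
    return collapsed

graph = {
    ("ex:alice", "ex:worksWith", "ex:bob"),
    ("ex:alice", "ex:advises", "ex:clientX"),
    ("ex:bob", "ex:advises", "ex:clientY"),
}

team_view = punctualize(graph, {"ex:alice", "ex:bob"}, "ex:teamA")
for triple in sorted(team_view):
    print(triple)
```

The original graph is left untouched, so the same data can be viewed at the level of individuals or at the level of the team.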


Comments

Good take, John. But I guess you are at least 5 years ahead of the problem that Big Data analytics practitioners would want to resolve urgently. One, data-model designs are not as inelastic as you suggest; and two, granular data visibility is needed, for which models or representations are made to cross-link/talk. I am a supporter of RDF too, but not in the analytics context. Please post about the research challenges that you all are working on in the analytics space.

Regards,
Shashank.
