Big Data Analytics: The role of granularity of the information model
What I'd like to describe is the importance of having an information model that can unify both unstructured content and structured data, and a few principles for designing or selecting IT infrastructure to support this new model.
Tim Berners-Lee's Linked Open Data (abbreviated as LOD) espouses a vision for Information Management based on RDF. RDF is a language for conceptual description, or information modeling. It is important to distinguish between the logical structure of a Subject-Predicate-Object (SPO) expression and the many serializations of RDF, such as RDF/XML, N3, or Turtle.
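As a sketch of that distinction: the same logical SPO statement can be rendered in any serialization. Here is one triple emitted as N-Triples (the simplest serialization); all URIs below are invented for illustration.

```python
# A minimal sketch: one RDF statement as a Subject-Predicate-Object triple,
# independent of any particular serialization. URIs here are hypothetical.
triple = (
    "http://example.org/person/TimBernersLee",   # Subject
    "http://example.org/ontology/invented",      # Predicate
    "http://example.org/thing/WorldWideWeb",     # Object
)

def to_ntriples(s, p, o):
    """Render the logical SPO statement in one concrete serialization
    (N-Triples); Turtle or RDF/XML would express the identical triple."""
    return f"<{s}> <{p}> <{o}> ."

print(to_ntriples(*triple))
```

The point is that the triple itself, not any one syntax, is the unit of information.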
Why is LOD important in the context of 'Big Data Analytics'?
LOD espouses three primary principles:
- Make Information Identifiable - use HTTP URIs as identifiers for all entities.
- Make Information Open and Shareable - store all information in RDF and expose as a service on an HTTP endpoint. (Remember the whole idea of Services in SOA!)
- Make Information Linkable - Re-use URIs when you refer to the same entity. Build interlinks to bridge conceptual worlds across networks and systems.
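The third principle above, re-using URIs, is what makes linking work: two independently produced datasets that refer to the same entity by the same HTTP URI connect automatically when merged. A hedged sketch, with all URIs invented:

```python
# Two independent datasets refer to the same entity with the same HTTP URI,
# so their triples link up automatically when merged. URIs are hypothetical.
dataset_a = {
    ("http://example.org/id/Paris", "http://example.org/prop/capitalOf",
     "http://example.org/id/France"),
}
dataset_b = {
    ("http://example.org/id/Paris", "http://example.org/prop/population",
     '"2243833"'),  # a literal value
}

merged = dataset_a | dataset_b
# Everything known about Paris, gathered across both sources via the shared URI:
about_paris = {(p, o) for (s, p, o) in merged
               if s == "http://example.org/id/Paris"}
print(about_paris)
```

No join keys were negotiated between the two publishers; the shared identifier is the join.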
Three reasons - flexibility, flexibility, and flexibility. RDF is schemaless and offers the right model for dealing with dynamic information at massive scale. To design a schema around information, you always make two assumptions: a) that you know enough about a domain, and b) that the domain of knowledge itself remains fairly static. The second is hardly ever true, even if you are a 'domain expert' - a phrase and a claim that I have found most salesmen make.
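That schemalessness can be sketched in a few lines: when the domain changes, accommodating a new kind of fact is just one more triple, not a schema migration (the names below are invented):

```python
# Sketch of RDF's schemalessness: a graph is just a growing set of triples.
graph = [("ex:sensor1", "ex:reading", '"21.5"')]

# Later, the domain changes: sensors gain a calibration date. In a relational
# schema this would mean ALTER TABLE and a migration; here it is one more triple.
graph.append(("ex:sensor1", "ex:calibratedOn", '"2013-05-01"'))

print(len(graph))
```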
Entity-Oriented Models and Indexing of Information
Earlier I mentioned that the Big Data information model must be able to unify the structured world (schemas and relational databases) and the unstructured world of posts, web pages, and documents. RDF fills this requirement neatly.
Unstructured text today is primarily processed by Information Retrieval technology - understood by the masses as 'Search' - while relational and XML databases represent the structured data world. Search relies on an information model based on documents and words (or phrases). RDF provides a graph-based data model, without mandating a schema, in which entities are the key elements of the information model. The difference between words and entities is that an entity may be known by many words or phrases.
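That word/entity distinction can be sketched with a tiny alias table: many surface forms resolving to one entity URI. The names and URI below are hypothetical.

```python
# In word-driven IR, "IBM", "Big Blue", and "International Business Machines"
# are three unrelated terms; in an entity-oriented model they resolve to one
# URI. The alias table is invented for illustration.
ALIASES = {
    "ibm": "http://example.org/id/IBM",
    "big blue": "http://example.org/id/IBM",
    "international business machines": "http://example.org/id/IBM",
}

def resolve(mention):
    """Map a textual mention to its entity URI, or None if unknown."""
    return ALIASES.get(mention.lower())

print(resolve("Big Blue"))  # the same URI that resolve("IBM") returns
```

A real system would use entity extraction and disambiguation rather than a literal lookup table, but the model is the same: mentions in text become references to entities.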
The implications are many and useful. Today, search results provide a list of documents or web resources relevant to your query, and relevance is primarily 'word' driven. Using RDF, you can break open the documents and permit information use at the level of entities, which are more fine-grained.
Some Search users (especially the ones targeted by Enterprise search technology) have precise information needs that are not well served by systems that rely on words in documents as their smallest unit of information. These users want answers, and these answers are typically about some aspect or feature of a collection of entities.
Therefore, what we need is the ability to uniquely identify (or simply tag) entities present in content and data in a consistent way; to store information using the SPO model espoused by RDF, capturing the various connections between entities (or between entities and literals) so as to build a massive connected network, or information graph; and to make these entities findable and reference-able on the web or within some restricted information space.
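A toy sketch of such an SPO store makes the entity-level granularity concrete: a wildcard pattern query answers a question about one aspect of one entity, where a document index could only return whole documents. All names below are hypothetical.

```python
# A toy triple store illustrating entity-level queries over an SPO graph.
triples = [
    ("ex:AcmeCorp", "ex:headquarteredIn", "ex:London"),
    ("ex:AcmeCorp", "ex:founded",         '"1999"'),
    ("ex:AcmeCorp", "ex:mentionedIn",     "ex:doc42"),
    ("ex:London",   "ex:locatedIn",       "ex:UK"),
]

def query(s=None, p=None, o=None):
    """Match triples against an SPO pattern; None acts as a wildcard."""
    return [(ts, tp, to) for (ts, tp, to) in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

# "Where is AcmeCorp headquartered?" answered at entity granularity:
print(query(s="ex:AcmeCorp", p="ex:headquarteredIn"))
```

This is the same pattern-matching idea that SPARQL generalizes; the point here is only that the unit of retrieval is a fact about an entity, not a document.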
This kind of information infrastructure will enable dynamic exploration of context and interaction between entities, and also enable punctualization (to borrow terminology from Actor-Network Theory), where entire networks of entities can be studied as a whole, behaving like a single entity. This is what I would highlight as the flexibility of granularity in Information Analytics (or Big Data Analytics, if you like it that way).
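Punctualization, as described above, can be sketched over a triple graph: collapse a chosen network of entities into one composite node, dropping the network's internal edges and rewriting the edges that cross its boundary. All URIs are invented for illustration.

```python
# Sketch of punctualization: treat a whole network of entities as a single
# node by rewriting its members to one composite identifier.
triples = {
    ("ex:cpu",   "ex:partOf", "ex:laptop"),
    ("ex:disk",  "ex:partOf", "ex:laptop"),
    ("ex:alice", "ex:owns",   "ex:laptop"),
}
network = {"ex:cpu", "ex:disk", "ex:laptop"}  # the network to punctualize

def punctualize(graph, members, name):
    """Collapse `members` into a single node called `name`."""
    collapse = lambda n: name if n in members else n
    # Drop edges internal to the network; rewrite edges crossing its boundary.
    return {(collapse(s), p, collapse(o)) for (s, p, o) in graph
            if not (s in members and o in members)}

print(punctualize(triples, network, "ex:LaptopSystem"))
# The whole network now behaves as one entity that ex:alice owns.
```

Zooming back in is simply a matter of querying the original graph, which is the flexibility of granularity the paragraph above describes.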