Building real-world solutions with Semantic Technology - putting the pieces together
1. There is a lot of interest in semantic technology, and that has led to some hype. Nevertheless, semantic technology capabilities are vital as we attempt to develop solutions that help users make sense of the information available to them. One of the main reasons for disappointment and disillusionment is the lack of a clear articulation of goals and of an understanding of the business value of semantic technology in a specific business or information-processing context. Arriving at that articulation, I believe, is the necessary first step in any enterprise initiative around semantic technology.
The best way to arrive at business value is to describe user queries and interaction scenarios based on a clear information-seeking goal.
For example, when traveling from India to New York, how can a user specify, as part of the search, that certain cities or countries should be avoided for transit? Look for use-cases that require traversal of the relational structures within a domain, or that require some inference from simple assertions in content. That brings us to an important question: what precisely distinguishes traditional complex IT applications that rely on structured data, say in RDBMSs, from applications that use semantic technology? What are the criteria for choosing between these two alternatives? The scenarios should amplify the role of semantic enrichment and metadata in delivering precise search answers. For example, finding all the directors who have worked with a certain movie actor requires extracting 'director' relations from movies and extracting the person in the 'director' filler of each relation.
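To make the director example concrete, here is a minimal sketch of the relational traversal involved. The facts, relation names and helper function are invented for illustration; a real system would extract these triples from text and store them in a semantic repository.

```python
# Each movie contributes extracted relations of the form (subject, relation, object).
facts = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Inception", "stars", "Leonardo DiCaprio"),
    ("Titanic", "directed_by", "James Cameron"),
    ("Titanic", "stars", "Leonardo DiCaprio"),
    ("Memento", "directed_by", "Christopher Nolan"),
    ("Memento", "stars", "Guy Pearce"),
]

def directors_for_actor(actor):
    """Traverse actor -> movie -> director: the relational walk that a
    purely keyword-based index cannot perform."""
    movies = {s for (s, r, o) in facts if r == "stars" and o == actor}
    return {o for (s, r, o) in facts if r == "directed_by" and s in movies}

print(sorted(directors_for_actor("Leonardo DiCaprio")))
# ['Christopher Nolan', 'James Cameron']
```

The point of the sketch is the two-hop traversal: the answer is not stated in any single document, but follows from combining 'stars' and 'directed_by' relations.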
Two problems with existing search technologies, which are primarily based on a statistical information model, are:
· Mis-formulation: the incomplete expression of user goals in the user query, and
· Mis-conception: the complete lack of context in processing search results.
Semantic technology provides the right relation-based model for modeling information. However, in order to build complete content solutions, other related technology elements are also vital - text analytics, document classification and indexing, search, and user interface design and visualization.
2. Text analytics is a broad field; for content analytics, what is required is a subset of text analytics called information extraction, which involves extracting structure from unstructured text. This includes named entity extraction, relation extraction, and sentiment or opinion mining. It is important to keep in mind that information extraction research clearly stays away from the lofty goal of Natural Language Understanding (NLU), which attempts to build machines that can completely interpret text. The best minds in Artificial Intelligence have tried, and have not fully succeeded, to build a general-purpose NLU machine. Along with other factors, this has to do with the fundamental ambiguity and complexity of interpreting language. For example, one cannot build a system that resolves whether the sentence "I saw a boy with the telescope" means "I saw a boy carrying a telescope" or "I saw a boy through the telescope". Also, state-of-the-art text extraction engines do not attempt inference, since that involves modeling common sense and specific domain knowledge (if that is possible at all).
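To illustrate what relation extraction looks like at its simplest, here is a toy pattern-based extractor. The pattern and the sentences are invented for this sketch; production engines use far richer linguistic analysis, but the output - structured triples from flat text - is the same in kind.

```python
import re

# A single hand-written pattern for one relation type. Real extractors combine
# many patterns, parsers and statistical models.
DIRECTED_BY = re.compile(
    r"(?P<film>[A-Z][\w ]*?) was directed by "
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"
)

def extract_director_relations(text):
    """Return (film, 'directed_by', person) triples found by the pattern."""
    return [(m.group("film"), "directed_by", m.group("person"))
            for m in DIRECTED_BY.finditer(text)]

text = "Jaws was directed by Steven Spielberg. Alien was directed by Ridley Scott."
print(extract_director_relations(text))
# [('Jaws', 'directed_by', 'Steven Spielberg'), ('Alien', 'directed_by', 'Ridley Scott')]
```

Note how modest the goal is: the extractor recovers explicitly stated relations and makes no attempt to interpret the text as a whole, exactly the distinction drawn above between information extraction and NLU.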
Infosys' specific research in this area is led by Dr. Lokendra Shastri, who heads the Center for Knowledge driven Information Systems. Lokendra has conceived the Infosys Semantic Extraction Engine (iSEE), which attempts to combine common-sense reasoning with a domain-specific lexicon, semantic processing using linking rules, and embodied semantics to not only extract what is explicitly mentioned in the text but also derive useful conclusions from it. The center also works on scalable realizations of rich semantic repositories and semantic search over such repositories.
3. Another area that requires clarity in content solutions is the differential role of taxonomic classification versus information extraction. A taxonomy represents one possible arrangement of the terms in a domain, organized into a hierarchy based on the information granularity each term conveys. This works well in domains where the information is already well coded or lends itself to such classification, like healthcare, life sciences or even the legal domain. However, even in these scenarios a major challenge is to constantly maintain the taxonomy to keep abreast of some subset of the real world. The other issue is that a taxonomy represents only one arrangement of the knowledge we have in a domain. It simply defines information buckets to put documents into and does not capture named relations between entities in the domain. This is where richer forms of knowledge organization like ontologies are useful. However, ontologies are heavy machinery, and in some content applications what is really required is a formal model of the major concepts and the relations between them, along with relations like mutual exclusivity, synonymy and subsumption. This is more like a semantic concept network. We use the term domain knowledge model for this formal definition of the domain terms. Of course, there are many challenges related to the semantics of these models and the associated inference. My take, from the perspective of building practical systems, is that while we want to rely on and respect all the progress the research community has made on OWL semantics and inference, I do not agree with the current practice of adopting an open-world assumption for all forms of inference. My personal observation is that you need systems that combine both open- and closed-world reasoning. Open world is best suited for subsumption reasoning over the concept descriptions in OWL.
However, when dealing with instance reasoning, I find that the closed-world assumption is a lot closer to the requirements and understanding of business systems. This could be the observation of an applied researcher corrupted by his experience of building database-based systems in the past. Other researchers have also observed the deficiencies of tableau-based reasoning algorithms when dealing with large data sets. I stumbled on the idea of relying on existing data retrieval techniques (read SQL) to build scalable instance reasoning. Later I found that this line of thought has been well studied by researchers at Karlsruhe, who developed the KAON system. For the scope of this post, it is sufficient to summarize that a taxonomy works at the granularity of a document, while ontologies or knowledge models work at the richer granularity of named relations and entity types.
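A minimal sketch of the split argued for above - open-world-style subsumption over the concept hierarchy, combined with closed-world instance retrieval over a SQL store. The concept names, the schema and the data are invented examples, and this is a drastic simplification of what systems like KAON actually do.

```python
import sqlite3

# Toy concept hierarchy: child -> parent (invented domain terms).
SUBCLASS = {"Cardiologist": "Physician", "Physician": "HealthcareWorker"}

def subsumers(concept):
    """Subsumption reasoning: walk the concept hierarchy upward."""
    chain = [concept]
    while chain[-1] in SUBCLASS:
        chain.append(SUBCLASS[chain[-1]])
    return chain

# Instance assertions live in an ordinary relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instance_of (entity TEXT, concept TEXT)")
conn.executemany("INSERT INTO instance_of VALUES (?, ?)",
                 [("dr_rao", "Cardiologist"), ("nurse_kim", "HealthcareWorker")])

def instances(concept):
    """Closed-world instance retrieval: an entity belongs to `concept` iff one
    of its asserted types is subsumed by it; absent facts count as false."""
    rows = conn.execute("SELECT entity, concept FROM instance_of").fetchall()
    return sorted(e for e, c in rows if concept in subsumers(c))

print(instances("Physician"))         # ['dr_rao']
print(instances("HealthcareWorker"))  # ['dr_rao', 'nurse_kim']
```

The division of labour is the point: the hierarchy walk supplies the ontological inference, while the database does what it is good at - scanning and filtering asserted instances under a closed-world reading.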
4. Even if you had all the above technology components - to model knowledge, to extract structure from text, to store and index documents and facts - you would still need to think about how the user is expected to interact with the system. I see two independent processes or workflows in content analytics. The first is content enrichment, which creates richer structure and indexes from existing content annotated against some formal domain model. The second is the actual process by which users leverage this to satisfy their information goals. We now need to move from content enrichment in semantic repositories to semantic processing of user keywords - what I call keyword-based semantic search.
There is an impedance mismatch between the expected technique for retrieving information from semantic repositories and what users actually provide. Users are accustomed to keyword-based search as exemplified by Google and Bing. Semantic repositories expect structured queries (like SPARQL) in order to retrieve information. In this sense semantic repositories are analogous to database systems. This immediately implies that you could use such ontology-based systems for analytics even over structured data that today requires data warehouses based on RDBMS technology. The pros and cons of this approach demand a separate post. One simple reason to consider semantic technology for business intelligence over structured data is to achieve complete de-coupling between the query layer and the physical data layer. This means that users of semantic queries do not worry about the actual location of data, and the results could even be federated from multiple physical data stores, even outside the enterprise boundary. The second reason is the de-coupling between the schema and the physical data model in storage. However, this usage scenario assumes that you have someone typing queries in a language like SPARQL. This surfaces the problem of a semantic gap in any semantic technology system, since users cannot be expected to completely and precisely express their information goals in structured queries. The responsibility of a semantic search component is to process the user-supplied keywords against the semantic knowledge and data, and generate a ranked set of valid and relevant system queries in SPARQL that will then be evaluated over the semantic repository. This is our focus in semantic search.
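A toy sketch of the keyword-to-SPARQL step described above. The concept-to-pattern table, the `ex:` prefix and the property names are all invented for illustration; a real component would generate and rank many candidate interpretations rather than a single query.

```python
# Map recognised keywords onto triple patterns from an (invented) domain model.
CONCEPTS = {
    "director": "?film ex:directedBy ?director",
    "actor": "?film ex:stars ?actor",
}

def keywords_to_sparql(keywords):
    """Turn recognised keywords into a candidate SPARQL query string.
    Unknown keywords are simply ignored in this sketch."""
    patterns = [CONCEPTS[k] for k in keywords if k in CONCEPTS]
    if not patterns:
        return None
    body = " .\n  ".join(patterns)
    return ("PREFIX ex: <http://example.org/>\n"
            "SELECT * WHERE {\n  %s .\n}" % body)

print(keywords_to_sparql(["director", "actor"]))
```

Even this crude mapping shows where the real difficulty lies: deciding which concepts the keywords denote, how the resulting patterns should join, and how to rank the competing interpretations - the bridge across the semantic gap.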
5. Finally, there is much work to be done in the area of user interface design based on the richer information model and capabilities provided by all of the above technologies working together. The problem is that we are so conditioned by what we see and use regularly that a lot of existing systems simply mimic Google and return the top ten documents in response to a user query. Even returning a graphical visualization of relations in the repository is naïve, unless some thought is given to how the available knowledge is put to effective use in helping users quickly move from general results to very specific answers, or explore the semantic space of relatedness.