The Infosys Labs research blog tracks trends in technology with a focus on applied research in Information and Communication Technology (ICT)


May 24, 2012

The 'search results page' makeover

There was a time, about 10 years ago, when all that a 'search results page' needed to show, to be impressive, was a list of links to web pages. But increasing user expectations, and the drive to provide an efficient path to the user's information goal, have led to many changes in search results pages over the years.

The modifications started with snippets of matching text appearing below each link, with the matched words shown in bold. Later we saw images, maps and videos making it to the search results page, along with links to news, blogs and books. Suggestions for related searches came up as well... all of this in an effort to make the search experience more efficient; the idea is that the user should need to spend much less time to get the information he/she is looking for.

With the industry focus on 'structured information' and semantic search, the search results page is undergoing further modifications. Recently Google blogged that it will soon show a 'Knowledge Graph' to enhance the Google Search experience. At Infosys too, we do research and build solutions for engaging visualizations of information (structured as well as unstructured).

Our idea is to provide information not merely as documents, but in a structured form - like a table, a list or a graph. The appeal of such visualizations lies in the following -

  • They save users from having to read large amounts of text (while also dealing with redundancy)
  • They are direct responses which can be consumed very easily by machines as well as humans
  • A picture is worth a thousand words
  • These visualizations are extremely engaging and quite interactive.

 

And such visualizations will surely define the next makeover of the search results page!

"Semantic Search" - understanding what it is

We all use a search engine like Google to get what we are looking for. We type a keyword and get pointed to resources that contain mentions of that keyword, well ordered based on multiple relevance parameters. This is a typical "statistical" search...

 

But what if you had to search for something like "Directors of Kate Winslet"? A statistical search engine does not understand the information goal; it merely returns the results which match the keywords. The top results seen on a statistical search engine page are not about 'people who have directed Kate Winslet' but about 'Kate Winslet's director husband', simply because the user's keywords match these texts as well. The other shortcoming is that the results are essentially web pages which need to be read manually to find answers to the information goal. Then there is also a repetition of information across web pages.

 

Let me now introduce "Semantic Search". The literal meaning of the word 'semantics' is 'meaning'. And thus, 'semantic search' translates to 'meaning-based search'.

 

Semantic search aims to address these issues by applying semantics, or meaning, in two places - by understanding the meaning of the keywords themselves and by understanding the searchable content. It further integrates the information from the various content resources. Thus the outcome of the same keyword query will be more semantically relevant results with no repetition. Further, the results will be presented in a more direct format, be it a table, a graph or a list. These structured formats are easily consumed by humans and machines alike.
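To make this concrete, here is a minimal sketch of how a semantic store could answer 'Directors of Kate Winslet' directly, rather than as a list of documents. It uses the open-source rdflib library and a made-up movie vocabulary; the URIs and facts are illustrative assumptions, not any production system.

    from rdflib import Graph, Namespace, RDF

    EX = Namespace("http://example.org/movies/")  # hypothetical vocabulary

    g = Graph()
    # A few subject-predicate-object facts about movies
    g.add((EX.Titanic, RDF.type, EX.Movie))
    g.add((EX.Titanic, EX.directedBy, EX.JamesCameron))
    g.add((EX.Titanic, EX.hasActor, EX.KateWinslet))
    g.add((EX.TheReader, RDF.type, EX.Movie))
    g.add((EX.TheReader, EX.directedBy, EX.StephenDaldry))
    g.add((EX.TheReader, EX.hasActor, EX.KateWinslet))

    # 'Directors of Kate Winslet' becomes a structured query over relations,
    # not a keyword match over documents
    q = """
    PREFIX ex: <http://example.org/movies/>
    SELECT DISTINCT ?director WHERE {
        ?movie ex:hasActor ex:KateWinslet ;
               ex:directedBy ?director .
    }
    """
    for row in g.query(q):
        print(row.director)  # distinct directors, no repetition across pages

The result set itself is structured, so it can be rendered directly as a list or table on the results page.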

 

What I explained just now is the eventual aim of semantic search, and the industry has made some substantial advances towards it. A lot of work is going on; major search engines are spending a lot of money to semantify the search experience. This can be seen in Google's 'Best Guess', abbreviation handling and synonym understanding features.

 

It's a very active and interesting space. We at Infosys Labs are excited about this and are conducting research and conceptualizing solutions for semantic search. Our research is around text analytics, semantic search and visualization.

April 3, 2012

Google SERP's new 'semantic' feature

Google quietly introduced an exciting feature recently on its SERP (Search Engine Results Page): the 'Best Guess' feature. It is wonderful because it looks very much like question answering over common knowledge and general relations.
Google can now tell you the name of a spouse (of celebrities, of course), names of children, the CEO of a company, someone's birthday, the capital of a country... and these are not wrapped in documents but picked out and placed as a Best Guess at the top of the SERP. This is very smart.

Example:
Query: director of Titanic
Best guess for Titanic Director is James Cameron
Mentioned on at least 8 websites including wikipedia.org, imdb.com and answers.com

Query: children of barack obama
Best guess for Barack Obama Children is Natasha Obama, Malia Ann Obama
Mentioned on about.com

And it also tells you where it picked up these guesses from, in the form of 'Mentioned on xx websites including a.com, b.com' and so on.

This feature is looking quite exciting. It will certainly change the way people search the web and what they expect from the web search engines.

This feature has been around for about a year (and maybe more) but has not garnered a lot of attention yet; maybe because the guesses Google makes (so far) are really common knowledge and probably do not help an information seeker a lot. Or maybe because the information seeker already knows the website that gives him guaranteed information and does not follow the search engine route to get there. As of now, for me, it is more of a plaything than a smart 'answering' mechanism. But I am hopeful that this feature will be enriched and will evolve further in future.

I gave some thought to how Google might be doing this. My guess is that it extracts facts from socially trusted sources (like Wikipedia) and builds a database of important relations. But this is just a thought.
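To illustrate that thought, here is a toy sketch of such an extraction-and-lookup pipeline. It is purely speculative: the source sentence, the regular expression and the relation store are my own illustration, not how Google actually works.

    import re

    # A sentence as it might appear on a trusted source page (illustrative only)
    source_text = "Titanic is a 1997 film directed by James Cameron."

    # A naive pattern for one relation type; a real system would need far more
    pattern = re.compile(r"(?P<movie>[A-Z][\w ]+?) is a \d{4} film directed by (?P<director>[A-Z][\w ]+)")

    relations = {}
    match = pattern.search(source_text)
    if match:
        # Store the fact as (subject, relation) -> object, along with its source
        relations[(match.group("movie"), "directed_by")] = {
            "value": match.group("director"),
            "source": "wikipedia.org",
        }

    # Answering 'director of Titanic' is then a lookup, not a document search
    print(relations.get(("Titanic", "directed_by")))

Scaled up over many trusted pages and relation types, a table like this would be enough to produce a 'Best Guess' line together with the sources it was mentioned on.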

Comments welcome!

November 23, 2011

Big Data Analytics: The role of granularity of the information model

Gartner, McKinsey and other analysts have together popularized the opportunity of 'Big Data', often referring to the massive amount of content and data available today. It presents an opportunity to develop techniques for sifting and processing data at massive scale such that it can be consumed by users or even by systems and processes (machine-readable and human-readable). This is the opportunity for Big Data Analytics.

What I'd like to describe is the importance of having an information model that can unify both unstructured content and structured data, and a few principles for designing or selecting IT infrastructure to support this new model.

Tim Berners-Lee's vision of Linked Open Data (abbreviated as LOD) espouses a vision for Information Management based on RDF. Now, RDF is a language for conceptual description or information modeling. It is important to distinguish between the logical structure of a Subject-Predicate-Object (SPO) expression and the many serializations of RDF, like RDF/XML, N3 or Turtle.
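The distinction is easy to see in code. The sketch below, which assumes the open-source rdflib library and made-up URIs, adds one logical SPO statement and writes it out in three different serializations:

    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/resource/")  # hypothetical URIs

    g = Graph()
    # One logical SPO statement: <BarackObama> <spouse> <MichelleObama>
    g.add((EX.BarackObama, EX.spouse, EX.MichelleObama))

    # The same statement rendered in different concrete syntaxes
    print(g.serialize(format="turtle"))
    print(g.serialize(format="xml"))  # RDF/XML
    print(g.serialize(format="nt"))   # N-Triples

Whatever the wire format, the information model underneath remains the same graph of SPO statements.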

Why is LOD important in the context of 'Big Data Analytics'?
LOD espouses 3 primary things.
  1. Make Information Identifiable - use HTTP URIs as identifiers for all entities.
  2. Make Information Open and Shareable - store all information in RDF and expose it as a service on an HTTP endpoint. (Remember the whole idea of Services in SOA!)
  3. Make Information Linkable - re-use URIs when you refer to the same entity. Build interlinks to bridge conceptual worlds across networks and systems.
Why is a commitment to the RDF information model absolutely important?
3 reasons - Flexibility, Flexibility, and Flexibility. RDF is schemaless and offers the right model to deal with dynamic information at massive scale. To design a schema around information you always make two assumptions: a) that you know enough about the domain, and b) that the domain of knowledge itself remains fairly static. The second is hardly ever true, even if you are a 'domain expert' - a phrase and a claim that I have found most salesmen make.

Entity Oriented Models and Indexing of Information
Earlier I mentioned that the Big Data Information Model must be able to unify the structured world (schema and Relational Databases) and the Unstructured world of posts, web pages,  and documents. RDF fills this requirement neatly.

Unstructured text today is primarily processed by Information Retrieval technology - understood by the masses as 'Search' - while Relational and XML databases represent the structured data world. Search relies on an information model that is based on documents and words (or phrases). RDF provides a graph-based data model, without mandating a schema, that defines entities as the key elements of the information model. The difference between words and entities is that an entity may be known by many words or phrases.

The implications are many and useful - today search results provide a list of documents or web resources relevant to your query, and relevance is primarily 'word' driven. Using RDF, you can now break out of the document boundary and permit information use at the level of entities, which are more fine grained.
Some search users (especially the ones targeted by enterprise search technology) have precise information needs that are not very well served by systems that rely on words in documents as their smallest unit of information. These users want answers, and these answers are typically about some aspect or feature of a collection of entities.
Therefore what we need is the ability to uniquely identify (or simply tag) entities present in content and data in a consistent way, to store information using the SPO model espoused by RDF to capture the various connections between entities (or between entities and literals) so as to build a massive connected network or information graph, and to make these entities findable and reference-able on the web or within some restricted information space.
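A small illustration of this entity-level indexing, again assuming rdflib and invented URIs for documents and entities (the extraction step is taken as given):

    from rdflib import Graph, Namespace, Literal

    EX = Namespace("http://example.org/entity/")     # hypothetical entity URIs
    DOC = Namespace("http://example.org/document/")  # hypothetical document URIs

    g = Graph()
    # Suppose an extraction step has already tagged two entities in document 42
    g.add((DOC.d42, EX.mentions, EX.KateWinslet))
    g.add((DOC.d42, EX.mentions, EX.Titanic))
    # ...and recognized a relation between them, plus a literal-valued property
    g.add((EX.KateWinslet, EX.actedIn, EX.Titanic))
    g.add((EX.Titanic, EX.releaseYear, Literal(1997)))

    # An entity-level question: which documents mention an actor of Titanic?
    q = """
    PREFIX ex: <http://example.org/entity/>
    SELECT ?doc ?actor WHERE {
        ?actor ex:actedIn ex:Titanic .
        ?doc ex:mentions ?actor .
    }
    """
    for doc, actor in g.query(q):
        print(doc, actor)

The same graph supports both coarse, document-level answers and fine-grained, entity-level ones, which is exactly the flexibility of granularity discussed below.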

This kind of information infrastructure will enable dynamic exploration of context and interaction between entities, and will also enable punctualization (to borrow terminology from Actor-Network Theory), where entire networks of entities can be studied as a whole (as behaving like a single entity). This is what I would highlight as the flexibility of granularity in Information Analytics (or Big Data Analytics, if you like it that way).


May 27, 2011

Next Gen BI based on Semantic Technology

Here are my thoughts on how BI tools and technology can leverage semantic technology.

  • BI based on the relational model of data is no longer viable, with data complexity exceeding the limits afforded by RDBMS technology. Other than scale, there is also the issue that RDBMS queries are heavily coupled to the physical data organization, while semantic querying in SPARQL permits complete abstraction of the physical organization of data. This is extremely useful when executing BI queries over distributed data sources (a small sketch appears after this list).

  • Another important issue to consider is the opportunity to use inference capabilities provided by semantic repositories in BI analytics to draw conclusions from what is already stated  in the database.

  • The other is the ability to use semantic analysis and information extraction to populate knowledge bases based on semantic technology that can combine structured and unstructured data in a uniform, schema-less manner, providing the benefit of schema flexibility and information linking. This is very useful for information integration.

  • Scalable consistency checking within BI is now possible through the use of semantic technology. Rule-based consistency checking has been around for quite some time. However, semantic technology provides a standards-based stack of knowledge languages, data models, and inference engines to facilitate better adoption.
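As promised above, here is a minimal sketch of the decoupling point, using rdflib and a made-up sales vocabulary: the SPARQL query names only concepts and relations, so it runs unchanged whether the triples sit in one store or are federated from several physical sources.

    from rdflib import Graph, Namespace, Literal

    EX = Namespace("http://example.org/sales/")  # hypothetical BI vocabulary

    g = Graph()
    # These facts could have been loaded from any number of physical sources
    g.add((EX.order1, EX.customer, EX.Acme))
    g.add((EX.order1, EX.amount, Literal(1200)))
    g.add((EX.order2, EX.customer, EX.Acme))
    g.add((EX.order2, EX.amount, Literal(800)))

    # 'Total order amount per customer', expressed against concepts, not tables
    q = """
    PREFIX ex: <http://example.org/sales/>
    SELECT ?customer (SUM(?amt) AS ?total) WHERE {
        ?order ex:customer ?customer ;
               ex:amount ?amt .
    } GROUP BY ?customer
    """
    for customer, total in g.query(q):
        print(customer, total)

Nothing in the query refers to a table layout, a join path or a server location, which is the kind of physical independence the first bullet argues for.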

January 11, 2011

Building real world solutions with Semantic Technology - putting the pieces together

My colleagues and I at the Center for Knowledge driven Information Systems (CKDIS), part of Infosys Applied Research & Development labs, have been studying challenges and gaps in the current technology stack for large-scale industrial deployment of semantic technology and text analytics, especially as it relates to content analytics. My role within CKDIS is to directly lead research, manage research outcomes and incorporate our understanding from research into large-scale solution architectures for industrial deployments that enable business.

1. There is a lot of interest in semantic technology, and that has led to some hype. Nevertheless, semantic technology capabilities are vital as we attempt to develop solutions that help users make sense of the information available to them. A lack of clear articulation of goals, and of an understanding of the business value of semantic technology in a specific business or information processing context, is one of the main reasons for disappointment and disillusionment. Articulating that value is, I believe, the necessary first step in any enterprise initiative around semantic technology.

The best way to arrive at business value is to describe user-queries and interaction scenarios based on a clear information seeking goal.

For example, in traveling from India to New York, how can one specify, as part of the search, that certain cities or countries should be avoided for transit? Look for use-cases that require traversal of the relational structures within a domain, or that require some inference from simple assertions in content. That brings us to an important question about the precise difference between traditional complex IT applications that rely on structured data, say in RDBMSs, and applications that use semantic technology. What are the criteria under which to choose between these two alternatives? The scenarios should amplify the role of semantic enrichment and metadata in delivering precise search answers. For example, when the user wants to find all the directors who have worked with a certain movie actor, it requires extracting 'director relations' from movies and extracting the person in the 'director' filler of each relation.
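As a rough illustration of the kind of relational traversal the travel example calls for, the following sketch expresses 'routes from Delhi to New York that do not transit Dubai' as a graph query. The flight vocabulary and data are invented for the example, using rdflib.

    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/travel/")  # hypothetical travel vocabulary

    g = Graph()
    # Two candidate one-stop routes from Delhi to New York
    g.add((EX.route1, EX.origin, EX.Delhi))
    g.add((EX.route1, EX.transit, EX.Dubai))
    g.add((EX.route1, EX.destination, EX.NewYork))
    g.add((EX.route2, EX.origin, EX.Delhi))
    g.add((EX.route2, EX.transit, EX.Frankfurt))
    g.add((EX.route2, EX.destination, EX.NewYork))

    # The transit constraint becomes part of the query itself
    q = """
    PREFIX ex: <http://example.org/travel/>
    SELECT ?route WHERE {
        ?route ex:origin ex:Delhi ;
               ex:destination ex:NewYork .
        FILTER NOT EXISTS { ?route ex:transit ex:Dubai }
    }
    """
    for (route,) in g.query(q):
        print(route)  # only route2 qualifies

A keyword engine has no notion of 'transit' as a relation to be negated, which is precisely where the relational model earns its keep.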

Two problems with existing search technology, which is primarily based on a statistical information model, are:

·         Mis-formulation - the incomplete formulation of user goals in the user query, and

·         Mis-conception - the complete lack of context in processing search results.

Semantic technology provides the right relation-based model for modeling information. However, in order to build complete content solutions, other related technology elements are also vital - like text analytics, document classification and indexing, search, user interface design and visualization.

2. Text analytics is a broad field; in relation to content analytics, what is required is a subset of text analytics called information extraction, which involves extracting structure from unstructured text. This includes named entity extraction, relation extraction and sentiment or opinion mining. It is important to keep in mind that information extraction research clearly stays away from the lofty goal of Natural Language Understanding (NLU), which attempts to build machines that can completely interpret text. The best minds in Artificial Intelligence have tried and have not fully succeeded in building a general-purpose NLU machine. Along with other factors, this has to do with the fundamental ambiguity and complexity of interpreting language. For example, one cannot build a system that reliably decides whether the sentence "I saw a boy with the telescope" means "I saw a boy carrying a telescope" or "I saw a boy through the telescope". Also, state-of-the-art text extraction engines do not attempt inference, since that involves modeling common sense and specific domain knowledge (if that is possible at all).
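A small example of the narrower information-extraction goal - named entity extraction - is sketched below with the open-source spaCy library (not any Infosys engine); the model name is an assumption and has to be downloaded separately.

    import spacy

    # Assumes: pip install spacy && python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    text = "James Cameron directed Titanic, which starred Kate Winslet."
    doc = nlp(text)

    # Named entity extraction: typed surface spans such as PERSON or WORK_OF_ART
    for ent in doc.ents:
        print(ent.text, ent.label_)

Relation extraction and opinion mining are built on top of annotations like these; none of it attempts full natural language understanding.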

Infosys' specific research in this area is led by Dr. Lokendra Shastri, who heads the Center for Knowledge driven Information Systems. Lokendra has conceived the Infosys Semantic Extraction Engine (iSEE), which attempts to combine common-sense reasoning with a domain-specific lexicon, semantic processing using linking rules, and embodied semantics, to not only extract what is explicitly mentioned in the text but also derive useful conclusions from it. The center also works on scalable realizations of rich semantic repositories and semantic search over such repositories.

3. Another area that requires clarity in content solutions is the differential role of taxonomic classification versus information extraction. A taxonomy represents one possible arrangement of the terms in a certain domain, organized into a hierarchy based on the information granularity each term conveys. This works well in domains where the information is already well coded or lends itself to such classification, like healthcare, life sciences or even the legal domain. However, even in these scenarios a major challenge is to constantly maintain the taxonomy to keep abreast of some subset of the real world. The other issue is that a taxonomy represents only one arrangement of the knowledge we have in a domain. It simply defines information buckets to put documents into and does not capture named relations between entities in the domain. This is where richer forms of knowledge organization, like ontologies, are useful. However, ontologies are heavy machinery, and in some content applications what is really required is a formal model of the major concepts and the relations between them, along with other relations like mutual exclusivity, synonymy and subsumption. This is more like a semantic concept network. We use the term domain knowledge model for this formal definition of the domain terms. Of course, there are many challenges related to the semantics of these models and the associated inference. My take, from the perspective of building practical systems, is that we should rely on and respect all the progress the research community has made in relation to OWL semantics and inference. However, I do not agree with the current practice of adopting the open-world assumption for all forms of inference. My personal observation is that you need systems that combine both open- and closed-world reasoning. Open world is best suited for subsumption reasoning on the concept descriptions in OWL. However, when dealing with instance reasoning, I find that the closed-world assumption is a lot closer to the requirements and understanding of business systems. This could be the observation of an applied researcher corrupted by his experience of building database-based systems in the past. There are other researchers who have observed the deficiencies of tableau-based reasoning algorithms when dealing with large data sets. I stumbled on the idea of relying on existing data retrieval techniques (read SQL) to build scalable instance reasoning. Later I found that this line of thought has been well studied by researchers at Karlsruhe who developed the KAON system. For the scope of this post, it is sufficient to summarize that a taxonomy works at the granularity of a document, while ontologies or knowledge models work at the richer granularity of named relations and entity types.
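To make the granularity contrast concrete, here is a toy domain knowledge model sketched with rdflib and the RDFS vocabulary; the URIs are invented for illustration. The taxonomy part only gives buckets to file documents under, while the knowledge-model part names relations between entity types and supports instance-level facts.

    from rdflib import Graph, Namespace, RDF, RDFS

    EX = Namespace("http://example.org/health/")  # hypothetical domain vocabulary

    g = Graph()
    # Taxonomy-style bucket: a topic hierarchy for classifying documents
    g.add((EX.Oncology, RDFS.subClassOf, EX.Medicine))

    # Knowledge-model additions: entity types and a *named* relation between them
    g.add((EX.Drug, RDF.type, RDFS.Class))
    g.add((EX.Disease, RDF.type, RDFS.Class))
    g.add((EX.treats, RDF.type, RDF.Property))
    g.add((EX.treats, RDFS.domain, EX.Drug))
    g.add((EX.treats, RDFS.range, EX.Disease))

    # An instance-level fact that a pure taxonomy has no way to express
    g.add((EX.Imatinib, EX.treats, EX.ChronicMyeloidLeukemia))

    print(g.serialize(format="turtle"))

Whether such a model is reasoned over under open- or closed-world assumptions is the separate question discussed above.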

4. Even if you had all the above technology components to model knowledge, to extract structure from text, and to store and index documents and facts, you would still need to think about how the user is expected to interact with the system. I see two independent processes, or workflows, in content analytics. The first covers content enrichment, which creates richer structure and indexes from existing content annotated against some formal domain model. The second involves the actual process of users leveraging this to satisfy their information goals. We now need to move from content enrichment in semantic repositories to semantic processing of user keywords - what I call Keyword based Semantic Search.
There is an impedance mismatch between the expected technique for retrieving information from semantic repositories and what users actually provide. Users are accustomed to keyword-based search as exemplified by Google and Bing. Semantic repositories expect structured queries (like SPARQL) in order to retrieve information. In this sense semantic repositories are analogous to database systems. This implies immediately that you could use such ontology-based systems for analytics even over structured data that today requires data warehouses based on RDBMS technology. The pros and cons of this approach demand a separate post. One simple reason to consider semantic technology for business intelligence over structured data is to achieve complete de-coupling between the query layers and the physical data layer. This means that users of semantic queries do not worry about the actual location of data, and the results could even be federated from multiple physical data stores, even outside the enterprise boundary. The second is the de-coupling between the schema and the physical data model in storage. However, this usage scenario assumes that you have someone typing queries in a language like SPARQL. This surfaces the problem of a semantic gap in any semantic technology system, since users cannot be expected to completely and precisely express their information goals in structured queries. The responsibility of a semantic search component is to process the user-supplied keywords against the semantic knowledge and data and generate a ranked set of valid and relevant system queries in SPARQL, which are then evaluated over the semantic repository. This is our focus of semantic search.
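A deliberately simplified sketch of that keyword-to-SPARQL step is given below. The label matching and the single query template are my own illustration of the idea, not the actual Infosys semantic search component; it assumes rdflib and an invented movie vocabulary.

    from rdflib import Graph, Namespace, RDFS, Literal

    EX = Namespace("http://example.org/movies/")  # hypothetical vocabulary

    # Human-readable labels for ontology terms, so keywords can be matched to them
    labels = Graph()
    labels.add((EX.directedBy, RDFS.label, Literal("director")))
    labels.add((EX.KateWinslet, RDFS.label, Literal("kate winslet")))

    def keywords_to_sparql(keywords: str) -> str:
        """Naively match keywords to labelled terms and fill a fixed template."""
        text = keywords.lower()
        matched = [term for term, _, label in labels.triples((None, RDFS.label, None))
                   if str(label) in text]
        prop = next(t for t in matched if str(t).endswith("directedBy"))
        entity = next(t for t in matched if not str(t).endswith("directedBy"))
        # One candidate system query; a real component would generate and rank many
        return ("SELECT ?x WHERE { ?movie <%s> ?x . ?movie <%s> <%s> . }"
                % (prop, EX.hasActor, entity))

    print(keywords_to_sparql("directors of kate winslet"))

The generated query can then be evaluated over the enriched repository exactly as in the earlier examples; ranking the candidate interpretations is where most of the real work lies.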

5. Finally, there is much work to be done in the area of user interface design based on the richer information model and capabilities provided by all of the above technologies working together. The problem is that we are so conditioned by what we see and use regularly that a lot of existing systems simply mimic Google and return the top ten documents in response to a user query. Even returning a graphical visualization of relations in the repository is naïve unless there is some thought given to how the available knowledge is put to effective use to help users quickly move from general results to very specific answers, or to explore the semantic space of relatedness.

Infosys Information Services Industry, Leadership Meet 2010

The first Infosys Information industry leadership meet was held in New York on Oct 29, 2010.

The meet, organized by Infosys Technologies for select clients from the Information Services practice, provided a comprehensive roundup of the emerging opportunities in the Information Industry from the perspective of an analyst [IDC], an industry player [Elsevier] and a technology provider [Infosys]. I was one of the key speakers, presenting Infosys' viewpoint and our insights in this space, along with Susan Feldman from IDC, who needs no introduction, and Mirko Minnich, SVP, Product Technology Strategy at Elsevier, representing the publishing industry.

Susan Feldman, Research Vice President at IDC, was the keynote speaker at the event. Susan directs the Content Technologies Group at IDC, specializing in search and discovery software and digital marketplace technologies and dynamics. Susan spoke on the market opportunities within the digital marketplace and how semantic tools fit into the challenges and opportunities in this space.

According to IDC, the Digital Marketplace has reached a tipping point and will see double-digit growth rates until at least the end of this decade. The opportunity is about the kind of information and products and search results needed for individualized content. There's uncertainty about revenue streams, but publishers are experimenting. Since more clicks to content mean more revenues for content, relevancy ranking in search results is gaining more attention from publishers. This requires information systems to help people to work in groups to find answers using these tools. Metadata management is booming as a result, as is developing taxonomy tools, discovery applications for researchers, verifying information, sales prospect generation and lead generation. Semantic tools pay for themselves many times over if they're deployed for ecommerce and used to ease the requirement for hiring. But information vendors are slow to adopt these tools. They need to stop thinking about content being at the center of their products and to think about the information surrounding it.
Basic technologies include inference engines, text analytics, reporting tools, BI, data mining, moving up into image search, sentiment extraction, fact/event extraction, relationship extraction, geo-tagging, concept extraction, entity extraction, multilingual support, categorization and browsing, speech tagging, speech to text, search and relevance ranking.

Success in the digital marketplace is based on

·         Acquire content - possible by either aggregating from other sources or relying on tools to obtain user-generated content from a community focused on domain knowledge.

·         Provide tools that help to find relevant information at the right granularity.

Information-finding tools via filtered search, browsing and analysis are also key for content success. For example, e-discovery, where lawyers need to find information for litigation, took off quickly because people can package up this information easily for use and sharing.

·         Getting the right context for answers is very important.

Opportunities

·         The "too much information" challenge means that people can no longer afford to hire people to understand information at the rate that it comes in, creating an opportunity for the publishing industry.

·         Extracting leads targeted at a specific industry is, according to Susan, also an under-utilized opportunity.

·         Corporate users especially are interested in setting up special views of their content instead of queuing up at their IT departments. Publishers have content that can fit into this landscape, especially with tools that provide more visually-oriented solutions.

·         Trend-spotting is key for hedge funds, as they look for emerging trends that they can analyze and monetize through securities investments. Analytics are important, but have to include unstructured streams such as emails. Businesses need trusted information, they need more competitive intelligence, and they need the tools to understand vast amounts of information via findability tools, as end users push beyond analysts and researchers to use search tools themselves.

·         There is a need to understand who the user is, what the task is, what the location is, and what the user will do with the search results. Users need real-time information, and a re-evaluation of the value that is being provided. Is it the taxonomy that's more valuable to sell, or the targeted content? Is the information about personal relationships in a social network more important?

Points that Susan mentioned that technology vendors need to consider

·         Search and discovery technologies - what can they do beyond the search box?

·         "Fuzzy matching," which enables larger collections of information to be more usable by surfacing good results quickly, even if they're not entirely inclusive answers. She mentioned that we're moving towards surfacing information about better matches, related matches and so on.

·         Other techniques such as understanding sentence, word and paragraph structure, relevance ranking, and supporting ad hoc information access.

·         Tools that can help users get contextual answers - related actions they would like to provide, geo-specific requests, device-specific delivery, and content by category/entity - are all important.

·         Multi-lingual extraction is also an important feature for any large information provider.

·         Manual versus automated tagging trade-offs - a manual approach to tagging provides high precision for low volumes, and vice versa for automated tagging. There is a certain "golden mean" in which a combination of automated and manual tagging support can manage high-volume accuracy (comment: think of Dow Jones' new Consultant product, which uses search experts to fine-tune queries of content from Factiva databases for specific topic domains, problems and opportunities).

Solutions of interest mentioned by Susan

·         Illumin8 is an example Sue mentioned from Elsevier that allows innovation professionals to extract information about opportunities for innovation and to insert it into spreadsheets.

·         Information management companies like Iron Mountain are looking at extracting the "atoms" of information, mixing them around using advanced search indexes, and making them more reusable. Tagging them thoroughly up front is key to this process, so that they can "talk" across and within applications. For example, if you have a customer, there is information about that customer in many repositories, and this information needs to be combined and atomized.

·         Bing's Powerset search enrichment tools help to provide personally contextual information such as restaurants in the area or the weather when they understand that a query refers to a location (Google also, obviously). It may also list flights to and from that location, and so on.

·         Temis entity extraction enables this in publications like Nature Publishing Group's online Chemistry publication, providing not only document references but embedded content such as the chemical structure of referenced compounds.

·         Attensity360 provides sentiment/opinion extraction to see what people are thinking about products and services. Also important for monitoring traditional media to see the fluctuation of both mentions and sentiment.

·     Autonomy Explore shows clusters of data in a graphical map and tag clouds.