The Infosys Labs research blog tracks trends in technology with a focus on applied research in Information and Communication Technology (ICT)


January 31, 2011

Is your workflow becoming a bottleneck in addressing production issues on time?

In one of my interactions with a client manager, we were discussing ways in which the development team could respond faster when a performance problem occurs in a production environment.

As with any such exercise, understanding and analyzing the root cause of a performance bottleneck begins with collecting all the relevant information that would help the development team re-create and troubleshoot the problem scenario in the development environment.

The most trustworthy, and at times the only, source for identifying the root cause of a production problem is the set of logs generated by servers across the different tiers. For instance, a web server log can yield valuable insights into the application's workload, helping the development team re-create the production workload in a test environment. Application server logs provide insights into the transactions and any exceptions that triggered the problem. It is therefore imperative that these logs are made available to the development team.
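As a minimal illustration of the kind of workload characterization a web server log enables, the sketch below tallies request counts per URL and per hour; it assumes the common Apache/Nginx access-log format and a hypothetical file name, not any particular client environment.

```python
import re
from collections import Counter

# Minimal sketch: characterize production workload from a web server access log.
# Assumes the Apache/Nginx "common log format"; adjust the pattern for your servers.
LINE = re.compile(r'\S+ \S+ \S+ \[(\d+/\w+/\d+):(\d+):\d+:\d+ [^\]]+\] "(\w+) (\S+)')

url_hits, hourly_hits = Counter(), Counter()
with open("access.log") as log:          # hypothetical file name
    for line in log:
        m = LINE.match(line)
        if not m:
            continue
        day, hour, method, url = m.groups()
        url_hits[(method, url)] += 1
        hourly_hits[(day, hour)] += 1

print("Top URLs:", url_hits.most_common(10))
print("Requests per hour:", sorted(hourly_hits.items()))
```

A summary like this is often enough for the development team to configure a comparable load profile in the test environment.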

Now, the real challenge seems to be this - how soon can the development team gain access to the logs? In most organizations, the production environment and its logs are managed by an infrastructure team. For any problem analysis, the development team needs to request logs for the required duration from the infrastructure team. The existing log management practices, security and privacy regulations, and approval channels can introduce certain unavoidable delays in delivering these logs. Nevertheless, the maturity of your processes is measured by how well these delays are minimized. How long does the development team have to wait before they get to see the logs - is it hours or days? Are your operational processes becoming a bottleneck here?

January 13, 2011

Important Perfmon Metrics for Identifying Processor related Issues

Performance counters play a vital role in identifying performance issues in a software application. It is important to monitor and analyze the relevant performance counters to quickly pinpoint any performance issue. Here I will mention a few Perfmon counters that help identify processor-level performance bottlenecks.


Processor-level issues arise when the processor is so busy that it takes longer to respond to requests. High values for the Perfmon counters "Processor\% Processor Time", "System\Context Switches/sec" and "System\Processor Queue Length" indicate processor-level issues. A context switch occurs when the kernel switches the processor from one thread to another, when a higher-priority thread takes control from a lower-priority one, or when a running thread has to wait for an I/O operation. A high rate of context switching means the processor is being shared repeatedly by many threads, so a large amount of time is spent saving and restoring the state of running and ready threads. "Processor Queue Length" measures how many threads are in the Ready state waiting to be processed; a high value indicates that tasks have to wait longer to get processed.
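As a small, hedged example, the sketch below collects the three counters just mentioned using the built-in Windows typeperf utility from a Python wrapper; the sampling interval and sample count are arbitrary choices for illustration.

```python
import subprocess

# Sample the processor-related Perfmon counters discussed above every 5 seconds,
# 12 times (one minute of data), using the built-in Windows typeperf utility.
counters = [
    r"\Processor(_Total)\% Processor Time",
    r"\System\Context Switches/sec",
    r"\System\Processor Queue Length",
]
result = subprocess.run(
    ["typeperf", *counters, "-si", "5", "-sc", "12"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)   # CSV rows: one timestamped sample per line
```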


We can drill down further to find whether high processor utilization is caused by hardware or software. Perfmon counters such as "Processor\Interrupts/sec" and "System\System Calls/sec" help determine that. If "Processor\Interrupts/sec" is high, the high processor utilization is being driven by hardware (interrupt handling); if "System\System Calls/sec" is high, it is being driven by software.
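For a quick cross-check of the hardware-versus-software question outside Perfmon, one could compare interrupt and system-call rates directly. The sketch below does this with the psutil library - an assumption for the example, chosen because on Windows its counters correspond to the values discussed here.

```python
import time
import psutil

# Rough drill-down: compare the interrupt rate (hardware-driven load) with the
# system-call rate (software-driven load) over a short sampling window.
before = psutil.cpu_stats()
time.sleep(5)
after = psutil.cpu_stats()

interrupts_per_sec = (after.interrupts - before.interrupts) / 5
syscalls_per_sec = (after.syscalls - before.syscalls) / 5

print("CPU utilisation  :", psutil.cpu_percent(interval=1), "%")
print("Interrupts/sec   :", interrupts_per_sec)
print("System calls/sec :", syscalls_per_sec)
```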

January 11, 2011

Building real world solutions with Semantic Technology - putting the pieces together

My colleagues and I at the Center for Knowledge driven Information Systems (CKDIS), part of Infosys Applied Research & Development labs, have been studying the challenges and gaps in the current technology stack for large-scale industrial deployment of semantic technology and text analytics, especially as it relates to content analytics. My role within CKDIS is to lead research, manage research outcomes, and incorporate our understanding from research into large-scale solution architectures for industrial deployments that enable business.

1. There is a lot of interest in semantic technology, and that has led to some hype. Nevertheless, semantic technology capabilities are vital as we attempt to develop solutions that help users make sense of the information available to them. A lack of clear articulation of goals, and of an understanding of the business value of semantic technology in a specific business or information-processing context, is one of the main reasons for disappointment and disillusionment. Articulating those goals, I believe, is the necessary first step in any enterprise initiative around semantic technology.

The best way to arrive at business value is to describe user queries and interaction scenarios based on a clear information-seeking goal.

For example, when traveling from India to New York, how can one specify as part of the search that certain cities or countries should be avoided for transit? Look for use cases that require traversal of the relational structures within a domain, or that require some inference from simple assertions in content. That brings us to an important question about the precise difference between traditional complex IT applications that rely on structured data, say in an RDBMS, and applications that use semantic technology. What are the criteria for choosing between these two alternatives? The scenarios should amplify the role of semantic enrichment and metadata in delivering precise search answers. For example, when the user wants to find all the directors who have worked with a certain movie actor, the system has to extract 'director' relations from movie content and identify the person filling the 'director' role of each relation.
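To make the director example concrete, here is a minimal sketch of the structured query such a scenario implies, evaluated with the rdflib library over a toy graph; the namespace, property names and facts are invented purely for illustration.

```python
from rdflib import Graph, Namespace

MOV = Namespace("http://example.org/movies#")   # hypothetical movie ontology

g = Graph()
# Toy facts standing in for relations extracted from content:
g.add((MOV.Inception, MOV.directedBy, MOV.ChristopherNolan))
g.add((MOV.Inception, MOV.hasActor, MOV.LeonardoDiCaprio))
g.add((MOV.TheRevenant, MOV.directedBy, MOV.Inarritu))
g.add((MOV.TheRevenant, MOV.hasActor, MOV.LeonardoDiCaprio))

# "All directors who have worked with a certain actor" as a relation traversal:
query = """
PREFIX mov: <http://example.org/movies#>
SELECT DISTINCT ?director WHERE {
    ?movie mov:hasActor mov:LeonardoDiCaprio ;
           mov:directedBy ?director .
}
"""
for row in g.query(query):
    print(row.director)
```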

Two problems with existing search technology, which is primarily based on a statistical information model, are:

·         Mis-formulation: incomplete formulation of user goals in the user query, and

·         Mis-conception: the complete lack of context in processing search results.

Semantic technology provides the right relation-based model for representing information. However, in order to build complete content solutions, other related technology elements are also vital - text analytics, document classification and indexing, search, user interface design and visualization.

2. Text analytics is a broad field; in relation to content analytics, what is required is a subset of text analytics called information extraction, which involves extracting structure from unstructured text. This includes named entity extraction, relation extraction, and sentiment or opinion mining. It is important to keep in mind that information extraction research clearly stays away from the lofty goal of Natural Language Understanding (NLU), which attempts to build machines that can completely interpret text. The best minds in Artificial Intelligence have tried and have not fully succeeded in building a general-purpose NLU machine. Along with other factors, this has to do with the fundamental ambiguity and complexity of interpreting language. For example, one cannot build a system that reliably decides whether the sentence "I saw a boy with the telescope" means "I saw a boy carrying a telescope" or "I saw a boy through the telescope". Also, state-of-the-art text extraction engines do not attempt inference, since that involves modeling common sense and specific domain knowledge (if that is possible at all).
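As a small, hedged illustration of one of the information-extraction building blocks mentioned above (not of any Infosys engine), the sketch below runs off-the-shelf named entity extraction with the spaCy library; the model name is spaCy's standard small English model and the sentence is invented.

```python
import spacy

# Named entity extraction with an off-the-shelf statistical model.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Christopher Nolan directed Leonardo DiCaprio in Inception, released in 2010.")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)   # e.g. PERSON, WORK_OF_ART, DATE labels
```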

Infosys' specific research in this area is led by Dr. Lokendra Shastri, who heads the Center for Knowledge driven Information Systems. Lokendra has conceived the Infosys Semantic Extraction Engine (iSEE), which attempts to combine common-sense reasoning with a domain-specific lexicon, semantic processing using linking rules, and embodied semantics to not only extract what is explicitly mentioned in the text but also derive useful conclusions from it. The center also works on scalable realizations of rich semantic repositories and semantic search over such repositories.

3. Another area that requires clarity in content solutions is the differing roles of taxonomic classification and information extraction. A taxonomy represents one possible arrangement of the terms in a domain, organized into a hierarchy based on the information granularity each term conveys. This works well in domains where the information is already well coded or lends itself to such classification, like healthcare, life sciences or even the legal domain. However, even in these scenarios a major challenge is to constantly maintain the taxonomy to keep abreast of some subset of the real world. The other issue is that a taxonomy represents only one arrangement of the knowledge we have in a domain. It simply defines information buckets to put documents into and does not capture named relations between entities in the domain. This is where richer forms of knowledge organization, like ontologies, are useful. However, ontologies are heavy machinery, and in some content applications what is really required is a formal model of the major concepts and the relations between them, along with other relations like mutual exclusivity, synonymy and subsumption. This is more like a semantic concept network. We use the term domain knowledge model for this formal definition of the domain terms.

Of course there are many challenges related to the semantics of these models and the associated inference. My take, from the perspective of building practical systems, is that while we want to rely on and respect all the progress the research community has made in relation to OWL semantics and inference, I do not agree with adopting an open-world assumption for all forms of inference. My personal observation is that you need systems that combine both open- and closed-world reasoning. Open world is best suited for subsumption reasoning over the concept descriptions in OWL. However, when dealing with instance reasoning, I find that the closed-world assumption is a lot closer to the requirements and understanding of business systems. This could be the observation of an applied researcher corrupted by his experience of building database-based systems in the past. There are other researchers who have observed the deficiencies of tableau-based reasoning algorithms when dealing with large data sets. I stumbled on the idea of relying on existing data retrieval techniques (read SQL) to build scalable instance reasoning; later I found that this line of thought has been well studied by researchers at Karlsruhe who developed the KAON system. For the scope of this post, it is sufficient to summarize that a taxonomy works at the granularity of a document, while ontologies or knowledge models work at the richer granularity of named relations and entity types.
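A minimal sketch of that SQL-backed, closed-world instance reasoning follows: triples are stored in a relational table and the subclass hierarchy is traversed with a recursive SQL query instead of a tableau reasoner. The table layout and the toy triples are assumptions for the example, not the KAON design.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
con.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("Director", "rdfs:subClassOf", "Person"),
    ("Actor",    "rdfs:subClassOf", "Person"),
    ("nolan",    "rdf:type",        "Director"),
    ("dicaprio", "rdf:type",        "Actor"),
])

# Closed-world instance query: all instances of Person, following the
# subclass hierarchy with a recursive CTE rather than a description-logic reasoner.
rows = con.execute("""
    WITH RECURSIVE subclasses(c) AS (
        SELECT 'Person'
        UNION
        SELECT s FROM triples, subclasses WHERE p = 'rdfs:subClassOf' AND o = c
    )
    SELECT s FROM triples WHERE p = 'rdf:type' AND o IN (SELECT c FROM subclasses)
""").fetchall()
print(rows)   # instances of Person via its subclasses: nolan, dicaprio
```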

4. Even if you had all the above technology components to model knowledge, to extract structure from text, and to store and index documents and facts, you would still need to think about how the user is expected to interact with the system. I see two independent processes, or workflows, in content analytics. The first is content enrichment, which creates richer structure and indexes from existing content annotated against some formal domain model. The second is the actual process of users leveraging this to satisfy their information goals. We now need to move from content enrichment in semantic repositories to semantic processing of user keywords - what I call keyword-based semantic search.
There is an impedance mismatch between the expected technique for retrieving information from semantic repositories and what users provide. Users are accustomed to keyword-based search as exemplified by Google and Bing, whereas semantic repositories expect structured queries (like SPARQL) in order to retrieve information. In this sense semantic repositories are analogous to database systems. This immediately implies that you could use such ontology-based systems for analytics even over structured data that today requires data warehouses based on RDBMS technology. The pros and cons of this approach demand a separate post. One simple reason to consider semantic technology for business intelligence over structured data is to achieve complete de-coupling between the query layer and the physical data layer. This means that users of semantic queries do not worry about the actual location of data, and the results could even be federated from multiple physical data stores, including stores outside the enterprise boundary. The second reason is the de-coupling between the schema and the physical data model in storage.

However, this usage scenario assumes that you have someone typing queries in a language like SPARQL. This surfaces the problem of a semantic gap in any semantic technology system, since users cannot be expected to completely and precisely express their information goals in structured queries. The responsibility of a semantic search component is to process the user-supplied keywords against the semantic knowledge and data and generate a ranked set of valid and relevant system queries in SPARQL, which are then evaluated over the semantic repository. This is the focus of our semantic search work.
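The sketch below illustrates the idea in its simplest form: match user keywords against labels in a small domain knowledge model and emit ranked candidate SPARQL queries for the repository to evaluate. The model, labels, URIs and scoring are placeholders for illustration; a real implementation would score candidates against the knowledge base itself.

```python
# Toy domain knowledge model: labels mapped to classes/properties of an
# assumed movie ontology (URIs and labels are invented for illustration).
DOMAIN_MODEL = {
    "director": ("class",    "http://example.org/movies#Director"),
    "actor":    ("class",    "http://example.org/movies#Actor"),
    "movie":    ("class",    "http://example.org/movies#Movie"),
    "directed": ("property", "http://example.org/movies#directedBy"),
}

def keyword_based_semantic_search(keywords):
    """Turn free keywords into a ranked list of candidate SPARQL queries."""
    candidates = []
    for kw in keywords:
        match = DOMAIN_MODEL.get(kw.lower())
        if match is None:
            continue                      # unmatched keywords are simply skipped here
        kind, uri = match
        if kind == "class":
            candidates.append((1.0, "SELECT ?x WHERE { ?x a <%s> }" % uri))
        else:  # property
            candidates.append((0.8, "SELECT ?s ?o WHERE { ?s <%s> ?o }" % uri))
    return sorted(candidates, reverse=True)   # crude ranking by score

for score, query in keyword_based_semantic_search(["director", "Inception"]):
    print(score, query)
```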

5. Finally, there is much work to be done in the area of user interface design based on the richer information model and capabilities provided by all of the above technologies working together. The problem is that we are so conditioned by what we see and use regularly that a lot of existing systems simply mimic Google and return the top ten documents in response to a user query. Even returning a graphical visualization of relations in the repository is naïve unless some thought is given to how the available knowledge is put to effective use, helping users quickly move from general results to very specific answers or explore the semantic space of relatedness.

Infosys Information Services Industry, Leadership Meet 2010

The first Infosys Information Services industry leadership meet was held in New York on Oct 29, 2010.

The meet, organized by Infosys Technologies for select clients of the Information Services practice, provided a comprehensive roundup of the emerging opportunities in the information industry from the perspectives of an analyst [IDC], an industry player [Elsevier] and a technology provider [Infosys]. I was one of the key speakers, presenting Infosys' viewpoint and our insights in this space, along with Susan Feldman from IDC, who needs no introduction, and Mirko Minnich, SVP of Product Technology Strategy at Elsevier, representing the publishing industry.

I.      Susan Feldman, Research Vice President at IDC, was the keynote speaker at the event. Susan directs the Content Technologies Group at IDC, specializing in search and discovery software and digital marketplace technologies and dynamics. She spoke on the market opportunities within the digital marketplace and how semantic tools fit into the challenges and opportunities in this space.

According to IDC, the Digital Marketplace has reached a tipping point and will see double-digit growth rates until at least the end of this decade. The opportunity lies in the kind of information, products and search results needed for individualized content. There is uncertainty about revenue streams, but publishers are experimenting. Since more clicks on content mean more revenue, relevancy ranking in search results is gaining more attention from publishers. This requires information systems that help people work in groups to find answers. Metadata management is booming as a result, as are taxonomy tools, discovery applications for researchers, information verification, sales prospect generation and lead generation. Semantic tools pay for themselves many times over if they are deployed for e-commerce and used to ease hiring requirements, yet information vendors are slow to adopt these tools. They need to stop thinking of content as the center of their products and start thinking about the information surrounding it.

Basic technologies include inference engines, text analytics, reporting tools, BI and data mining, moving up to image search, sentiment extraction, fact/event extraction, relationship extraction, geo-tagging, concept extraction, entity extraction, multilingual support, categorization and browsing, speech tagging, speech-to-text, search and relevance ranking.

Success in the digital marketplace is based on:

·         Acquiring content - either by aggregating from other sources or by relying on tools to obtain user-generated content from a community focused on domain knowledge.

·         Providing tools that help find relevant information at the right granularity.

Information-finding tools such as filtered search, browsing and analysis tools are also key to content success. For example, e-discovery, where lawyers need to find information for litigation, took off quickly because people can package up this information easily for use and sharing.

·         Getting the right context for answers is very important.

Opportunities

·         The "too much information" challenge means that organizations can no longer afford to hire enough people to understand information at the rate it comes in, creating an opportunity for the publishing industry.

·         Extracting leads targeted at a specific industry is, according to Susan, also an under-utilized opportunity.

·         Corporate users especially are interested in setting up special views of their content instead of queuing up at their IT departments. Publishers have content that can fit into this landscape, especially with tools that provide more visually oriented solutions.

·         Trend-spotting is key for hedge funds, as they look for emerging trends that they can analyze and monetize through securities investments. Analytics are important, but have to include unstructured streams such as emails. Businesses need trusted information, they need more competitive intelligence, and they need findability tools to understand vast amounts of information as end users push beyond analysts and researchers to use search tools themselves.

·         There is a need to understand who the user is, what the task is, what the location is, and what the user will do with the search results. Users need real-time information, and providers need to re-evaluate the value being delivered. Is it the taxonomy that's more valuable to sell, or the targeted content? Is the information about personal relationships in a social network more important?

Points Susan made that technology vendors need to consider

·         Search and discovery technologies - what can they do beyond the search box?

·         "Fuzzy matching," which enables larger collections of information to be more usable by surfacing good results quickly, even if they're not entirely inclusive answers. She mentioned that we're moving towards surfacing information about better matches, related matches and so on.

·         Other techniques such as understanding sentence, word and paragraph structure, relevance ranking, and supporting ad hoc information access.

·         Tools that can help users get contextual answers - related actions the provider would like to offer, geo-specific requests, device-specific delivery, and content by category/entity are all important.

·         Multi-lingual extraction is also an important feature for any large information provider.

·         Manual versus automated tagging trade-offs - a manual approach to tagging provides high precision at low volumes, and vice versa for automated tagging. There is a certain "golden mean" in which a combination of automated and manual tagging support can deliver high-volume accuracy (comment: think of Dow Jones' new Consultant product, which uses search experts to fine-tune queries of content from Factiva databases for specific topic domains, problems and opportunities).

Solutions of interest mentioned by Susan

·         Illumin8 is an example Susan mentioned from Elsevier that allows innovation professionals to extract information about opportunities for innovation and to insert it into spreadsheets.

·         Information management companies like Iron Mountain are looking at extracting the "atoms" of information, mixing them around using advanced search indexes and making them more reusable. Tagging them thoroughly up front is key to this process, so that they can "talk" across and within applications. For example, if you have a customer, there is information about that customer in many repositories, and that information needs to be combined and atomized.

·         Bing's Powerset search enrichment tools help provide personally contextual information, such as restaurants in the area or the local weather, when they understand that a query refers to a location (Google does this too, obviously). They may also list flights to and from that location, and so on.

·         Temis entity extraction enables this in publications like Nature Publishing Group's online chemistry publication, providing not only document references but also embedded content such as the chemical structure of referenced compounds.

·         Attensity360 provides sentiment/opinion extraction to see what people are thinking about products and services. Also important for monitoring traditional media to see the fluctuation of both mentions and sentiment.

·         Autonomy Explore shows clusters of data in a graphical map and tag clouds.
