The Knowledge Rush

Unstructured Information – The Knowledge Rush

Unstructured information represents the largest, most current and fastest growing source of knowledge available to businesses and governments world-wide.

The web is just the tip of the iceberg. Consider the droves of corporate documentation, ranging from best practices, technical reports, problem reports, customer communications and contracts to emails and voice mails. In these mounds of natural language artifacts often lie the nuggets of knowledge critical for recognizing important trends, creating new opportunities, solving problems or preventing disasters.

  • Shaving off just seconds per call to find the right technical documentation in call-centers can save millions.

  • Rapidly detecting emerging trends in problem reports coming in from all over the globe can avoid recalls and save companies and their customers millions, if not billions.

  • Detecting otherwise unrealized drug interactions by analyzing the linkages in medical abstracts can help prevent disaster as well as help discover new drugs or cures.

  • Analyzing communications linked to terrorist networks in the form of multi-lingual text or other modalities can help uncover plots threatening national security before they happen.

  • Analyzing SEC reports can help evaluate corporate financial positions.

  • And many more applications…

Applications like these, which rely on the rapid discovery of vital knowledge, require the analysis of unstructured information. This is all the information that has NOT been carefully encoded in enterprise databases but rather exists as natural language text, speech or video.

Unstructured information includes the documents found on the web, plus an estimated 80% of the information generated by enterprises around the world. The principal challenge with unstructured information is that it needs to be analyzed in order to identify, locate and relate the entities and relationships of interest, that is, to discover the vital knowledge contained therein.

Once these entities and relationships are detected, they may be indexed in structured forms so that powerful search technologies like search engines and database engines can efficiently find the knowledge you need, when you need it.
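
As a rough illustration of what "indexed in structured forms" can mean, the Python sketch below builds a small inverted index from detected entities to the documents that mention them. The `Entity` record, the `build_index` function and the sample data are hypothetical names invented for this example; they are not part of UIMA or of any particular search engine.

```python
from collections import defaultdict
from typing import NamedTuple


class Entity(NamedTuple):
    """A detected entity: source document, entity type, and surface text."""
    doc_id: str
    kind: str
    text: str


def build_index(entities):
    """Map (kind, lower-cased text) to the sorted list of documents mentioning it."""
    index = defaultdict(set)
    for e in entities:
        index[(e.kind, e.text.lower())].add(e.doc_id)
    return {key: sorted(docs) for key, docs in index.items()}


# Hypothetical output of an earlier analysis step.
detected = [
    Entity("doc1", "chemical", "Aspirin"),
    Entity("doc2", "chemical", "aspirin"),
    Entity("doc2", "chemical", "Warfarin"),
]

index = build_index(detected)
print(index[("chemical", "aspirin")])  # documents that mention aspirin
```

Once the entities sit in a keyed structure like this, finding every document that mentions a given chemical is a direct lookup rather than a scan of the raw text.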

The bridge from the unstructured world to the structured is enabled by the software agents that do the analysis. These can scan a text document, for example, and pull out chemical names and their interactions, or identify events, locations, products, opinions about products, problems, methods, etc. UIMA calls these software agents analysis engines.
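
To make the idea of such an agent concrete, here is a deliberately tiny Python sketch that scans text and reports where chemical names occur. It is a toy under stated assumptions, not the UIMA API: the `Annotation` class, the `annotate_chemicals` function and the three-word lexicon are all hypothetical, and a real engine would rely on far richer resources than a fixed word list.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A labeled span of text, identified by character offsets."""
    kind: str
    begin: int
    end: int
    text: str


# Toy lexicon (assumption for illustration only).
CHEMICALS = ("aspirin", "warfarin", "ibuprofen")
CHEMICAL_PATTERN = re.compile(r"\b(" + "|".join(CHEMICALS) + r")\b", re.IGNORECASE)


def annotate_chemicals(text):
    """Return one Annotation per lexicon match found in the text."""
    return [
        Annotation("chemical", m.start(), m.end(), m.group(0))
        for m in CHEMICAL_PATTERN.finditer(text)
    ]


sample = "Aspirin may increase the effect of warfarin."
annotations = annotate_chemicals(sample)
for ann in annotations:
    print(ann.kind, ann.begin, ann.end, ann.text)
```

Recording offsets rather than just the matched strings keeps each result tied to its position in the original document, so later steps can relate annotations to one another.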

The Essential Analysis — A Best of Breed Integration

There are all kinds of analysis engines being developed in industry and academia. Each tends to be highly specialized in solving small and different parts of an overall solution. Some engines, for example, specialize in breaking up documents into individual words. This may be simple for white-space delimited languages, but Chinese, for example, is another story. And word segmentation is just the first step in the process.

To accurately detect and classify domain-specific knowledge, deeper analysis is required. This may depend on part-of-speech detection, grammatical parsing and named-entity recognition where proper names, organizations and locations are identified. Other engines may specialize in detecting events and times and then others work on detecting relationships between these elements. A variety of techniques may be used to develop these specialized engines including rule-based and statistical machine learning algorithms.

Analysis engines may vary along a variety of dimensions, including document modality (text, speech, video), format, natural language, style and domain. They may also make different performance tradeoffs, favoring, for example, precision over speed or recall over precision.
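
The precision and recall tradeoff can be made concrete with a short calculation. Using the standard definitions (precision is the fraction of reported items that are correct; recall is the fraction of correct items that were reported), the Python sketch below compares two imaginary engines. The function name `precision_recall` and the sample entity sets are assumptions made up for this example.

```python
def precision_recall(predicted, gold):
    """Compute (precision, recall) of a predicted set against a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold-standard entities for one document.
gold = {"IBM", "Apache", "Chicago", "Paris"}

# A cautious engine reports little, but everything it reports is right.
cautious = {"IBM", "Apache"}
# A greedy engine finds everything, along with some false alarms.
greedy = {"IBM", "Apache", "Chicago", "Paris", "Monday", "Rush"}

cautious_p, cautious_r = precision_recall(cautious, gold)
greedy_p, greedy_r = precision_recall(greedy, gold)
print(f"cautious: precision={cautious_p:.2f} recall={cautious_r:.2f}")
print(f"greedy:   precision={greedy_p:.2f} recall={greedy_r:.2f}")
```

In this made-up case the cautious engine scores perfect precision but only half the recall, while the greedy engine reaches full recall at the cost of precision; which is preferable depends on the application.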

The critical point is that to develop a complete solution that takes you from unstructured information to usable knowledge you must integrate a variety of independently developed analysis engines. These must be integrated to perform a comprehensive analysis task and then their results must be funneled into systems that allow users to rapidly find and exploit the discovered knowledge, for example, search engines, databases and/or knowledge bases.
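
The integration described above can be sketched, very loosely, as a chain of independent steps that share one analysis record, followed by a hand-off of the results in a structured form. The Python below is only an illustration under stated assumptions: the function names (`tokenize`, `detect_capitalized`, `run_pipeline`) and the dictionary layout are hypothetical and do not reflect how UIMA itself represents engines or results.

```python
import re


def tokenize(analysis):
    """First engine: record each word with its character offsets."""
    analysis["tokens"] = [
        (m.start(), m.end(), m.group(0))
        for m in re.finditer(r"\w+", analysis["text"])
    ]
    return analysis


def detect_capitalized(analysis):
    """Second engine: builds on the tokens and flags capitalized words.

    This is a crude stand-in for name detection; it also flags
    sentence-initial words, which a real engine would filter out.
    """
    analysis["candidates"] = [
        tok for tok in analysis["tokens"] if tok[2][0].isupper()
    ]
    return analysis


def run_pipeline(text, engines):
    """Run independently written engines in order over one shared record."""
    analysis = {"text": text}
    for engine in engines:
        analysis = engine(analysis)
    return analysis


result = run_pipeline("Reports from Chicago mention IBM.", [tokenize, detect_capitalized])

# Funnel the results into structured rows, ready for a search index or database.
rows = [
    {"name": text, "begin": begin, "end": end}
    for begin, end, text in result["candidates"]
]
print(rows)
```

Because each engine only reads and extends the shared record, an engine from one developer can be swapped for another's without rewriting the rest of the chain, which is the point of a best-of-breed integration.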

IBM’s Mission in Unstructured Information Management (UIM)

An Unstructured Information Management (UIM) solution may be generally characterized as a software system that analyzes large volumes of unstructured information (text, audio, video, images, etc.) to discover, organize and deliver relevant knowledge to the client or application end-user. An example is an application that processes millions of medical abstracts to discover critical drug interactions. Another example is an application that processes tens of millions of documents to discover key evidence indicating probable competitive threats.

Acknowledging the tremendous value in unstructured information sources, IBM products and services centered on information integration are powered by UIMA and positioned to leverage increasingly sophisticated analytics to deliver ever greater value to our customers.

Analysis engines and related resources for building UIM solutions will come from a wide variety of vendors. IBM wants to ensure that our products and services can exploit best-of-breed combinations of these technologies to deliver the best end-to-end solutions to our customers.

We want to encourage Apache UIMA’s broad adoption to cultivate a world-wide community focused on the development, refinement and integration of advanced analysis technologies that will help enhance our solutions and drive this industry forward.