Given a corpus of natural language English text, UIMA has been used to implement advanced search applications that employ a host of NLP techniques to accept natural language questions, rather than just keywords, and to produce precise answers found in the corpus, rather than a list of documents. In the life sciences, UIMA was used to combine a set of text analytics that find chemical compound names (which have many variants) in medical abstracts and then cross-reference them with patent filings to identify compounds of particular interest. UIMA has also been used in a number of government engagements focused on deep text analysis and machine translation. Another active area is call-center applications that help operators quickly find relevant technical documentation and so reduce response time.
UIMA is an extensible and scalable framework that supports an application from the acquisition of original unstructured information (e.g., text, audio, video) through its analysis and, ultimately, to the population of structured resources (e.g., databases, search engine indices, knowledge bases).
This is accomplished by the definition and implementation of:
1) A pluggable analysis engine framework with multiple levels of interface compliance for easy integration, composition, and deployment of new and existing NLP capabilities and other analytics.
2) The UIMA collection processing architecture and the collection processing manager (CPM). These facilities allow the developer to implement defined interfaces for pluggable source acquisition components, called collection readers, and analysis sink components, called CAS consumers. The common analysis structure (CAS) is the core UIMA data structure for representing analysis results. The CPM provides the run-time infrastructure for running UIMA applications that include a flow from collection readers through analysis engines to CAS consumers.
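The flow described above, from collection readers through analysis engines to CAS consumers, can be sketched in a few lines. This is a minimal illustration only and does not use the UIMA API: the `CAS`, `Annotation`, `collection_reader`, `token_annotator`, `capitalized_annotator`, `IndexConsumer`, and `run_pipeline` names are all hypothetical, and the driver loop is a toy stand-in for the role the CPM plays at run time.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List


@dataclass
class Annotation:
    """A typed span over the artifact text (hypothetical, simplified)."""
    type: str
    begin: int
    end: int


@dataclass
class CAS:
    """Toy stand-in for the common analysis structure: artifact plus results."""
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def covered_text(self, ann: Annotation) -> str:
        return self.text[ann.begin:ann.end]


def collection_reader(documents: Iterable[str]) -> Iterator[CAS]:
    """Source acquisition: wrap each raw document in a fresh CAS."""
    for doc in documents:
        yield CAS(text=doc)


def token_annotator(cas: CAS) -> None:
    """Analysis engine: mark every whitespace-delimited token."""
    for match in re.finditer(r"\S+", cas.text):
        cas.annotations.append(Annotation("Token", match.start(), match.end()))


def capitalized_annotator(cas: CAS) -> None:
    """Analysis engine that builds on earlier results: flag capitalized tokens."""
    tokens = [a for a in cas.annotations if a.type == "Token"]
    for tok in tokens:
        if cas.covered_text(tok)[0].isupper():
            cas.annotations.append(Annotation("Capitalized", tok.begin, tok.end))


class IndexConsumer:
    """CAS consumer: populate a structured resource (here, an in-memory index)."""

    def __init__(self) -> None:
        self.index: Dict[str, List[str]] = {}

    def process(self, cas: CAS) -> None:
        for ann in cas.annotations:
            if ann.type == "Capitalized":
                self.index.setdefault(cas.covered_text(ann), []).append(cas.text)


def run_pipeline(reader: Iterable[CAS],
                 engines: List[Callable[[CAS], None]],
                 consumer: IndexConsumer) -> int:
    """Toy driver: push each CAS through the engines in order, then the consumer."""
    processed = 0
    for cas in reader:
        for engine in engines:
            engine(cas)
        consumer.process(cas)
        processed += 1
    return processed
```

In this sketch each component sees only the shared `CAS`, so an engine can be added, removed, or reordered without changing the reader or the consumer, which is the property the pluggable design is meant to provide.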
UIMA Application Overview
Multi-Level Compliance
Combining analytics has traditionally been a barrier, both to the scientific community's efforts to advance the state of the art and to the effective application of NLP, because combination requires a highly flexible architecture and framework that allows easy adoption and integration of independently developed analysis components. UIMA offers multiple levels of interface compliance. The first level allows easy adoption and fully enables combination; the highest level additionally supports advanced development tooling and high-performance integration.
Embedding Framework
Previous approaches resulted in stand-alone architectures and fragile implementations that re-invented or neglected systems middleware functions. UIMA is designed to be embedded in existing run-time environments, so that robust system middleware services are inherited rather than neglected or replaced with inferior implementations. Its design and reference implementation provide a stand-alone runtime as well as an embedding framework for running on different software platforms and on existing middleware.
UIMA also supports multi-modal analytics, allowing the developer to generate and process multiple views of the same artifact, each composed of a different modality (e.g., one text, one audio, one video). This was achieved through a number of extensions to the UIMA CAS so that, for example, a video stream can be decomposed into its audio track, closed captions, and video track, and all of these can be represented as different views of the same original artifact.
UIMA is an integrating platform that allows NLP researchers to work more efficiently and to rapidly share each other's technologies to accelerate scientific results.
In addition to the architecture and software framework, a software development toolkit (SDK) has been developed. By mid-2003, there was a sizable and growing user community within IBM Research, and in 2004 DARPA sponsored a workshop at which some of the top NLP researchers from academia and industry tried out the solutions. The overwhelming consensus was that UIMA was technically well-positioned to become a de facto standard for integrating and deploying NLP technologies.
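The idea of several views over one artifact can be illustrated with a small sketch. It assumes a simplified data model and is not the UIMA CAS API: `View`, `MultiViewCAS`, `split_video`, and `keyword_annotator` are hypothetical names, and the "video" is a plain dictionary standing in for a real stream.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class View:
    """One modality of the artifact plus the annotations made on it."""
    modality: str
    data: Any
    annotations: List[Tuple[str, Tuple[int, int]]] = field(default_factory=list)


@dataclass
class MultiViewCAS:
    """Toy CAS holding several named views of the same original artifact."""
    artifact_id: str
    views: Dict[str, View] = field(default_factory=dict)

    def create_view(self, name: str, modality: str, data: Any) -> View:
        if name in self.views:
            raise ValueError(f"view {name!r} already exists")
        view = View(modality, data)
        self.views[name] = view
        return view

    def get_view(self, name: str) -> View:
        return self.views[name]


def split_video(cas: MultiViewCAS, video: Dict[str, Any]) -> None:
    """Decompose a (fake) video record into audio, caption, and video views."""
    cas.create_view("audio", "audio", video["audio_samples"])
    cas.create_view("captions", "text", video["closed_captions"])
    cas.create_view("video", "video", video["frames"])


def keyword_annotator(cas: MultiViewCAS, keyword: str) -> None:
    """Text analytic that reads and annotates only the caption view."""
    view = cas.get_view("captions")
    start = view.data.find(keyword)
    while start != -1:
        view.annotations.append(("Keyword", (start, start + len(keyword))))
        start = view.data.find(keyword, start + 1)
```

Because each analytic asks for the view it understands, a text component can run on the captions while an audio or video component works on its own view, and all results remain attached to the same artifact.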
DARPA then formed a UIMA working group, with members from IBM, Carnegie Mellon University, the University of Massachusetts, Stanford, MIT, MITRE, Object Sciences, and Columbia, to help advance the UIMA architecture. The UIMA framework has been embedded in IBM products including WebSphere Portal Server (WPS), LWP, and IBM's new enterprise search product, OmniFind, enabling them to plug in NLP technologies such as document tokenizers, classifiers, and summarizers. The UIMA SDK is now widely used across IBM Research by NLP researchers and UIM application researchers. It was also published on AlphaWorks in December of last year and had over 200 downloads from a variety of external companies in less than two months.
Related Publications & Information
IBM Systems Journal: Unstructured Information Management
WebProNews.com "IBM’s Approach To Enterprise Search" (February 2005)
eWeek "IBM Preps Enterprise Search Update" (February 2005)
New York Times "At IBM, Google is so Yesterday" (December 2004)
Technology Review "Computers that Speak Your Language" (June 2003), Wade Roush
Adam Lally Researcher 




















