IBM Research

Think Research


 


Featured Concept
Mining Documents for Data
COVER STORY: Part 3

By Tom Halfhill

New software brings to ordinary text files the search capabilities of the most sophisticated databases -- and then some

One of the ironies of modern computing is that, as online information becomes more plentiful, it gets harder to find the information you need. But Intelligent Miner for Text, a new data mining tool, can change all that. Developed by an IBM Software Solutions group in Böblingen, Germany, in cooperation with several Research teams, the product began shipping in early 1998.

Unlike conventional data mining tools, which operate on structured databases -- with information organized in neatly defined fields and records -- the new product works on ordinary text files. It can search, analyze, organize and index all kinds of free-form documents: email messages, memos, reports, notes, outlines, business plans, technical papers, documentation, Web pages and more. It automatically generates metadata -- structured information about the documents -- which helps turn a mass of text files into a database of useful information. "This is very important for knowledge management because it helps people sift through what amounts to noise," says Alan Marwick, senior manager of text analysis and advanced search tools at IBM's Thomas J. Watson Research Center.

Although Intelligent Miner for Text can analyze documents written in a variety of languages, there are still some it cannot handle. One of the first steps, therefore, is to identify the language in which a document is written. That is done with a component called the Human Language Recognizer -- based on the work of John Prager at the Thomas J. Watson Research Center -- which can recognize a language from clues such as frequently used words and letter sequences.
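To make the idea of recognizing a language from letter sequences concrete, here is a minimal sketch in Python. It is not the Human Language Recognizer itself, whose internals the article does not describe. The three sample sentences, the choice of three-letter sequences and the names `trigrams` and `guess_language` are all assumptions made for illustration.

```python
from collections import Counter

# Tiny illustrative samples (an assumption; a real system would train on far more text).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then the dog runs away with the fox",
    "german": "der schnelle braune fuchs springt über den faulen hund und dann läuft der hund mit dem fuchs weg",
    "french": "le renard brun rapide saute par dessus le chien paresseux et puis le chien part avec le renard",
}

def trigrams(text):
    """Count overlapping three-character sequences, with spaces marking word edges."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

# One letter-sequence profile per language.
PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}

def guess_language(text):
    """Pick the language whose profile shares the most trigram weight with the text."""
    grams = trigrams(text)

    def score(profile):
        return sum(count * profile[gram] for gram, count in grams.items())

    return max(PROFILES, key=lambda lang: score(PROFILES[lang]))

print(guess_language("der hund und der fuchs"))
```

Short, frequent words such as "the" or "der" dominate the score, which matches the article's point that frequently used words and letter sequences are strong clues.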

Another module -- the Feature Extractor, based on the Textract tool developed by a Watson team led by Roy Byrd -- consists of two components. One, nicknamed the Nominator, identifies proper nouns, while its counterpart, the Terminator, finds multiword terms such as "laser printer." Identifying proper nouns and terms is a prerequisite to classifying a document, because it allows another module called the clustering tool to sort the documents by topic.
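One simple way to approximate the search for multiword terms is to look for adjacent word pairs that recur and contain no filler words. The Python sketch below does only that. It is a stand-in for illustration, not the Textract method, and the stopword list, the `min_count` threshold and the function name `candidate_terms` are assumptions.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "for",
             "with", "our", "it", "was", "out"}

def candidate_terms(text, min_count=2):
    """Return adjacent word pairs that recur and contain no stopword."""
    counts = Counter()
    # Split on punctuation so a pair never spans a sentence or clause boundary.
    for chunk in re.split(r"[.!?;,]", text.lower()):
        words = re.findall(r"[a-z]+", chunk)
        for first, second in zip(words, words[1:]):
            if first not in STOPWORDS and second not in STOPWORDS:
                counts[(first, second)] += 1
    return {" ".join(pair) for pair, n in counts.items() if n >= min_count}

text = ("The laser printer is in the hall. Our laser printer was out of toner. "
        "A new toner cartridge arrived for the laser printer.")
print(candidate_terms(text))
```

In this sample only "laser printer" recurs, so it is the only candidate returned. "toner cartridge" appears once and falls below the threshold.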

Clustering is the key to finding related documents in large masses of text. The clustering tool, called HiClust, was developed at the Haifa Research Laboratory and refined at Böblingen. It works by grouping documents that discuss the same topic. That not only facilitates browsing, but it can help refine search results, especially when the query is fuzzy or ambiguous. For instance, if one enters the search term "Merced," almost any Web search engine will return a list of documents, some relating to Merced County in California and others to the new Intel chip architecture. HiClust will automatically cluster the results, separating one group from the other.
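The "Merced" example can be sketched in a few lines of Python. The article does not describe how HiClust works internally, so the code below swaps in a much simpler technique: word-count vectors, cosine similarity and a greedy single-pass grouping. The sample documents, the threshold and the names `vectorize`, `cosine` and `cluster` are assumptions. The query term itself is ignored because it appears in every result and so says nothing about the topic.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "in", "is", "of", "and", "to"}

def vectorize(text, ignore):
    """Bag-of-words vector without stopwords or the ignored (query) words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS and w not in ignore)

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def cluster(docs, ignore=frozenset(), threshold=0.2):
    """Greedy grouping: a document joins the first group holding a similar document."""
    vectors = [vectorize(d, ignore) for d in docs]
    groups = []
    for i, vec in enumerate(vectors):
        for group in groups:
            if any(cosine(vec, vectors[j]) >= threshold for j in group):
                group.append(i)
                break
        else:  # no existing group was similar enough
            groups.append([i])
    return groups

results = [
    "Merced County farms grow almonds in the California valley",
    "Intel Merced chip architecture uses a new processor design",
    "California valley farms in Merced County report almond harvest",
    "The Merced processor is the first chip in a new Intel architecture",
]
print(cluster(results, ignore={"merced"}))
```

The county documents (indexes 0 and 2) end up in one group and the chip documents (1 and 3) in another.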

Intelligent Miner for Text also includes utilities that bring knowledge management to Web browsing. Built on work by Rob Barrett, a research staff member at Almaden, they can save all the Web pages you've ever visited, for later classification and reference. The utilities can automatically customize Web pages for individual usage patterns -- say, adding links to relevant sites you've viewed before.

INTELLIGENCE TO COME

Despite its many capabilities, Intelligent Miner for Text is still developing. Future versions, says Marwick, not only will support additional natural languages but will have even more sophisticated tools for searching and organizing unstructured documents.

As part of the knowledge base on which future versions will be built, Watson researchers are pursuing many aspects of the underlying technologies. One outcome is a component that summarizes large documents by extracting the most important sentences, as determined by the frequency of certain words and other language patterns.
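Summarizing by sentence extraction can be illustrated with a frequency-based scorer. The Python sketch below ranks sentences by the average document-wide frequency of their words and keeps the top ones in their original order. It covers only the word-frequency signal mentioned above, not the "other language patterns," and the stopword list and the name `summarize` are assumptions.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "in", "was", "of", "and", "is", "to"}

def content_words(sentence):
    """Lowercased words of a sentence, minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]

def summarize(text, max_sentences=1):
    """Keep the sentences whose words are, on average, most frequent in the text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore original order
    return [sentences[i] for i in keep]

text = ("Text mining finds patterns in text. The weather was pleasant. "
        "Mining tools index text documents.")
print(summarize(text, max_sentences=2))
```

Averaging rather than summing the word scores keeps long sentences from winning on length alone.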

A new Context Thesaurus helps focus a user's search. If the term "mustang," say, is entered, the user might be prompted to choose "horse," "Ford" or "airplane," to rule out irrelevant documents.

Another development finds relationships between markedly different terms. For instance, it might figure out that "toner cartridge" is related to "laser printer" if both terms frequently occur in the same documents. Later, a search for all documents related to laser printers would also retrieve those that mention toner cartridges, even if "laser printer" never appears in those documents.
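The toner cartridge example amounts to counting how often two terms share a document and then widening a search with the terms that pass a threshold. Here is a minimal sketch in Python, assuming each document has already been reduced to a set of terms. The sample data, the `min_shared` threshold and the names `related_terms` and `expanded_search` are hypothetical and are not taken from the product.

```python
from collections import defaultdict
from itertools import combinations

def related_terms(doc_terms, min_shared=2):
    """Map each term to the terms that share at least min_shared documents with it."""
    together = defaultdict(int)
    for terms in doc_terms:
        for a, b in combinations(sorted(terms), 2):
            together[(a, b)] += 1
    related = defaultdict(set)
    for (a, b), n in together.items():
        if n >= min_shared:
            related[a].add(b)
            related[b].add(a)
    return related

def expanded_search(query, doc_terms, related):
    """Return indexes of documents mentioning the query or any related term."""
    wanted = {query} | related.get(query, set())
    return [i for i, terms in enumerate(doc_terms) if terms & wanted]

docs = [
    {"laser printer", "toner cartridge"},
    {"laser printer", "toner cartridge", "paper tray"},
    {"toner cartridge"},      # never mentions "laser printer"
    {"coffee machine"},
]
related = related_terms(docs)
print(expanded_search("laser printer", docs, related))
```

Document 2 is retrieved even though "laser printer" never appears in it, because "toner cartridge" co-occurs with that term in two other documents.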


Tom R. Halfhill, a former senior editor of Byte magazine, lives near San Francisco.



