A New Computer Program Classifies Documents Automatically
As the volume of documents and other data that companies amass continues to grow, employees searching for useful information on internal corporate networks, or intranets, increasingly encounter the same problems facing users of the Internet. Until recently, the primary approaches to dealing with this chaos have been either cataloging by trained librarians, which can be expensive, or semiautomated sorting based on preselected categories, which is less accurate and therefore less useful.
Now, however, two computer scientists at IBM's Almaden Research Center have developed a third way: a computer program that can analyze thousands or even millions of documents and create a taxonomy for the entire collection (with categories, subcategories, sub-subcategories, and so on) that allows users to quickly zero in on documents of interest. The brainchild of Shivakumar Vaithyanathan and Byron Dom, the program, code-named Sabio from the Spanish word for "wise one," considers many candidate categorizations and settles on one that makes it highly efficient for users to find relevant documents. Sabio is being incorporated into a knowledge-management program, code-named Raven, under development by Lotus® that will provide a suite of tools for organizing, searching and viewing information in a company's intranet.
"The inspiration for Sabio came from the Internet search engine Yahoo!®," says Dave Newbold, general manager at Iris Associates, a Lotus subsidiary that develops Domino™ and Lotus Notes®. Yahoo! categorizes Internet sites, and users can search either by working through the organization tree to find the relevant categories or by performing a keyword search, identifying particular sites, and then looking at other sites that fall in the same categories. "A lot of corporations want to do what Yahoo! does, but Yahoo! was created manually, an option that is not feasible for most firms," Newbold says.
Thus, two years ago, during the development of Raven, Lotus held a competition for software that could organize the contents of an intranet into sensible categories and create a catalog of the contents based on those categories. Lotus extended invitations to a dozen vendors, including teams from IBM, to submit such programs, and eight accepted the challenge. The entries were then compared on how fast and how accurately they could categorize a set of 72,000 newswire releases on a variety of subjects with nothing to go on but the contents of each release.
The entries relied on a variety of methods. Vaithyanathan and Dom's program employs Bayesian statistics, a powerful technique that allows one to take into account educated guesses about what a solution will be and then modify those guesses according to data.
Sabio's first step, Vaithyanathan explains, is to decompose each document into a collection of "tokens," its relevant words and phrases, leaving out such inconsequential components as "and," "the" and "on the other hand" and assembling a collection of all the thousands of relevant words and phrases in all the documents. Sabio treats this collection mathematically as points in a huge multidimensional space, in which each dimension corresponds to a single word or phrase, and the number of times the word or phrase appears determines how far out along the dimension the point lies.
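To make the idea of documents as points concrete, here is a minimal sketch in Python. It is not Sabio's code: the stop list, the `tokenize` and `build_vectors` helpers and the two sample sentences are all assumptions made for illustration, and only single words (not phrases) are kept as tokens.

```python
import re
from collections import Counter

# Hypothetical stop list; the article names only "and", "the" and "on the other hand".
STOP_PHRASES = ["on the other hand"]
STOP_WORDS = {"and", "the", "a", "of", "to", "in"}

def tokenize(text):
    """Lowercase the text, drop stop phrases and stop words, return word tokens."""
    text = text.lower()
    for phrase in STOP_PHRASES:
        text = text.replace(phrase, " ")
    words = re.findall(r"[a-z]+", text)
    return [w for w in words if w not in STOP_WORDS]

def build_vectors(documents):
    """Return (vocabulary, vectors): one count per vocabulary entry per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    # Each document becomes a point; each coordinate is a token count.
    vectors = [[c[token] for token in vocabulary] for c in counts]
    return vocabulary, vectors

docs = [
    "The bank raised interest rates, and the bank cut lending.",
    "On the other hand, the team won the match.",
]
vocabulary, vectors = build_vectors(docs)
```

Here "bank" appears twice in the first sample document, so that document lies two units out along the "bank" dimension and zero units out along "team."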
Only when two documents share many of the same words and phrases will they be relatively close together in this multidimensional space. The program therefore regards clusters of nearby points as being about similar subjects. In this respect, Sabio is similar to a number of other categorizing programs, Dom notes, but there is a crucial difference. "With most clustering programs, you have to indicate how many clusters to use and what features (i.e., which words or phrases) to cluster on." By contrast, Sabio figures out not just which documents are assigned to which clusters, but the number of clusters and how each cluster is defined in terms of key words and phrases. Some words and phrases, for instance, will probably turn up with similar frequency in almost all the documents and be of little use in clustering; others will occur regularly in only a certain subset of the documents and be crucial in picking out clusters of related documents; and still others will have a more complicated pattern of occurrence. Sabio determines for each set of documents which words and phrases are useful and how they are used to define the clusters.
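A small Python sketch shows why shared vocabulary translates into closeness. The article does not say which distance measure Sabio uses, so the ordinary Euclidean distance here is an assumption, as are the five-word vocabulary and the three count vectors.

```python
import math

def distance(u, v):
    """Euclidean distance between two count vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical vocabulary: ["bank", "loan", "match", "rates", "team"]
finance_a = [2, 1, 0, 1, 0]
finance_b = [1, 1, 0, 2, 0]
sports = [0, 0, 2, 0, 1]

near = distance(finance_a, finance_b)  # the two finance stories share most words
far = distance(finance_a, sports)      # no words in common
```

The two finance vectors come out much closer to each other than either is to the sports vector, which is the property a clustering program exploits.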
This flexibility allows Sabio to analyze a set of documents and, unaided by human categorizers, come up with an effective way to group the documents into relevant categories, but it also introduces a tricky complication. If the cluster model — the number of clusters and the collection of key words and phrases — is predefined, it is a relatively simple exercise in statistics to find the best clustering. One simply minimizes the total "scatter," that is, how spread out the documents are inside the clusters in the multidimensional space of words and phrases. But if the cluster model is not fixed, the search for the best clustering must also take into account the complexity of the rules specifying which documents are put into which clusters. This leads to a delicate balancing act.
At one extreme, the scatter can be minimized by having almost as many clusters as documents, with each narrowly defined cluster containing only one or a few very similar documents. Such a complex clustering would be no improvement over the original set of documents, however, since it would not help users to zero in on items of interest. At the other extreme, the simplest possible clustering is to lump all documents into a single large cluster with a very large scatter. It, too, would be of no practical use.
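The two extremes can be checked with a few lines of Python. This sketch assumes that scatter is measured as the sum of squared distances from each document to the center of its cluster, a common choice that the article does not confirm for Sabio; the four two-dimensional "documents" are invented.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def scatter(clusters):
    """Sum of squared distances from every point to its own cluster's centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        for p in points:
            total += sum((x - m) ** 2 for x, m in zip(p, c))
    return total

# Four invented documents over a two-word vocabulary.
docs = [[2, 0], [3, 0], [0, 2], [0, 3]]

one_per_doc = [[d] for d in docs]   # as many clusters as documents
single = [docs]                     # everything lumped together
two_topics = [docs[:2], docs[2:]]   # a sensible middle ground

scatter_one_per_doc = scatter(one_per_doc)
scatter_single = scatter(single)
scatter_two_topics = scatter(two_topics)
```

One cluster per document drives the scatter to zero, and a single cluster gives the largest scatter of the three; the two-topic grouping sits in between with far fewer clusters than documents.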
What Sabio does, Dom explains, is to use Bayesian statistical techniques to find the best clustering of documents. In this case, "best" is defined in terms of a balance between keeping the scatter low and the clustering model simple. This is a new approach to clustering, Vaithyanathan says, and it has proved superior to any other automated approach. In tests on document collections that have been categorized by human experts, Sabio's clustering has come quite close to that of the experts, he says, and, as the number of documents increases, Sabio's performance improves.
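The balancing act can be illustrated with a deliberately crude stand-in for the Bayesian criterion: add a fixed penalty for every cluster to the scatter and pick the clustering with the lowest total. The `score` function, the penalty value of 2.0 and the sample data are assumptions for this sketch, not Sabio's actual formula.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def scatter(clusters):
    """Sum of squared distances from every point to its own cluster's centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        for p in points:
            total += sum((x - m) ** 2 for x, m in zip(p, c))
    return total

def score(clusters, penalty_per_cluster):
    """Lower is better: fit (scatter) plus a charge for model complexity."""
    return scatter(clusters) + penalty_per_cluster * len(clusters)

docs = [[2, 0], [3, 0], [0, 2], [0, 3]]
candidates = {
    "one_per_doc": [[d] for d in docs],
    "single": [docs],
    "two_topics": [docs[:2], docs[2:]],
}

PENALTY = 2.0  # hypothetical cost of each extra cluster
scores = {name: score(c, PENALTY) for name, c in candidates.items()}
best = min(scores, key=scores.get)
```

With this penalty, neither extreme wins: the one-cluster-per-document model pays too much for complexity, the single cluster pays too much in scatter, and the two-topic grouping has the lowest combined score.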
Once the best way to cluster the documents has been found, Sabio creates a taxonomy by grouping related clusters, again using a balance between scatter and the complexity of the clustering rules to decide which clusters are most closely related. The result is a treelike structure in which a company's documents — the trunk of the tree — are split into perhaps a dozen major categories, or limbs, each of which is divided into branches, subbranches and so on, to the smallest twigs of the tree, which are the individual clusters of documents. To zero in on particular documents, a user simply moves up the tree, following the branches of interest.
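The grouping of clusters into a tree can be sketched as follows. Sabio reportedly decides which clusters are related by balancing scatter against rule complexity; this Python example swaps in a simpler, well-known technique, repeatedly merging the two groups whose centers are closest. The cluster names, the vectors and the `build_tree` helper are all hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def build_tree(clusters):
    """clusters: dict of name -> list of vectors. Returns nested tuples of names."""
    nodes = [(name, list(points)) for name, points in clusters.items()]
    while len(nodes) > 1:
        best = None
        # Find the pair of current groups with the closest centroids.
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                d = sq_dist(centroid(nodes[i][1]), centroid(nodes[j][1]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = ((nodes[i][0], nodes[j][0]), nodes[i][1] + nodes[j][1])
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0][0]

clusters = {
    "banking": [[3, 0, 0], [4, 1, 0]],
    "markets": [[3, 1, 0], [2, 2, 0]],
    "soccer": [[0, 0, 3], [0, 1, 4]],
    "tennis": [[0, 0, 2], [1, 0, 3]],
}
tree = build_tree(clusters)
```

In this toy data the two sports clusters are paired first, then the two finance clusters, and the two resulting limbs meet at the trunk.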
All of this is done automatically, but Sabio also allows people to have input into the process in a number of ways. Once the categorization is complete, it is expected that users will review the categories and modify them to better suit themselves; for example, by combining categories or creating new ones or by moving certain documents from one classification to another. In each case, Sabio is programmed to learn from these modifications and to classify future documents accordingly. A user can also show Sabio a set of documents that has been categorized by human experts, and the program will generate a set of rules to produce that categorization and will apply those rules to other documents.
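Learning from an expert-made categorization can be pictured with a nearest-center classifier: summarize each category by the average of its documents, then assign a new document to the category whose average is closest. The article does not describe the rules Sabio actually generates, so this Python sketch is only an assumed stand-in, and the categories and vectors are invented.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def train(labeled):
    """labeled: dict of category -> list of count vectors. Returns category -> centroid."""
    return {category: centroid(points) for category, points in labeled.items()}

def classify(model, vector):
    """Return the category whose centroid is nearest to the new document."""
    return min(model, key=lambda category: sq_dist(model[category], vector))

# Invented expert-labeled documents over a two-word vocabulary.
labeled = {
    "finance": [[3, 0], [4, 1]],
    "sports": [[0, 3], [1, 4]],
}
model = train(labeled)
first = classify(model, [5, 0])
second = classify(model, [0, 2])
```

A user's correction, such as moving a document from one category to another, would amount to editing `labeled` and calling `train` again in this simplified picture.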
Sabio proved its mettle in the Lotus competition. Most important for Newbold and Lotus, Sabio's classification proved to be both fast and accurate. "It didn't take long to figure out who won the competition," Newbold recalls. "Sabio won hands down, in terms of both speed and accuracy. It was 80 percent faster than its nearest competitor and, although there is not a good numerical measure of accuracy in clustering, Sabio was clearly far better than the others at putting documents into the right categories."
The number of documents that Sabio can categorize is limited only by a company's computing power, Vaithyanathan points out. The accuracy actually improves as the number of documents increases, but very large sets of documents, such as those found in a company the size of IBM, demand a correspondingly powerful computer to handle them.
Sabio, or the Automated Taxonomy Generator (ATG) as it is called in Raven, will be combined with several other features in the Lotus product. Besides ATG, Raven has a full-text search engine that pulls up any document with a given word or words, as well as an expertise locator that uses information in the documents to identify people with knowledge of, or interest in, particular subjects. But the feature most likely to catch the attention of chief information officers is Sabio's ability to bring order to the most chaotic set of corporate e-documents.
Robert Pool is a freelance writer based in Tallahassee, Florida. His most recent book is Beyond Engineering: How Society Shapes Technology.