IBM Research

Think Research


 


Featured Concept
Paw and Order on the Intranet
COVER STORY: Part 4

By Chuck Boyer

WebCat will help turn unruly corporate intranets into well-categorized resources whose information can readily be found

Knowledge management begins at home, and for IBM that means capturing and categorizing some 2 million documents in the company's own intranet, starting with 100,000 Web pages in IBM Research alone. To do this by hand would be costly and laborious. Instead, researchers at IBM's Thomas J. Watson Research Center are developing a suite of tools that will automatically sift through Web pages and assign documents to a hierarchical taxonomy. Called WebCat, because it categorizes the Web, it should enable companies to bring browse and search capabilities to their intranets at a fraction of the cost and time needed to develop them manually. Some of the technology underlying WebCat is already on its way to becoming commercially available.

Like the Internet's Yahoo!®, WebCat lets users navigate and search through a hierarchy of information. "The ultimate goal," says Eric Brown, a researcher in the information design and access group, "is to help people find what they are looking for in a simple, intuitive and efficient way." The Watson team is developing tools in WebCat that help system administrators define a taxonomy, starting either with a generalized notion of the categories they want or with a series of existing documents. From there, WebCat will automatically assign documents into the taxonomy and enable end users to find information by navigating to topics of interest or searching within selected sub-branches of the taxonomy.

Once an administrator has defined a taxonomy, a collection of training documents must be categorized manually and fully indexed. The remaining documents are then categorized using a type of software known as a k-Nearest Neighbor (kNN) classifier, which compares the uncategorized material with the already categorized training documents.

The "k" closest matches among the training documents (k is usually about 50) are used to rank the possible categories for each new document, based on the frequency of key words and phrases. Typically, any one document will land in multiple categories. How readily this happens can be adjusted according to the desired balance of accuracy and coverage. Better coverage means users are more likely to find a document even if they don't select the best category. Better accuracy means that the documents listed in a particular category are more likely to be relevant to that category. If the categorizer is not used in a fully automatic mode, it can present a human classifier with a ranked list of categories from which to choose.
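To make the idea concrete, here is a minimal sketch of kNN categorization in Python. It is not WebCat's code: the article does not say how similarity is measured or how votes are combined, so the cosine similarity over word counts, the summed-similarity voting, and the names `rank_categories`, `assign` and `threshold` are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tf_vector(text):
    """Bag-of-words term-frequency vector (lowercased whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_categories(doc, training, k=50):
    """Rank categories by summed similarity over the k nearest training docs.

    training: list of (text, [categories]) pairs categorized by hand.
    Assumption: votes are weighted by similarity; the article gives no formula.
    """
    query = tf_vector(doc)
    scored = [(cosine(query, tf_vector(text)), cats) for text, cats in training]
    neighbors = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    totals = defaultdict(float)
    for sim, cats in neighbors:
        if sim > 0:  # ignore neighbors that share no words with the document
            for cat in cats:
                totals[cat] += sim
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def assign(doc, training, k=50, threshold=0.5):
    """Keep every category scoring at least `threshold` times the best score.

    A lower threshold places the document in more categories (better
    coverage); a higher one keeps only the strongest (better accuracy).
    """
    ranked = rank_categories(doc, training, k)
    if not ranked:
        return []
    best = ranked[0][1]
    return [cat for cat, score in ranked if score >= threshold * best]
```

Calling `rank_categories` alone gives the ranked list a human classifier could choose from, while `assign` corresponds to the fully automatic mode.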

The number of training documents and the care with which they have been categorized are key to how effectively and efficiently the categorization system will work. The researchers expect to index at least 2,600 training documents by hand. From this core, WebCat should be able to automatically place the balance of IBM Research's intranet documents into about 330 categories.

STALKING EFFICIENCY

Setting up the taxonomy and populating it with training documents is, says team member John Prager, a challenge in itself. So the team is developing tools to help find appropriate training documents and relate them, through a scoring mechanism, to each category. The team is also developing tools to help administrators apply WebCat efficiently. Among them are tools for managing and manipulating the taxonomy, assigning or relocating training documents, generating reports about the taxonomy and coordinating work among multiple team members.

Another benefit of the taxonomy developed by WebCat is that it will quickly uncover "holes" in an online document collection -- subjects that are referenced frequently by other intranet pages but fail to show up as distinct categories in the taxonomy. This might happen if, for example, one department failed to make its papers available on the intranet -- an eventuality that team member Edward So predicts will become less common as word of WebCat's usefulness spreads.

Authors at IBM Research will soon be encouraged to categorize their own documents. By inserting category metatags into their documents, authors can greatly improve a document's chance of being correctly categorized. WebCat will even incorporate tools that suggest appropriate categories to the author. With further development, WebCat will automatically unearth mislabeled documents and reassign them properly. Human judgment, of course, will be allowed to override the automatic process.
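As a rough illustration of how author-supplied categories might be read back out of a page, the sketch below uses Python's standard `html.parser` module. The article does not specify the metatag format, so the tag shape `<meta name="category" content="...">`, the comma-separated content, and the names `CategoryMetaParser` and `author_categories` are hypothetical.

```python
from html.parser import HTMLParser


class CategoryMetaParser(HTMLParser):
    """Collects categories from <meta name="category" content="A, B"> tags.

    The tag name and comma-separated format are assumptions for illustration.
    """

    def __init__(self):
        super().__init__()
        self.categories = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "category":
            content = attributes.get("content") or ""
            self.categories.extend(
                part.strip() for part in content.split(",") if part.strip()
            )


def author_categories(html):
    """Return the list of categories an author declared in the page."""
    parser = CategoryMetaParser()
    parser.feed(html)
    return parser.categories
```

A categorizer could treat the returned list as a strong hint and still compare it with its own ranking to flag likely mislabels.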

A further WebCat innovation is its use of hyperlinks. In addition to examining document text, it will search and categorize by hyperlinks and by the information that those links lead to. "Some pages, such as a table of contents, might not have enough text for reliable automatic categorization," says Brown. "But they might link to related pages with more textual content, which can be used to categorize the original page."
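The approach Brown describes can be sketched in a few lines of Python: when a page has too little text of its own, borrow the text of the pages it links to before categorizing. The function name, the `min_words` cutoff and the dictionary-based inputs are assumptions for illustration, not details from WebCat.

```python
def text_for_categorization(url, pages, links, min_words=20):
    """Return a page's own text, padded with linked pages' text if sparse.

    pages: dict mapping URL -> page text
    links: dict mapping URL -> list of URLs that the page links to
    min_words: hypothetical cutoff below which a page counts as sparse
    """
    own = pages.get(url, "")
    if len(own.split()) >= min_words:
        return own
    # Borrow text only from link targets that are actually in the collection.
    linked = [pages[target] for target in links.get(url, []) if target in pages]
    return " ".join([own] + linked).strip()
```

The combined text can then be handed to whatever categorizer is in use, with the result attached to the original page.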

FLEETER STILL

Another tool under development for WebCat is a means of clustering documents into groups, and then assigning each group to a relevant category. This will improve the speed at which training documents can be developed and all the documents in a system can be categorized. Clustering uncategorized documents or even documents in a heavily populated category can reveal new topics that should be added to the taxonomy. Moreover, the documents in a cluster represent a natural training set for the category. And once documents have been clustered, it may be necessary to categorize only a few sample documents in order to place the entire cluster in the proper category. Another idea on the table is to enable users to continually suggest new categories, refining the taxonomy as they go.
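A minimal Python sketch of this cluster-then-label idea follows. The article does not name a clustering algorithm, so the greedy single-pass grouping by word overlap (Jaccard similarity), the `min_similarity` cutoff and the majority-vote labeling are stand-ins chosen for brevity.

```python
from collections import Counter


def jaccard(a, b):
    """Word-set overlap: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(docs, min_similarity=0.3):
    """Greedy single-pass clustering (an assumed, deliberately simple method).

    Each document joins the first cluster whose seed document is similar
    enough; otherwise it starts a new cluster. Returns lists of doc indices.
    """
    word_sets = [set(doc.lower().split()) for doc in docs]
    clusters = []
    for i, words in enumerate(word_sets):
        for members in clusters:
            if jaccard(words, word_sets[members[0]]) >= min_similarity:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def label_clusters(clusters, sample_labels):
    """Spread the majority label of a few hand-labeled samples over a cluster.

    sample_labels: dict mapping doc index -> category for the samples only.
    Clusters with no labeled sample are left uncategorized.
    """
    labels = {}
    for members in clusters:
        votes = Counter(sample_labels[i] for i in members if i in sample_labels)
        if votes:
            category = votes.most_common(1)[0][0]
            for i in members:
                labels[i] = category
    return labels
```

A cluster that ends up with no label is also a candidate for a new topic in the taxonomy.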

WebCat is expected to be ready for trials on the Research intranet in early 1999. The next step will be to develop the thousands of training documents needed to seed and organize a taxonomy for those 2 million documents now available on the entire IBM intranet -- somewhere.


Chuck Boyer, former editor of IBM's Think magazine, is a freelance writer living in Maine.

