IBM's latest search technology sniffs out the best sites.
In cyberspace, sorting out relevant information from the babble can be a frustrating exercise. That's the challenge that researchers
at IBM's Almaden Research Center took on three years ago when they teamed up to create a discriminating search engine
called Clever. The prototype Clever system uses algorithmic methods to evaluate Web sites by examining the sites
to which they link and from which they are linked. As a result, says Byron Dom, manager of information management principles
and a member of the Clever team, "The system finds the good stuff on the Web."
Ordinary search engines typically attempt to rank search results in a useful way, but they are easily fooled. They rank results according to a scoring function based on whether a term appears in a title, how close it appears to the beginning of the document and how many times it appears in that document. Important resources can fall through the cracks, and superficial results get more prominent ranking than they should. IBM's home page, for example, doesn't contain the word "computer," so a search for that word wouldn't yield IBM.com. Yet a Web site for a small-time retailer with many references to computers could rank high on the list. Some Web developers even embed words or phrases repeated over and over in invisible colors and fonts to make their sites score artificially high in search results.
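To see why such scoring is easy to fool, consider a minimal sketch of a keyword scoring function built from the three signals described above: title match, closeness to the start of the document, and number of occurrences. The function name, the weights and the sample pages are all hypothetical; no real search engine's formula is implied.

```python
import string


def _tokens(text):
    """Lowercase the text and strip surrounding punctuation from each word."""
    return [w.strip(string.punctuation) for w in text.lower().split()]


def keyword_score(term, title, body,
                  title_weight=3.0, position_weight=2.0, count_weight=1.0):
    """Toy score from title match, first position and occurrence count.

    The weights are illustrative assumptions, not values from any engine.
    """
    term = term.lower()
    words = _tokens(body)
    score = 0.0
    if term in _tokens(title):
        score += title_weight
    positions = [i for i, w in enumerate(words) if w == term]
    if positions:
        # An earlier first occurrence earns a larger share of position_weight.
        score += position_weight * (1 - positions[0] / len(words))
        score += count_weight * len(positions)
    return score


# Hypothetical pages: one never uses the word, one uses it naturally,
# and one repeats it purely to inflate its score.
corporate = keyword_score("computer", "IBM", "Welcome to IBM solutions and services")
retailer = keyword_score("computer", "Bob's Computer Shop",
                         "We sell computer parts and computer repairs")
stuffed = keyword_score("computer", "Cheap deals", "computer " * 50)
```

In this toy setup the page that never uses the word scores zero, while the page stuffed with repetitions outscores the one that uses the word naturally.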
Web directories like Yahoo!® -- which organize sites according to a taxonomy, or hierarchy of topics -- are more selective; they can produce a ranking of useful Web sites. But because they employ human beings to do the selecting, such directories are costly and time-consuming to assemble.
In contrast, Clever attempts to ensure that the information it retrieves is useful by pointing people toward either of two classes of sites: authorities and hubs. An authority is a site to which many other sites have links, which Dom sees as implied endorsements of the site's usefulness. A hub is a site that has links to many other sites, and is therefore a potentially good reference. Clever's job is to identify the best hubs (those that link to the best authorities) and the best authorities (those that are linked to by the best hubs).
Ranking hubs and authorities in this fashion is an iterative process. Clever starts with 200 pages that are the result of an ordinary keyword search. It then adds all pages that link to, or are linked to by, one of those 200 pages. This step typically swells the set of pages to 1,000 or more. Clever initially assigns each page a hub score of one and an authority score of one. To get a page's hub score, it sums the authority scores of all the pages that page links to; to get a page's authority score, it sums the hub scores of all the pages that link to it. "Then," says Dom, "we go back and forth," repeating the process some five times until the system has identified the hubs that link to the top-scoring authorities and the authorities that are linked to by the top-scoring hubs.
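The back-and-forth scoring can be sketched in a few lines of Python. This is a simplified illustration rather than Clever's actual code: it assumes the expanded set of pages has already been gathered into a dictionary of links, the page names are invented, and the normalization step (which keeps the scores from growing without bound) is an added assumption that the description above does not mention.

```python
import math
from collections import defaultdict


def _normalize(scores):
    """Scale scores to unit length (an assumption, not stated in the article)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}


def hub_authority_scores(links, iterations=5):
    """links maps each page to the pages it links to. Returns (hub, authority)."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    outbound = {p: set(links.get(p, ())) for p in pages}
    inbound = defaultdict(set)
    for src, targets in outbound.items():
        for dst in targets:
            inbound[dst].add(src)

    # Every page starts with a hub score of one and an authority score of one.
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        auth = {p: sum(hub[q] for q in inbound[p]) for p in pages}
        # Hub score: sum of the authority scores of the pages linked to.
        hub = {p: sum(auth[q] for q in outbound[p]) for p in pages}
        auth, hub = _normalize(auth), _normalize(hub)
    return hub, auth


# Hypothetical four-page web: two link lists pointing at two resources.
example_links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hub_authority_scores(example_links)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```

Here `a1`, which both link lists point to, ends up as the top authority, and `h1`, which points to both resources, ends up as the top hub.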
PUTTING CLEVER TO WORK
One implementation of Clever, called TaxMan, for Taxonomy Manager, is a user interface for people who want to build and maintain a taxonomy
for a particular field. TaxMan adds search features specific to that field and acquires and updates pages automatically.
Besides looking for key words, the program can hone a search by attempting to match example sites and by excluding sites
like Yahoo! that come up often in links but that distort the search process. Once set up, the system can run once a week
to eliminate dead pages (which are identified by messages such as "server not found") and find pages that have risen in rank.
IBM's Global Procurement (GP) group is running TaxMan in a pilot program designed to build and maintain a directory of Web pages related to commodities, such as DRAMs and memory, that IBM uses in manufacturing. GP is using TaxMan to set up commodity profiles to search for pertinent sites on the Web for its commodity teams. Current searches yield far more information than the teams can sort through, and GP is hoping that TaxMan will help build profiles more efficiently.
Clever is also the backbone of two other Almaden projects. One of them is designed to discover communities of people with common
interests (see sidebar, "Tracking online communities"). The other project, Focused Crawler, is a high-end version of Clever that
uses an automatic document categorizer to recognize pages relevant to a particular topic. A system administrator feeds the
categorizer exemplary documents and trains it to recognize documents that are similar. The crawler can take days to run but
will return a comprehensive set of pages on a specific set of topics. A focused crawl on baseball, for instance, would begin
with a set of documents about baseball. The categorizer would then use the terms and links within the set to build a model
of what constitutes a page about baseball. The crawler follows links from this training set to find more pages about baseball, using the categorizer to assess the relevance of each page to the topic. The crawler then follows the links from pages in this new set according to each page's relevance and the rating Clever assigns it as a hub.
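The crawl order described here resembles a best-first search, which can be sketched with a priority queue. Everything in this example is a stand-in: `get_links`, `relevance` and `hub_score` are hypothetical callables representing the link extractor, the trained categorizer and Clever's hub rating, and the way relevance and hub rating are combined into one priority is an assumption, since the article does not give a formula.

```python
import heapq


def focused_crawl(seeds, get_links, relevance, hub_score,
                  max_pages=100, min_relevance=0.5):
    """Best-first crawl that expands relevant, hub-like pages first."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    accepted = []
    while frontier and len(accepted) < max_pages:
        _, url = heapq.heappop(frontier)
        rel = relevance(url)
        if rel < min_relevance:
            continue  # off-topic page: do not keep it or follow its links
        accepted.append(url)
        # Assumed combination: relevance boosted by the page's hub rating.
        priority = rel * (1.0 + hub_score(url))
        for nxt in get_links(url):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-priority, nxt))
    return accepted


# Hypothetical miniature web and categorizer output.
toy_web = {"seed": ["bb1", "cooking"], "bb1": ["bb2"],
           "cooking": ["recipes"], "bb2": [], "recipes": []}
toy_relevance = {"seed": 1.0, "bb1": 0.9, "bb2": 0.8,
                 "cooking": 0.1, "recipes": 0.1}

crawled = focused_crawl(["seed"], toy_web.get, toy_relevance.get,
                        hub_score=lambda url: 0.0)
```

In this toy run the crawler keeps the three baseball-like pages and never visits `recipes`, because the off-topic page that links to it is not expanded.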
Dom says the Focused Crawler is designed for anyone who wants a comprehensive set of pages about a given set of topics. It was developed to support a special-purpose portal that notifies users when a new page appears about a particular subject.
Clever in its basic form is undergoing trials on IBM's intranet and is already being used by several IBM business units.
Rebecca Day is a technology writer who lives in Chestnut Ridge, New York.
SIDEBAR: Tracking online communities
IBM's Clever search technology can do more than simply
hunt down useful Web sites. Another application of link analysis makes it possible to identify online special-interest communities as they form, and to track their evolution. This is the intent of a process called trawling developed at IBM's Almaden Research Center. "We think of it as a tool to discover when interests emerge on the Web," says Andrew Tomkins, a member of the Clever team. Such a tool could find a range of uses, including finely targeted marketing.
Communities arise as a result of shared interests. For example, a person in New York might create a Web page about Georgian coffee tables. Someone in Latvia might do likewise. Chances are, the two enthusiasts will list hyperlinks to many of the same sites. "As soon as there are a couple of people who care about this topic, our mining tool enables us to notice that automatically," Tomkins says. "We believe that our technique would allow us to discover a Georgian coffee-table community even before the New Yorker and the Latvian knew that one was forming."
Trawling works by sifting through data uncovered by a Web crawler, a program that finds Web pages and their network of links. Web-based communities are identified by commonalities of links. "That's the footprint of a community -- the fact that several people choose to link to the same set of pages," says Tomkins. "Once we've found the tiny 'core' of a new community -- which is often as few as five pages -- we can use Clever to grow that core into the entire community."
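The "footprint" Tomkins describes, several pages linking to the same set of pages, can be illustrated with a small sketch. This is not the Almaden team's algorithm, which the article does not detail; it is a brute-force illustration that would be far too slow for real crawl data, and the function name, thresholds and page names are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def find_cores(links, min_fans=2, num_targets=2):
    """Find target sets that at least `min_fans` pages all link to.

    links maps each page to the pages it links to. Returns a dict from
    each shared tuple of targets to the set of pages linking to all of them.
    """
    fans_of = defaultdict(set)
    for page, targets in links.items():
        # Record every group of `num_targets` pages this page links to.
        for combo in combinations(sorted(set(targets)), num_targets):
            fans_of[combo].add(page)
    return {combo: fans for combo, fans in fans_of.items()
            if len(fans) >= min_fans}


# Hypothetical enthusiasts' pages and the sites they link to.
sample_links = {
    "new_york_fan": ["tables_museum", "antique_dealer", "auction_house"],
    "latvia_fan": ["tables_museum", "antique_dealer", "woodworking_forum"],
    "unrelated": ["weather", "news"],
}
cores = find_cores(sample_links)
```

With these sample links, the only core found is the pair of sites that both enthusiasts link to, which would then serve as the seed for growing the full community.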
Because trawling is an automatic process, it can uncover communities much faster and on a much larger scale than a directory like Yahoo!, which relies on human specialists to identify and research Web sites and links. "We can find a community on average within a week of its emerging," Tomkins says. Getting a page listed in Yahoo!, by contrast, can take up to a year because of the sheer volume of information to be sorted through. "Up to a million pages are created from scratch every day," Tomkins notes.
The applications for trawling are broad. Not only will companies be able to pitch to a tightly defined market, but individuals or groups will easily be able to find other people with similar interests or causes. For example, says Tomkins, "We were surprised to find there was an Internet community of people who are concerned about Japanese offshore oil spills. That could be of interest to firms selling products in that area, or to activist groups who want to send out information, or to image-conscious companies that want to be aware when grass-roots action is starting."
The ability to track the evolution of an Internet community is also useful from a sociological point of view. By constantly monitoring the emergence and scattering of communities over time, Internet experts can learn more about the rate at which people create content for the Web and the rate at which that content shifts. "The communities tend to change over time, and our tool lets us know how quickly that happens -- when new ones pop up and old ones fade away," Tomkins says. For example, he notes, a huge community formed around Monica Lewinsky last year and is now dying off. "Some communities will be marginal, and some will be hugely popular," he says. "If your goal is to make money on the Web, you need to know where people are spending their time and what their behavioral patterns are."
The first trawl of the Web, which took three days on a PC, identified a quarter of a million communities, far more than the 20,000 or so found on commercial Web directories. This was partly because trawling is more sensitive to the formation of nascent communities and partly because it manages to identify smaller communities within the large communities listed by the commercial directories. Of the sampled communities, more than half were not in Yahoo! in any form at the time of the crawl, and 18 months later Yahoo! had uncovered only about two-thirds of them, suggesting that trawling does a better job of identifying nascent communities than human researchers do. Tomkins and his colleagues are now refining the tool's applications.