Untangling the Web
A Cooperative Approach to Surfing the Web May Be the Key to Finding the Most Useful Information
By Steve Mirsky
Oh, what a tangled Web we have woven, and continue to weave. With more than 100 million pages and growing, the Web is a mountain of information. But extracting useful information can be a daunting task. Stephen C. Gates, a scientist at the Thomas J. Watson Research Center, wants to spare users the frustration of fruitless searching by providing them with potential leads to interesting Web sites about their favorite topics. To that end, he has developed a program that enables individuals to take advantage of the recommendations of other people who have similar interests.
"There is a great deal of information in any organization or group that does not get shared," Gates says. "In particular, there are people who have answers to other people's questions, but in most instances people have no way of sharing that knowledge. My application is a relatively painless way of getting people to share information about valuable resources on the Web and making it available to all users."
Such added value would clearly be welcome for Web surfers. A typical search, even an advanced one with multiple search terms, may turn up hundreds or thousands of pages, the vast majority of which are likely to be irrelevant or of low quality. For example, an AltaVista(TM) search on "intelligent agents" results in 16,731 hits. Even a more focused search on "intelligent agents" + "design" produces more than 8,000 hits. The user has little or no way to tell which of these will be useful.
Ideally, such a Web searcher would like to zero in on exactly those sites of interest to people concerned with designing intelligent agents and ignore the rest. Gates's program aims to enable users to do just that, by providing a means for people to share their Web-related expertise on any subject. The underlying process is known as collaborative filtering. In one form of collaborative filtering, users rank the value of information, such as Web sites, so that other users can benefit from the group's judgments.
"The whole point of a collaborative filtering system is to take advantage of the fact that other people have found things you haven't found yet and have expended some effort sorting out the more useful from the less useful," Gates says. In effect, users share each other's recommendations, thereby learning about Web pages whose existence they might never have guessed and avoiding time wasted on unproductive sites. Such recommender systems, as they are also called, work only if the group among which information is shared does in fact have common interests. Gates's system creates "virtual groups" of people with shared interests by looking at their voting patterns.
Many of the existing recommender systems, such as Firefly, share information on people's tastes in certain subjects, for example, music and film. "My system, however, is intended to apply to the entire Web," says Gates. The remarkable feature of recommender systems is that users can share knowledge without directly communicating it to others. Rather, it is the system that keeps track of the users' opinions and infers from them which information should be provided to which users.
That doesn't mean that individuals don't have to do anything at all to benefit from the system. Each time they visit a site, users are asked to rate it based on its usefulness. That contribution on the part of users is crucial, because without it there is nothing to share. "Everyone wants the information; that's not the hard part," says Gates. "The hard part is getting them to share their knowledge."
Voting, therefore, has to be easy or people won't do it, especially since their reward (the higher quality of the recommendations that they themselves would eventually receive) emerges only over time. To keep the barriers to participating in that communal effort as low as possible, Gates chose a simple 1 to 4 rating scale, where 3 and 4 mean that the user found the page somewhat or very useful, and 1 and 2 mean that the user found it not at all or just slightly useful. Wizard, Hot, Mild and Not are the shorthand labels attached to the numerical ratings.
Voting thus requires only a fraction of a second for a mouse click, and it takes place in a special window. Along with "buttons" for registering one's vote, the window contains other useful information about the page, such as the average vote of all users and one's own previous vote, if it exists.
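The bookkeeping behind such a voting window can be sketched in a few lines of Python. This is only an illustration, not Gates's implementation: the VoteBook class, its method names and the in-memory storage are hypothetical, and the pairing of each label with a number (Wizard as 4 down to Not as 1) is an assumption, since the article gives the labels without stating the mapping.

```python
from statistics import mean

# Labels from the article; which number each label belongs to is an assumption.
LABELS = {4: "Wizard", 3: "Hot", 2: "Mild", 1: "Not"}


class VoteBook:
    """Hypothetical in-memory store: page URL -> {user id -> rating}."""

    def __init__(self):
        self.votes = {}

    def cast(self, user, page, rating):
        """Record a vote; a repeat vote replaces the user's earlier one."""
        if rating not in LABELS:
            raise ValueError("rating must be 1, 2, 3 or 4")
        self.votes.setdefault(page, {})[user] = rating

    def summary(self, user, page):
        """Return the average vote of all users and this user's own prior vote."""
        page_votes = self.votes.get(page, {})
        average = mean(page_votes.values()) if page_votes else None
        return {"average": average, "previous": page_votes.get(user)}


book = VoteBook()
book.cast("u1", "http://example.com/agents", 4)
book.cast("u2", "http://example.com/agents", 3)
print(book.summary("u1", "http://example.com/agents"))
# prints {'average': 3.5, 'previous': 4}
```

A user who has not yet voted on a page simply sees None as the previous vote, which matches the "if it exists" wording above.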
Finally, a list of Web articles topically related to the page one has just evaluated is displayed. In addition, after one has voted a few times, the system is able to place a person in a "virtual group" of people with similar interests, based on their voting behavior. Articles highly ranked by members of the group are shared, thereby revealing the links of most interest to people with similar tastes in a particular subject area.
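One simple way to form a virtual group from voting patterns is to compare users on the pages they have both rated. The Python sketch below is an assumption-laden illustration, not the method used in Gates's system: the cosine similarity measure, the centering of votes on the scale midpoint of 2.5, the 0.5 threshold and all function names are hypothetical choices.

```python
from math import sqrt


def similarity(a, b, min_shared=2):
    """Cosine similarity of two users' votes on pages both have rated.

    Votes (integers 1 to 4) are centered on 2.5, so "useful" and
    "not useful" votes pull in opposite directions.
    """
    shared = set(a) & set(b)
    if len(shared) < min_shared:
        return 0.0
    xs = [a[p] - 2.5 for p in shared]
    ys = [b[p] - 2.5 for p in shared]
    dot = sum(x * y for x, y in zip(xs, ys))
    norm = sqrt(sum(x * x for x in xs)) * sqrt(sum(y * y for y in ys))
    return dot / norm


def virtual_group(user, ratings, threshold=0.5):
    """Users whose voting pattern resembles this user's."""
    me = ratings[user]
    return [other for other, votes in ratings.items()
            if other != user and similarity(me, votes) >= threshold]


def recommend(user, ratings, threshold=0.5):
    """Pages the group rated 3 or 4 that the user has not voted on yet."""
    me = ratings[user]
    found = {}
    for other in virtual_group(user, ratings, threshold):
        for page, rating in ratings[other].items():
            if page not in me and rating >= 3:
                found.setdefault(page, []).append(rating)
    # Best average rating first; ties broken by page name.
    return sorted(found, key=lambda p: (-sum(found[p]) / len(found[p]), p))


ratings = {
    "ann": {"a": 4, "b": 1, "c": 3},
    "bob": {"a": 4, "b": 2, "d": 4},
    "cat": {"a": 1, "b": 4, "e": 4},
}
print(virtual_group("ann", ratings))  # prints ['bob']
print(recommend("ann", ratings))      # prints ['d']
```

Note that ann never learns who bob is; she only receives page "d". The code returns group members for clarity, but a deployed system would keep that list internal.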
Privacy
Obviously, any system with the power to profile users raises the issue of privacy. However, Gates points out, the system does not reveal what kind of person you are to other Web users or to advertisers. "I match you up with people like you, without telling you who your virtual friends are," Gates explains. "Complete anonymity is necessary in this system, so people can come and go without being publicly identified. I may be voting on perfectly innocent things, but I may not want my voting record out there. We want to maintain really strict privacy controls."
The ability to protect users' privacy while allowing advertisers to reach people with specific tastes and interests is what would allow the system to be made available for free to users. Advertisers, knowing that they can send an ad for their products directly to the people who have exhibited interest in that subject area, would be willing to subsidize the service. The advantage to the advertisers is that they get the chance to join a "targeted" virtual group. "But they will not know who you are," Gates emphasizes.
With its benefits to users, information providers and advertisers, the recommender system promises to be not only an extraordinarily powerful tool for untangling the Web, but a boon for e-business, as well.
Steve Mirsky is a freelance writer based in New York City.
A Collaborative Undertaking
The recommender system that Steve Gates is working on is itself a collaborative undertaking that incorporates work done elsewhere in Research. A key component is WBI, for Web Browser Intelligence. Developed at the Almaden Research Center by a team led by Rob Barrett, WBI is what is called an intelligent proxy, an intermediary that exists between the user's Web browser and the Web itself. The intermediary receives requests from the browser and then manipulates the data to enhance the value for the particular user.
But it is even more general than that. "WBI," explains Barrett, "allows a software developer to create novel Web applications and to deploy them. As such, it functions as Web middleware that can be used for a wide variety of tasks. In the case of Gates's recommender system, WBI customizes the user's view of the Web by enabling personalized recommendations and a voting system." Many other applications have also been built with WBI, including document format conversions, personal histories and password management.
Another group at Almaden led by Byron Dom has also contributed to the underlying technology of the system. "The goals of our research are to find the most valuable sites based on the internal relations among Web sites and to find better ways to categorize the existing information," says Dom. Both of these objectives have resulted in algorithms used in Gates's recommender system.
The first algorithm exploits the mutual endorsements between "authorities" and "hubs." The former are Web sites frequently cited by other sites and therefore having many links leading to them, whereas the "hubs" are pages containing collections of resources on a topic of broad interest that point to the authorities.
In collaboration with members of Dom's group, Jon Kleinberg, now on the faculty at Cornell University, developed an algorithm that examines the link structure to determine the importance and identities of these authorities and hubs. For example, when the sites listed under a search for "java" are analyzed by this technique, it chooses Sun's Java site and the Gamelan applet repository as two of the most important authorities, out of the millions of pages containing the word "java."
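The mutual endorsement between hubs and authorities lends itself to a short iterative computation. The Python sketch below is a minimal illustration under stated assumptions, not the researchers' code: the function name, the fixed iteration count and the toy link graph are all hypothetical. Each round, a page's authority score becomes the sum of the hub scores of the pages linking to it, and its hub score becomes the sum of the authority scores of the pages it links to.

```python
from math import sqrt


def hubs_and_authorities(links, iterations=50):
    """links maps each page to the pages it links to; returns (hub, auth)."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A good authority is pointed to by good hubs.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # A good hub points to good authorities.
        hub = {p: sum(auth[dst] for dst in links.get(p, ())) for p in pages}
        # Rescale both score sets to unit length so they stay bounded.
        for scores in (auth, hub):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Toy graph: three resource lists pointing at two candidate authorities.
links = {"h1": ["x", "y"], "h2": ["x", "y"], "h3": ["x"], "x": [], "y": []}
hub, auth = hubs_and_authorities(links)
print(max(auth, key=auth.get))  # prints x
```

In the toy graph, page "x" earns the top authority score because all three lists cite it, and "h1" and "h2" outrank "h3" as hubs because they point to both authorities.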
Meanwhile, Soumen Chakrabarti, in collaboration with Dom and Piotr Indyk, has been developing a hierarchical topic analyzer called TAPER (for Taxonomy And Path Enhanced Retrieval). "TAPER achieves high speed and accuracy by means of two techniques," says Chakrabarti. "First, at each node in the topic directory, TAPER identifies a few words that, statistically, are the best indicators of the subject of a document. It then 'tunes in' to only those words in new documents and ignores 'noise' words. The second technique guesses the topic of a page based not only on its content but also on the contents of pages in its hyperlink neighborhood." Compared to previous systems, TAPER substantially improves the accuracy of Web topic assignment.
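The first of those two techniques, tuning in to a few indicator words at one node of the topic directory, can be illustrated with a toy Python sketch. It is not TAPER itself: the "spread" score used to pick words is a crude stand-in for whatever statistic TAPER applies, the smoothed word-count classifier is an assumed choice, and the function names and sample documents are hypothetical. The second technique, using the hyperlink neighborhood, is not shown.

```python
from collections import Counter
from math import log


def select_words(docs_by_topic, k):
    """Pick the k words whose relative frequency differs most across topics."""
    freq = {}
    for topic, docs in docs_by_topic.items():
        counts = Counter(w for d in docs for w in d.split())
        total = sum(counts.values())
        freq[topic] = {w: c / total for w, c in counts.items()}
    vocab = set().union(*freq.values())

    def spread(word):
        values = [freq[t].get(word, 0.0) for t in freq]
        return max(values) - min(values)

    # Largest spread first; ties broken alphabetically.
    return sorted(vocab, key=lambda w: (-spread(w), w))[:k]


def classify(text, docs_by_topic, words):
    """Score each topic using only the selected words; ignore the rest."""
    keep = set(words)
    tokens = [w for w in text.split() if w in keep]
    best_topic, best_score = None, float("-inf")
    for topic, docs in docs_by_topic.items():
        counts = Counter(w for d in docs for w in d.split() if w in keep)
        total = sum(counts.values()) + len(keep)  # add-one smoothing
        score = sum(log((counts[w] + 1) / total) for w in tokens)
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic


docs = {
    "java": ["java applet code the web", "the java compiler code"],
    "coffee": ["coffee bean roast the cup", "the coffee cup aroma"],
}
words = select_words(docs, 4)
print(words)  # prints ['code', 'coffee', 'cup', 'java']
print(classify("the java code on the web", docs, words))  # prints java
```

The word "the" is equally common in both topics, so it scores a spread of zero and is treated as noise, while "java" and "code" decide the classification.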