IBM Research

Think Research


 


Featured Concept
Information on the fast track
COVER STORY: Mastering the Web: New Tools Tailor Information to User's Needs

By Bruce Schechter

Worried that your web searches produce thousands of irrelevant items? A new IBM system called Grand Central Station takes an active role in providing the precise information you require as soon as it becomes available.

Ever since the Digital Big Bang - that is, the creation of the World Wide Web - the cyber universe has been expanding at a rate that makes the expansion of our cosmos of stars and galaxies look sluggish. The Internet has doubled in size every year for the past eight years and shows no sign of slowing down. The World Wide Web, the most colorful part of the Internet, now contains more than 10 terabytes of information. Hundreds of times more information is spinning on the hard drives of computers that can be reached via the Internet. Some of this information is public, and some resides on private intranets that users can access only through corporate firewalls.

Hidden in this galaxy of data are facts that can help you stay ahead of the competition, save you money, inform you, amuse you or simply keep you occupied. The problem is keeping up with the useful information without drowning in a sea of irrelevancies. The current tools for finding information on the Internet are known as search engines. Most require one to formulate a query - for example, "Find all articles in which IBM's Research labs are mentioned." The inquirer then runs the query and sifts through the results.

A typical query returns hundreds of results, most of them irrelevant. "If you do a query on 'rainbow' and 'weather' using AltaVista (TM) [a leading search engine from Digital Equipment Corp.]," says Norm Pass, a manager at IBM's Almaden Research Center, "you can end up getting references to bridges!" Efforts to narrow the scope of searches, and hence obtain more refined answers to queries (see below), can have the opposite effect of missing available information.

To make matters worse, keeping track of changing information requires that searches be repeated frequently. Most people simply do not have the time. "If you didn't do that this morning," says Almaden's Daniel Ford, "you are behind the game."

Gathering and Summarizing

To help users of the Internet to keep up, Pass, Ford, Qi Lu, Toby Lehman and other Almaden researchers have launched a project called Grand Central Station. "The idea is to enable people to express an interest just in something that's important to them," explains Lehman. "It's a very individual thing. For me, it's knowing the IBM stock price and developments in my project. I'd like to know whenever the project gets mentioned in the external world or internally at IBM. Also, I want to know if my copyright is showing up someplace." Once these individual requests are formulated, Grand Central Station builds a profile of the user and keeps him or her informed whenever something new appears on the digital horizon.

The GCS system consists of two basic components. One constantly gathers and summarizes new information. The other matches this information against the profiles of individual users and delivers the relevant tidbits.

An innovative aspect of GCS is that it does not deal only with the user's desktop machine. The technology can also push the information to devices such as Personal Digital Assistants (PDAs) that have a connection to the Internet. A PDA owner might check the latest sports scores, traffic conditions and weather on the way home from work. The Almaden team calls the concept of having information available as needed "just-in-time information."

The parts of GCS that go out and look for information on sales figures, airport directions, patent citations and box scores are devices - actually computer programs running on workstations - known as Gatherers and based on the University of Colorado's Harvest system (see below). To handle the information explosion, GCS splits up the task of searching among several Gatherers. "The idea is that, because the digital universe is big, there's a lot of room for expanding searches," says Ford.

Vacuuming The Web

The first thing a Gatherer does is vacuum up all available information. All search engines currently available on the Internet work in one of two ways. "Crawlers," such as Digital's AltaVista and Wired's HotBot, try to visit every site on the Web, indexing all the information they find there. Hierarchical engines like Yahoo are more like card catalogs. A staff of librarians constantly scans the Web and places sites into an information hierarchy of its own devising.

Crawlers suffer from Pass's "rainbow" syndrome, producing too many irrelevant hits. Hierarchical engines suffer from the opposite problem: they can miss information that does not fit into their carefully constructed schema. But both types of engine share a major shortcoming: they simply ignore most of the information in the digital universe. GCS uses a crawler designed to sniff out obscure information that other search engines miss. The crawler can communicate using most of the popular network protocols, which enables it to access information from a variety of data sources such as Web servers, FTP servers, database systems, news servers and even CICS transaction servers. It can, for example, track down vast file systems on machines in dozens of formats that are not part of the graphical World Wide Web. This data can take the form of corporate presentations, database files, Java bytecode, tape archives and much more. "GCS has really got an attitude," Ford says. "It goes after everything and doesn't stop. It's the Terminator of crawlers."

The crawler passes the information that it discovers to the next part of the Gatherer. Called the Recognizer, this determines what kind of information - database files, web documents, emails, graphics or sounds - the crawler has unearthed. It passes this information on to the Selector. This filters the information to remove irrelevant material before handing it off to the Summarizer.
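The hand-offs inside a Gatherer can be pictured as a small pipeline. The sketch below is only an illustration of the stages named above; the `Item` record, the suffix-based recognition rule and the filtering rule are all assumptions made for the example, not details of the GCS implementation.

```python
from dataclasses import dataclass


@dataclass
class Item:
    url: str        # where the crawler found the item
    content: bytes  # raw bytes as fetched


def recognize(item: Item) -> str:
    """Recognizer: guess a coarse data type from the URL suffix (hypothetical rule)."""
    suffixes = {".html": "web", ".htm": "web", ".eml": "email",
                ".gif": "graphic", ".wav": "sound", ".db": "database"}
    for suffix, kind in suffixes.items():
        if item.url.lower().endswith(suffix):
            return kind
    return "unknown"


def select(item: Item, kind: str) -> bool:
    """Selector: drop items judged irrelevant; here, empty or unrecognized ones."""
    return kind != "unknown" and len(item.content) > 0


def summarize(item: Item, kind: str) -> dict:
    """Summarizer: produce a uniform summary record, whatever the data type."""
    return {"url": item.url, "type": kind, "size": len(item.content)}


def gather(items):
    """Run each crawled item through Recognizer, Selector and Summarizer in turn."""
    summaries = []
    for item in items:
        kind = recognize(item)
        if select(item, kind):
            summaries.append(summarize(item, kind))
    return summaries
```

Keeping the three stages as separate functions mirrors the division of labor in the text: each stage can be replaced without touching the others.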

The Summarizer is actually a collection of plug-in programs (in which the appropriate program is "plugged in" to handle a particular data type) that takes each of the data types the Recognizer can recognize and produces a summary represented in a metadata format known as the SOIF (Summary Object Interchange Format). Future versions of the Summarizer will produce summaries in XML/RDF (Extensible Markup Language/Resource Description Framework), an emerging standard for metadata representation. The metadata for a Web page, for example, might contain its title, date of creation and an abstract if one is available, or the first paragraph of text if it's not. As new programs are developed that are more intelligent about understanding documents, they can easily be incorporated into the open architecture of GCS.
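One common way to build such a plug-in arrangement is a registry keyed by data type. The following is a minimal sketch under that assumption; the registry, the decorator and the field names are hypothetical, and the output is a plain dictionary rather than real SOIF. The Web-page plug-in follows the rule described above: use the abstract if there is one, otherwise the first paragraph.

```python
SUMMARIZERS = {}  # data type -> plug-in function (hypothetical registry)


def summarizer(data_type):
    """Decorator that plugs a function in as the handler for one data type."""
    def register(func):
        SUMMARIZERS[data_type] = func
        return func
    return register


@summarizer("web")
def summarize_web(doc: dict) -> dict:
    # Use the abstract if one is available, else the first paragraph of text.
    paragraphs = [p.strip() for p in doc.get("text", "").split("\n\n") if p.strip()]
    description = doc.get("abstract") or (paragraphs[0] if paragraphs else "")
    return {"title": doc.get("title", ""),
            "created": doc.get("created", ""),
            "description": description}


def summarize(data_type: str, doc: dict) -> dict:
    """Dispatch to the plug-in for this data type and tag the result."""
    plugin = SUMMARIZERS.get(data_type)
    if plugin is None:
        raise KeyError(f"no summarizer plugged in for {data_type!r}")
    summary = plugin(doc)
    summary["type"] = data_type
    return summary
```

Supporting a new data type then means registering one more function; the dispatch code does not change.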

Regardless of the data type, all SOIF summaries look the same. That makes them easy to collect, classify and search. A Web server associated with each Gatherer makes the SOIFs available to a central component called the Collector. From the SOIFs, the Collector creates a database that is essentially a map of the digital universe (see below). The Collector also makes sure that the Gatherers do not step on each other's toes. For example, when the Gatherer looking for information in North America comes across a link to Japan, it informs the Collector, which passes this information on to the Japan Gatherer. Gatherers are initially assigned by a GCS administrator to specific domains in the digital universe, but over time they may migrate dynamically to distribute the overall load of the system.
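The Collector's traffic-directing role might look something like the sketch below. It assumes, purely for illustration, that each Gatherer's territory can be described by a host-name suffix; the `Collector` class, its methods and the suffix mapping are hypothetical and much cruder than whatever assignment scheme GCS actually uses.

```python
from urllib.parse import urlparse


class Collector:
    def __init__(self, assignments):
        # assignments maps a host-name suffix (e.g. ".jp") to a Gatherer name.
        self.assignments = assignments
        self.index = {}  # url -> summary: the shared "map"
        self.queues = {name: [] for name in assignments.values()}

    def owner(self, url):
        """Return the Gatherer responsible for this URL, or None."""
        host = urlparse(url).hostname or ""
        for suffix, name in self.assignments.items():
            if host.endswith(suffix):
                return name
        return None

    def add_summary(self, summary):
        """Store a summary delivered by a Gatherer."""
        self.index[summary["url"]] = summary

    def report_link(self, from_gatherer, url):
        """Hand a discovered link to the Gatherer that owns its domain."""
        name = self.owner(url)
        if name is not None and name != from_gatherer:
            self.queues[name].append(url)
        return name
```

For instance, a Gatherer named `north_america` that reports a link on a `.jp` host would see it queued for the Gatherer registered under `.jp` instead of crawling it itself.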

Making Matches

The Gatherers and the Collector make up the GCS search engine. The real power of GCS, however, lies in its ability to match this information to the interests and needs of users. A program known as the Profile Engine carries out that task. Starting with the user's queries, it constructs information profiles that it constantly matches against the incoming wave of information. As relevant new bits are uncovered, it distributes them to Administration Servers that deliver them to the client's desktop machine or PDA. As Lu points out, "It was quite a task building an engine that could match 10,000 user profiles with a million new data items."
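The article does not say how the Profile Engine does its matching. One standard way to avoid comparing every profile with every item is to index profiles by the terms they contain, so that each incoming item is checked only against profiles sharing a word with it. The sketch below shows that idea with simple keyword profiles; the class and its methods are hypothetical, not the GCS design.

```python
from collections import defaultdict


class ProfileEngine:
    def __init__(self):
        self.profiles = {}               # user -> set of required terms
        self.by_term = defaultdict(set)  # term -> users whose profile lists it

    def add_profile(self, user, terms):
        """Register a user's interests as a set of lower-cased keywords."""
        terms = {t.lower() for t in terms}
        self.profiles[user] = terms
        for term in terms:
            self.by_term[term].add(user)

    def match(self, text):
        """Return users whose every profile term occurs in the text."""
        words = set(text.lower().split())
        # Only profiles sharing at least one word with the item are examined.
        candidates = set()
        for word in words:
            candidates |= self.by_term.get(word, set())
        return sorted(u for u in candidates if self.profiles[u] <= words)
```

With this layout the cost of handling one new item depends on how many profiles mention its words, not on the total number of profiles.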

At present, Ford admits, the client side of GCS is "a grab bag of ideas." The basic interface looks like a television remote control, since this fits naturally with the metaphor of surfing channels of information. Commercially available systems like PointCast already push channels of information to a user's desktop using the free PointCast browser (available at http://www.pointcast.com). However, those channels are predefined, broad and unfiltered. PointCast has a sports channel, for example, but it doesn't have a Chicago Bulls channel and still less one devoted to Michael Jordan. GCS users can create channels that are exactly as narrow or as broad as they like.

As the user switches from channel to channel, the information scrolls by in "tickers," just like the ticker tapes on Wall Street. The group is exploring means of programming tickers to alert users to important information - stock market crashes or traffic jams, for instance - as it comes in. Users can also specify that some important or useful information be pushed to their PDA or other device.

The quality of the information delivered by GCS will improve with use. This advance stems from a concept known as a Relevance Tracker. Like all search engines, the GCS inevitably delivers a lot of information unrelated to the initial query. Technology being developed by Almaden's Rob Barrett will someday permit GCS to analyze information that the user accepts and rejects, to refine queries and cut down on the irrelevant hits.
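How the Relevance Tracker will work is not described, so the following is only a stand-in: a very simple term-weighting scheme in which words from accepted items gain weight and words from rejected items lose it. The class name is borrowed from the text, but the methods and the scoring rule are assumptions for illustration, not the technology under development at Almaden.

```python
from collections import Counter


class RelevanceTracker:
    def __init__(self, step=1.0):
        self.weights = Counter()  # term -> learned weight
        self.step = step

    def feedback(self, text, accepted):
        """Raise the weight of terms in accepted items, lower it for rejected ones."""
        delta = self.step if accepted else -self.step
        for term in set(text.lower().split()):
            self.weights[term] += delta

    def score(self, text):
        """Sum the learned weights over the distinct terms in the text."""
        return sum(self.weights[t] for t in set(text.lower().split()))

    def rank(self, texts):
        """Order candidate items from most to least likely to be relevant."""
        return sorted(texts, key=self.score, reverse=True)
```

In the "rainbow" example from earlier, a user who accepts a weather item and rejects a bridge item would see later bridge stories ranked below weather stories, while "rainbow" itself, present in both, ends up neutral.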

Early Adopters

Such refinements lie in the future. The GCS project is less than a year old - though in that time, Ford notes, the Internet has doubled in size. The team has filed about 10 invention disclosures and has produced about 200,000 lines of Java code. A few IBM researchers are already using prototypes of the technology to help with their work. IBM employees and consultants will probably use early versions of GCS, but the Almaden group is looking at applications in the wider world. Pass has talked to representatives of about eight industries who have expressed interest.

However rapidly GCS takes shape, it will not be finished too quickly for Ford. "The world is blossoming with information you cannot keep up with," he says. "In my own life I find out about things that have been going on for six months. They're new to me, but everybody else seems to know about them. What we're trying to do is eliminate that." The way to do so may be to board the information superhighway at Grand Central Station.


Bruce Schechter is a freelance writer based in Los Angeles. He is writing a book on the life and times of Paul Erdös.


More Information:

Queries on the Web: Do You Pull or Push?

Gathering in the Harvest


Queries on the Web: Do You Pull or Push?

It's an informal law of the World Wide Web: the way in which users formulate a query for information from the Web influences the answers they receive. The two basic means of formulation are known as Push and Pull. "Push is only having to ask once for information that is important to you," explains Daniel Ford, a scientist at IBM's Almaden Research Center. "The nature of a query in Push is usually quite broad." That inevitably means that the query elicits a vast number of results, forcing the questioner to spend time sifting through piles of irrelevant information in quest of the required answer. "The opposite is true in Pull," adds Ford. "There, you are usually looking for something quite specific to match your needs of the moment." The danger is that the query may be too specific to pull in the precise information required. Almaden's Grand Central Station technology is designed ultimately to allow users to pull and push information on the Web. Its great advantage is its ability to tailor its searches to the requirements of individual users.


Gathering in the Harvest

Most Web crawlers are fairly unintelligent pieces of software. They find a Web site, suck up all the information it contains and then move on to the next site. With dozens of crawlers roaming the Internet, visiting and revisiting sites, site managers complain that their servers are spending too much time providing crawlers with information - time that they would prefer to devote to their users.

Harvest, a program developed by computer scientists from the University of Colorado at Boulder, aims to solve that problem by separating the wheat from the chaff of the Internet. Using Harvest, a Web site can automatically produce a concise representation of the information on its site. This informational snapshot is then shipped to interested crawlers, avoiding congestion on the server and on the Internet. The Grand Central Station project at IBM's Almaden Research Center draws on Harvest in two ways: its team includes Udi Manber, a member of the group that developed Harvest, and it adopts the Harvest concept of an automatically generated information snapshot, known as metadata. The Summary Object Interchange Format (SOIF) that GCS uses to form its map of the digital universe was first used in Harvest. SOIFs are summaries of the information that the GCS Gatherer finds on its trips across the network.

The program uses very little "intelligence" in creating its summaries. It simply extracts such details as title, author's name, data type, and if one is available, the abstract. In the case of text files, all the text is included. GCS provides its users with SOIFs of information that it thinks might be of interest. If the user wants to see the entire document, the SOIF contains a pointer to the original information.



