A tangled web of business information
A small U.S. company is surprised to find that it's suddenly getting media coverage in a different part of the world. A multinational company wants an early warning system that can monitor and quickly identify potential public relations issues as they arise. A record company wants to allocate its marketing funds effectively, but is having a hard time gauging which of its artists is generating the most buzz among teens.
People know that the Web can provide information vital to their businesses, but many don't have the time or expertise to find it. In today's speed-of-light, information-saturated business climate, accessing the appropriate knowledge can mean the difference between survival and failure. Most companies could benefit from the ability to gather every piece of relevant data and quickly tease out the significant facts, relationships, context and overall meaning. But is this possible given the infinite tangle of information available on the Web?
IBM's Almaden Research Center in San Jose is making this seemingly unreachable goal a reality. Chief scientist Andrew Tomkins and his colleagues are developing a technology called WebFountain that mines Web data and transforms it into valuable information.
Far more powerful than search engines, which present a list of Web pages that contain certain queried words or phrases, WebFountain collects information from the Web on a continuous basis and organizes it into a coherent format, allowing sophisticated users to run complex queries against specific, categorized information.
This task requires serious computing muscle: 50 machines dedicated to crawling text from the Web (including blogs, chats, newspapers, enterprise data and trade journals, along with licensed content and proprietary data) feed five million entities per day into more than 500 processors that manage half a petabyte of data. And there's still room to grow.
There's no doubt that the extra room will be needed. According to a study by the University of California at Berkeley's School of Information Management and Systems, the world's population generated five exabytes of information in 2002. That's the equivalent of a half-million new libraries the size of the Library of Congress -- and more than 90 percent of that information is digital. "With respect to a specific user request, there might be tiny little pieces of data buried here and there all through personal home pages, Weblogs, bulletin boards and so forth," says Tomkins. "We find what's important and add context to the content."
Business Needs
IBM is currently beta-testing WebFountain with several clients to show how the technology can expose untapped market opportunities and identify competitive threats and other business risks. For one of these clients -- a large recording company -- WebFountain is monitoring the buzz about bands on high school Web sites around the world.
"We found that this low-level buzz around an artist was a leading indicator of the movement of that artist on the Billboard 200 [Billboard magazine's listing of the top 200 pop music albums]," says Bob Carlson, the IBM vice president who heads WebFountain marketing. "And that allows the marketing people to make better decisions on the investment of their resources."
Companies in the computer, financial services, pharmaceutical and distribution fields are also testing WebFountain. The system is tailored to each company's specific needs in terms of products, customers and, crucially, geography.
WebFountain's multilingual capability is instrumental in allowing it to customize information geographically and offer clients the ability to tap into information in non-English-speaking resources, such as chat rooms and bulletin boards. Thanks to researchers at IBM's Watson Research Center in Yorktown Heights, N.Y., WebFountain can read five languages -- English, Chinese, Arabic, French and Spanish -- and is in the process of expanding its language understanding to include Russian and Portuguese.
Interacting with WebFountain
Clients can work with WebFountain through software applications that serve as interfaces between the user and the computer, much as word processing and spreadsheet programs do. These applications facilitate an interactive process in which the user can make decisions based on the information provided by WebFountain's instantaneous access to tremendous amounts of data.
For example, the user may notice a blip in a tracked trend -- an increase in album sales for a new musician, for instance -- and want to know what's causing it. WebFountain might respond that it has found increased coverage of the musician in Western European newspapers. The user may then ask the system to determine if there were any announcements or events that may have contributed to that increased exposure. Or, the user may choose to ask WebFountain about the terms used to describe the musician and if those same terms are applicable to competing artists.
"The system provides a way to navigate through the landscape and understand what's going on at a level that is deeper than just viewing a Web page," explains Tomkins.
Patterns on the Web
In 1999, three Almaden-based IBM researchers specializing in the growth of the Web set out to build a system that would monitor the Web continuously and maintain a repository of current data. They believed that this repository would enable them to study patterns and relationships hidden in that information. IBM granted funding for this substantial initiative with the condition that the researchers make the results available to the thousands of IBM researchers around the world.
"From there, we started thinking of it not so much as a collection of raw data, but as a platform that would let people build on one another's work," says Tomkins.
"Many people do large-scale data analysis," he explains. "Yet, for the most part, they spend 90 percent of their time dealing with terabytes of data and 10 percent of their time actually doing the analyses. We wanted to do the 90 percent of the work just once, and allow people to build on what we produced."
They soon realized that the ability to run analyses on the huge amount of data available on the Web would be extremely valuable to IBM's clients as well as to its researchers.
With that in mind, IBM has incorporated WebFountain technology into its On Demand Innovation Services and is partnering with industry leaders, such as Factiva, to provide advanced applications powered by WebFountain. Companies that don't want to invest in the massive amounts of hardware necessary to run WebFountain may instead choose to pay a periodic fee to access the system.
Future capabilities might include mining other rich media, like voice and video, or allowing individual consumers to access the service, as with search engines today. But for now, the WebFountain team is focused on showing the value that corporations can find in mountains of disparate data. "It is truly a new entrant into the world of e-business on demand," says Carlson.