IBM Research

Think Research


 


Featured Concept
Labeling the Web

By Marguerite Holloway

In Brief:

Internet users can now specify the kind of information they wish to receive via the Net by using a widely accepted technology for attaching descriptive labels to Web pages. The technology has been adopted as an industry standard known as the Platform for Internet Content Selection (PICS), which stipulates how the labels are to be written and interpreted and how a user's preferences are to be expressed. IBM researchers are adapting the technology to make Web searches much more selective and effective than they are now.



We have all heard of the perils of information overload. In contrast with "Internet addiction," information overload wreaks more subtle effects, creating a sense of pervasive angst or apathy in this frenetic Cyber Age. As more and more information becomes available in ever more intricate forms on the Net, the problem promises only to become worse, particularly as it is so difficult to distinguish good, reliable information from the false or outdated variety.

Which is why researchers at IBM and elsewhere are working to develop programs and systems that can provide or interpret what is called metadata: data about the data. Just as you turn to a Zagat Survey(TM) guide to select a restaurant or to Consumer Reports to choose a car, you will be able to find guidance as you sift through sites on the World Wide Web. "Today," says Brent Hailpern, manager of Internet technology, "we have file servers and print servers, but within a few years we'll have reputation servers."

"The big picture is that if information is well described and people's interests, preferences and credentials are well described, then it is more likely that we will be able to give people less information, but information that is more appropriate and reliable," explains Robert J. Schloss of IBM's Thomas J. Watson Research Center. Schloss adds that such metadata will extend well beyond verifying authenticity. For example, it could inform a blind visitor at a certain Web site that the content therein is entirely graphic, but that a text-only version, which can be read with a screen reader, can be found at another site.

Metadata is employed in a technology called filtering, which allows users to set up parameters about the information that they want to receive or see. The recent implementations of filtering were born of the cyberporn controversy. Parental concerns about their children's access to nudity on the Net fueled a media blitz, which culminated in a cover story in Time in July 1995. Worried that the U.S. federal government would seek to regulate the Internet, researchers from IBM, Microsoft, Netscape and many other companies quickly began to collaborate with the World Wide Web Consortium (W3C), a standards organization based at the Massachusetts Institute of Technology; Institut National de Recherche en Informatique et en Automatique, in France; and Keio University, in Japan. The fruit of their effort was the Platform for Internet Content Selection, or PICS.

A Question Of Access

The idea behind PICS is straightforward. Web pages receive a label or tag that rates one or more attributes on a numeric scale. For example, the Recreational Software Advisory Council (RSAC) has devised a set of criteria for ranking the level of violence, nudity, offensive language or sex. Each attribute is rated on a scale from 0 (none) to 4 (a lot). PICS defines a universal notation that allows groups like RSAC to describe who they are and what content attributes they rate so that a PICS filter knows how to match the vocabulary of the label to the user's stated preferences. Users can also be told what they need to know to frame their preferences. In the case of RSAC, owners rate their own Web sites, but RSAC conducts random audits. It heavily fines owners of sites that it finds to be falsely labeled. Other PICS rating services, such as Canada's Net Shepherd Inc., use independent reviewers to assign ratings and store the ratings on their own servers.
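The matching step can be sketched in a few lines of Python. This is a minimal illustration, not the PICS notation itself: the attribute names, the dictionary layout, the `is_allowed` function and the decision to block unlabeled pages by default are all assumptions made for the example.

```python
def is_allowed(label, max_allowed, block_unlabeled=True):
    """Return True if every rated attribute is within the user's limit.

    label       -- dict of attribute -> rating (0 to 4), or None if unlabeled
    max_allowed -- dict of attribute -> highest rating the user accepts
    """
    if label is None:
        # Assumption: the user decides what happens to unlabeled pages.
        return not block_unlabeled
    for attribute, limit in max_allowed.items():
        # An attribute missing from the label is treated as the worst case (4).
        if label.get(attribute, 4) > limit:
            return False
    return True


# Hypothetical preferences and labels on the 0 (none) to 4 (a lot) scale.
child_profile = {"violence": 1, "nudity": 0, "sex": 0, "language": 1}
kids_page = {"violence": 0, "nudity": 0, "sex": 0, "language": 0}
news_page = {"violence": 2, "nudity": 0, "sex": 0, "language": 1}

print(is_allowed(kids_page, child_profile))  # True
print(is_allowed(news_page, child_profile))  # False
```

Keeping the preferences separate from the labels is what lets one label serve many users: the same `news_page` rating passes a more permissive profile unchanged.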

PICS labels are transmitted along with content by intelligent servers such as Lotus's Domino Go Webserver(TM) or PointCast Connections, or embedded in HTML for older servers. Browsers that incorporate PICS can be set to filter out content based on the values contained in its PICS labels. Internet Explorer contains this feature, and Netscape Navigator®, WebTV(TM) and other clients will have it soon. Proxy servers, which sit between a user and the Internet, can also do the filtering; IBM's forthcoming Javelin proxy server is one example.
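For the embedded case, a client has to find the label in the page and pull out the ratings. The sketch below assumes the label travels in a `META` element whose `http-equiv` is `PICS-Label`, and that the ratings appear as a parenthesized list of name and value pairs after an `r` keyword. The label string, the rating-service URL, the single-letter category names and the helper names (`MetaLabelFinder`, `parse_ratings`) are illustrative assumptions, and the parsing covers only this simplified form.

```python
import re
from html.parser import HTMLParser

# Assumed form of the ratings clause: r (name value name value ...)
RATINGS_PATTERN = re.compile(r"\br\s*\(([^)]*)\)")


class MetaLabelFinder(HTMLParser):
    """Collect the content of every META element marked as a PICS label."""

    def __init__(self):
        super().__init__()
        self.labels = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() == "pics-label":
            self.labels.append(attributes.get("content") or "")


def parse_ratings(label_text):
    """Turn 'r (n 0 s 0 v 2 l 1)' into {'n': 0, 's': 0, 'v': 2, 'l': 1}."""
    match = RATINGS_PATTERN.search(label_text)
    if match is None:
        return {}
    tokens = match.group(1).split()
    return {name: int(value) for name, value in zip(tokens[0::2], tokens[1::2])}


# Hypothetical page carrying one embedded label.
page = """<html><head>
<META http-equiv="PICS-Label" content='(PICS-1.1 "http://rating-service.example/v1.html" l r (n 0 s 0 v 2 l 1))'>
<title>Example</title></head><body>Hello</body></html>"""

finder = MetaLabelFinder()
finder.feed(page)
ratings = [parse_ratings(text) for text in finder.labels]
print(ratings)  # [{'n': 0, 's': 0, 'v': 2, 'l': 1}]
```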

So far, about 44,000 sites (roughly 2 percent of the World Wide Web) are labeled. Nevertheless, PICS is gaining momentum. In June, the Supreme Court struck down as unconstitutional the 1996 Communications Decency Act, which would have regulated Internet content. In response, the Clinton Administration announced that it supported the industry filtering solution, and would begin to use PICS labels on all federal Web sites.

Schloss, who co-chairs the next-generation PICS working group at W3C, notes that the current applications of the platform are just the beginning. Labeling schemes, obviously, can include any set of information about a site. PICS software was designed to be flexible and to give people the tools that they need to rate sites in whatever way they want.

Schloss sees a range of filters in the future, some of which may enhance what is seen on the Web, and others that may restrict it. For example, employers may set up one kind, to block information from rival companies' recruitment pages, thereby having some control over information in the work environment. In the home, parents may choose to set filters to prevent their children from seeing images they regard as undesirable. Eventually, Schloss says, "we will have a preferences format that you could carry on your smart card. You could plug it into your set-top box in a hotel so you could have the same filtering that you have at home."

PICS For The Future

Schloss, who helped develop the label bureau server protocol - the language that browsers use to contact, for example, Net Shepherd and say, "I'm about to go to this Web address; what do you think of it?" - is working on the next version of PICS. His goal: to enrich the way in which it can assimilate and use labeling information.

In addition, Watson's Amy Katriel has developed software that allows any label bureau organization to create PICS labels easily. IBM recently donated the program, called PICSLE, to W3C for free distribution to members of the consortium.

Once sites or groups of sites are labeled, search engines can use tags to retrieve more specific information. Instead of getting 50,000 hits on your AltaVista(TM) query, you might get only 80; but the metadata descriptor of your choosing would consider all those sites reputable or valuable.
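A rough sketch of that narrowing step follows, assuming the search engine returns a list of URLs and the user's chosen rating source supplies a score for some of them. The function name, the example URLs, the scores and the 0 to 4 scale are all hypothetical.

```python
def filter_hits(hits, ratings, min_score):
    """Keep only the hits that the chosen rating source scores at or above min_score.

    Unrated pages are dropped, on the assumption that the user wants only
    pages the source has actually vouched for.
    """
    kept = []
    for url in hits:
        score = ratings.get(url)
        if score is not None and score >= min_score:
            kept.append(url)
    return kept


# Hypothetical raw search results and ratings from one trusted source.
hits = [
    "http://a.example/",
    "http://b.example/",
    "http://c.example/",
    "http://d.example/",
]
trusted_ratings = {
    "http://a.example/": 4,
    "http://b.example/": 1,
    "http://d.example/": 3,
}

print(filter_hits(hits, trusted_ratings, min_score=3))
# ['http://a.example/', 'http://d.example/']
```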

The complications will arise over who defines the metadata, and how much power is assigned to semi-intelligent computer programs, such as agents, to track things down. As Nicholas Negroponte of M.I.T.'s Media Lab has said, "the bits about the bits are more valuable than the bits." Once micro payments become widely available on the Internet (see "Commerce in Cyberspace," Research, Number 2, 1997), reputation servers could become quite lucrative. And brand names may assume increasing importance. For a small fee, a user could, for instance, ask Zagat Survey to add its rating of a restaurant to the restaurant's Web page or ask Dun and Bradstreet to annotate the Web site of a new customer with their credit information.

The ongoing work on filtering and labels is one piece of a more generalized approach to metadata, which the Web Consortium calls the Resource Description Framework, or RDF. "We are trying to build a framework that the whole world can use for forms of encoding information, such as cataloguing, for 20 years," Schloss explains, pointing out that two decades is an eternity in an industry that changes every few months.

Keeping Up With Changing Content

On the vast Internet, the problem of cataloguing or evaluating dynamic information becomes mind-numbingly complicated. Because the underlying information - such as at news sites or at pages that are built in real time from databases - is changing all the time, even metadata can become quickly stale. "We are fighting an uphill battle because more and more sites are deciding to customize their data on the fly," Schloss notes.

To address this issue, IBM researchers are investigating the use of neural networks to expand the computer analysis of Web content with the aim of simplifying the reviewers' job of rating dynamically changing Web sites. In the near term, says Schloss, "we're looking at extracting information such as prices so that your search engine could automatically list a product offered by different vendors from cheapest to most expensive."
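The price-listing idea can be illustrated with a much simpler stand-in than the neural networks the researchers are investigating. The sketch below uses a regular expression to pull the first dollar amount from each vendor's page text and then sorts the offers; the vendor names, page text and function names are invented for the example.

```python
import re
from decimal import Decimal

# Matches amounts such as $119, $129.95 or $1,099.00 (a deliberately narrow pattern).
PRICE_PATTERN = re.compile(r"\$(\d+(?:,\d{3})*(?:\.\d{2})?)")


def extract_price(text):
    """Return the first dollar amount found in the text, or None."""
    match = PRICE_PATTERN.search(text)
    if match is None:
        return None
    return Decimal(match.group(1).replace(",", ""))


def rank_offers(pages):
    """Return (vendor, price) pairs sorted from cheapest to most expensive."""
    offers = []
    for vendor, text in pages.items():
        price = extract_price(text)
        if price is not None:  # skip pages with no recognizable price
            offers.append((vendor, price))
    return sorted(offers, key=lambda offer: offer[1])


# Hypothetical page text from four vendors.
pages = {
    "vendor-a.example": "Modem, 56K, now only $129.95 while supplies last",
    "vendor-b.example": "56K modem: $1,099.00 (bundled with a laptop)",
    "vendor-c.example": "Call for pricing",
    "vendor-d.example": "Our 56K modem ships today for $119.00",
}

for vendor, price in rank_offers(pages):
    print(vendor, price)
```

A fixed pattern like this breaks as soon as a site changes its layout or currency format, which is the kind of fragility that motivates more adaptive content analysis.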

Validating The Source

Encrypted metadata, called digital signatures, represents one way of verifying whether the version of a document you have has changed in transport, whether it really is by the cited author or whether the label itself is authentic. This capability forms part of the architecture of RDF, so that "if you are sending metadata separate from the content, you can guarantee that it came from who you think it came from and that it has not changed in transmission," says Schloss. One important application of metadata digital signatures could be a code imprimatur attached to a Java applet that says "this has been scanned for viruses." Before a company allows that applet through its firewall, an RDF-based program would check for the tag. If it were lacking, the program would send the applet elsewhere to be checked for viruses.

Ultimately, Schloss envisions a world of informed information, in which some filters and label sources are hidden in the software and others are explicitly chosen by users. Rather than confining the World Wide Web, Schloss sees metadata as freeing people to use electronic material in a more intelligent way: "I believe people are going to end up using metadata much the way they do recommendations and reputations in their lives today - selectively," he says. "For serious decisions, they will probably rely on trusted branded sources, but in other cases, like entertainment, people may choose to relax their filters. Filtering is not going to be a cage; people can turn it on and off."


Marguerite Holloway is a science writer based in New York City.


More Information:

Making PICS Scalable

Underlying the more general questions about the ways in which PICS will be used are challenging issues of implementation. As the World Wide Web expands, the usefulness of PICS will largely depend on the ability to generate and maintain the labels assigned to the various sites and documents found on the Web. A team at the China Research Laboratory (CRL) in Beijing has focused on these problems.

Perhaps the most basic challenge is creating the labels. The CRL team, building on Watson's PICSLE program, has developed a Java applet to help in that process. "It allows users to quickly visit URLs, review the content and assign a label," explains team leader Dong Liu. The ultimate goal, he adds, is to create a dynamic tool that would incorporate data mining, QBIC (see "Querying by Image Content," Research, Number 3, 1996, page 22) and other information extraction technologies, thereby simplifying the task of the human reviewers.

In a typical PICS environment, URL requests from a user's browser are first sent to a proxy server that forwards the requests to the label bureau. The label bureau searches through its files for a label associated with the requested URL and returns information to the proxy, which then either allows or blocks the request. As more and more labels are created, the filing system for managing the labels must be scalable.
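The request flow just described can be modeled in a few lines. This is a sketch of the roles only, not of the actual label bureau protocol: the class names, the in-memory label store, the attribute names and the policy of blocking unlabeled URLs are assumptions for illustration.

```python
class LabelBureau:
    """Holds labels keyed by URL and answers lookups from the proxy."""

    def __init__(self, labels):
        self._labels = dict(labels)

    def lookup(self, url):
        # Returns a dict of attribute -> rating, or None if the URL is unlabeled.
        return self._labels.get(url)


class FilteringProxy:
    """Sits between the browser and the Web and decides on each request."""

    def __init__(self, bureau, max_allowed):
        self.bureau = bureau
        self.max_allowed = max_allowed

    def handle(self, url):
        label = self.bureau.lookup(url)
        if label is None:
            return "blocked"  # assumption: unlabeled pages are refused
        for attribute, limit in self.max_allowed.items():
            # A missing attribute is treated as the worst rating (4).
            if label.get(attribute, 4) > limit:
                return "blocked"
        return "allowed"


# Hypothetical labels and user limits on a 0 to 4 scale.
bureau = LabelBureau({
    "http://kids.example/": {"violence": 0, "language": 0},
    "http://games.example/": {"violence": 3, "language": 1},
})
proxy = FilteringProxy(bureau, max_allowed={"violence": 1, "language": 1})

for url in ("http://kids.example/", "http://games.example/", "http://unknown.example/"):
    print(url, proxy.handle(url))
```

Every request costs one bureau lookup, which is why the speed of the bureau's label store becomes the bottleneck as the number of labels grows.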

At the request of a group at IBM Raleigh, Liu's team developed a scalable solution for IBM Connection Server, recently renamed Domino Go Webserver, which was limited by its use of a flat-file system. "We created an interface," says Liu, "that allows the server to access an IBM DB2® relational database. As a result, the server can now handle tens of thousands of labels." CRL's rating applet, moreover, was designed to work together with the label database.
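To show why a relational store scales better than scanning a flat file, here is a minimal sketch that keeps labels in an indexed table. It uses Python's built-in SQLite module purely as a stand-in for DB2, and the table layout, column names and helper functions are assumptions, not the interface Liu's team built.

```python
import sqlite3


def open_label_store():
    """Create an in-memory label table indexed by (url, attribute)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE labels ("
        " url TEXT NOT NULL,"
        " attribute TEXT NOT NULL,"
        " rating INTEGER NOT NULL,"
        " PRIMARY KEY (url, attribute))"
    )
    return conn


def store_label(conn, url, ratings):
    """Insert or overwrite the ratings for one URL."""
    conn.executemany(
        "INSERT OR REPLACE INTO labels (url, attribute, rating) VALUES (?, ?, ?)",
        [(url, attribute, rating) for attribute, rating in ratings.items()],
    )


def fetch_label(conn, url):
    """Return the ratings for a URL as a dict, or None if it has no label."""
    rows = conn.execute(
        "SELECT attribute, rating FROM labels WHERE url = ?", (url,)
    ).fetchall()
    return dict(rows) if rows else None


conn = open_label_store()
# Load 20,000 hypothetical labels; the primary-key index keeps lookups fast.
for n in range(20000):
    store_label(conn, f"http://site{n}.example/", {"violence": n % 5, "language": 0})
conn.commit()

print(fetch_label(conn, "http://site12345.example/"))  # {'violence': 0, 'language': 0}
print(fetch_label(conn, "http://missing.example/"))    # None
```

With a flat file, each lookup would have to read through the labels until it found a match; the index lets the database go straight to the rows for one URL.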

For more information, see http://www.ibm.com/Stories/1997/03/pics.html and http://www.w3.org/PICS



