IBM Research

Think Research


 


COVER STORY: Digital Libraries: Libraries without Roofs

By Stephen S. Hall

IBM has played a major role in assembling digital libraries - vast databases of text, images, video and audio - and adapting them to serve academic, research and commercial needs.


In Brief:

Announced in March 1995, IBM's Digital Library project incorporates technologies from several Research groups. The project aims to provide digitized text, graphics, video and audio data online to a wide range of clients in the academic, research and commercial sectors. Research teams are now adapting their technologies to specific customers' needs.


In antiquity, the greatest repository of knowledge in the civilized world was the famed library of Alexandria. At the end of the 4th century, when it was destroyed, the library contained virtually everything known at the time. Over the ensuing 16 centuries, the amount of information held in libraries and other repositories has increased by many orders of magnitude. However, thanks to several complementary technologies that fall under the umbrella term "digital libraries," something resembling an electronic Alexandria - or, more properly, many specialized Alexandrias - is beginning to take shape.

For several years, researchers at IBM have worked closely with customers to design what they call Early-to-Market Digital Library (EtM-DL) systems, which make accessible vast stores of digitized information. Users can access everything from yesterday's newspaper and last year's scientific journals to rare medieval manuscripts and 20 years of United States patents. "The vision of Digital Library is one in which we'll allow all this information to be captured, digitized and made broadly available," says Willy Chiu of the Santa Teresa Laboratory, who directs IBM's Digital Library efforts.

What are digital libraries?

Just as conventional libraries are more than simple collections of books, digital libraries are more than collections of digitized data. "Digital libraries," explains Hamed Ellozy, senior manager for digital libraries at the Thomas J. Watson Research Center, "provide a sophisticated set of tools to capture information and to find and access it on one's desktop. That accessibility will make a fundamental change in the way people deal with information."

IBM announced its entry into the Digital Library arena in March 1995. "All the key technologies were in place and had evolved out of IBM Research projects," says Jon Prial, manager of IBM Digital Library market development. "They included the ability to do searching of text and images; and, in the future, will include video searching. Research provided a robust architecture for managing all this information and tools for protecting intellectual property rights." (See "Visible and Invisible Fingerprinting.")

IBM's Digital Library is a wide-ranging effort because systems must match and marry a daunting array of technologies. Like a traditional library, it needs a core collection of information, along with ways to acquire, access and update it (creating or capturing content in digitized form); a sensibly organized repository for all this collected information (storage); a way to index the entire inventory, so that users can quickly find what they seek (search and access); a means of delivering the document to everyone "visiting" the library electronically (distribution and user interfaces); and a security system that protects the owners of the information from copyright infringement, assures the integrity of the information (rights management) and allows owners to profit from their content.

Many of those technologies emerged from special projects that the Research division undertook to solve customers' unique problems, even before the Digital Library announcement. Perhaps the earliest in-house pioneer was software architect Henry Gladney of the Almaden Research Center, who began investigating the potential of digital libraries in 1986. One possibility, he says, was "handling the paper involved in the central operations of bureaucracies, such as tax agencies."

By 1989, Gladney was working with state officials in California to design a prototype called Integrated Records Management (IRM). It incorporated a Gladney-designed storage component called Document Storage Subsystem (DocSS) to handle commercial tax accounts. IRM became a prototype digital library, widely used by state and local governments, and DocSS has come to form part of VisualInfo(TM), which constitutes one cornerstone of current Digital Library technology. Typifying Research's approach to digital libraries, customers have played a major role in defining the requirements for, as well as creating, the components of a digital library. According to Robert Morris, senior manager of data systems technology at Almaden, one serendipitous result of this work has been the creation of a new model by which we conduct research in the marketplace.

Technology for digital capture

The first step in creating a digital library is capturing the data digitally - especially if the original form is analog, such as printed text or a work of art. In 1987, an IBM team in Madrid, Spain, helped the Spanish government digitize the fragile papers that chronicled Columbus's voyages to the New World.

One year later, the staff of artist Andrew Wyeth asked IBM if researchers could design a system to digitize and store images of the artist's paintings. A team at Watson developed a digital camera and imaging software. Then, working from color transparencies, they created a private digital library of approximately 10,000 Wyeth works, now known as the Brandywine project (see "Capture Technology: The Digital Camera," below). That camera technology played an essential role in a more ambitious, ongoing project with the Vatican Library that began in 1993. As a first phase, researchers scanned more than 20,000 manuscript pages into the Vatican's digital archive. Currently, 10 select scholars throughout the world can examine the manuscripts from desktop computers, and a sampling of images is available on the IBM home page. "The broader goal," says Watson's Fred Mintzer, "is to make these materials available to a worldwide scholarly community. The Internet is a natural access medium for such a collection."

A university solution

At the far extreme from the 10 Vatican scholars, in needs as well as numbers, are students who might use a digital library in an educational setting. Since 1994, a team of researchers from Almaden and Watson, headed by Francis Parr of Watson, has worked with the Florida Center for Library Automation to design a digital library system to serve the entire University of Florida academic community. The heart of the system is VisualInfo, adapted to distribute information on an IBM RISC System/6000® Scalable POWERparallel System® SP2(TM) supercomputer.

A major challenge was to design a universal gateway to the VisualInfo system - that is, an architecture that can accommodate queries from 200,000 students on 11 different campuses using every conceivable personal computer. The team constructed an interface based on the World Wide Web, which IBM presented to the university regents in September 1995.

Parr predicts that the university's digital library will contain several hundred gigabytes of content by midyear, when limited supervised access at library workstations is due to start. The Watson group, meanwhile, has developed and is incorporating an advanced Web interface, known as Portal, into the digital library. The Portal interface will permit easy access, via text and image searches, to documents stored in VisualInfo.

Commercial applications

Meanwhile, a multidepartmental team at Almaden, led by Morris, Jim McCrossin and Norm Pass, started a content distribution project originally called Data Factory, and now a part of Digital Library. The effort was based on an idea by Stephen Boyer, a member of the PC Server Group who is based at Almaden, and carried out with support from the IBM PC Company. In July 1994, the team began to design an EtM-DL system for the Philadelphia-based Institute for Scientific Information (ISI), which publishes indexes and abstracts of articles for scientific, technical and medical literature. Almaden's EtM system delivers an electronic version of ISI's weekly Current Contents for the life sciences that includes an online search capability and access to complete articles.

"Current Contents is the bibliographic data, or metadata, for 1,350 journals in the life sciences," explains project manager Laura Anderson. "Libraries subscribe to a subset of those journals, and get the scanned journal page images online, as well." Indeed, a key to the design is that a user can search the metadata and retrieve desired articles for browsing and printing.

The Almaden team developed a scalable, multitiered distribution architecture for ISI, with a main server located at ISI and "campus servers" at pilot locations that include universities, public libraries and industrial laboratories. Lotus Notes® was used to support the campus servers and clients, and Dick Dievendorff led the way for ISI programmers to build a Lotus Notes application for Current Contents.

The core of the system was an extensive DB2(TM) bibliographic database designed by David Choy, who also enabled ADSM (ADSTAR® Distributed Storage Manager) to work coherently with DB2. ADSM can store a large amount of data in such a way that more frequently used information can be accessed at higher speeds.

Picking up patents online

In another project whose origins preceded any formal digital library effort, Almaden has developed a system for making 20 years of patent information, including illustrations, available online for searching, viewing, printing and delivery through a Web browser, VM(TM) or Lotus Notes, at all major IBM sites. Working with the British firm Derwent Patent Information Co., the Almaden group has made accessible more than 1.7 million U.S. patents, dating back to 1975.

The idea arose out of work by Boyer in the late 1980s. "It started with an executive at Novo Nordisk, the largest drug company in Denmark, who expressed the desire for the ability to mount a patent search from his desktop," says Boyer.

A key hurdle in the electronic delivery of patents is the need to use the scanned image form, which contains 100 times more bits than the text-coded form. Researchers at Almaden - notably Ray Holland, Mark Jackson and Kin Wong - worked out an image technology for swift delivery from a library of more than 2,000 CD-ROMs. And Tom Griffin has developed technology that makes images compatible across a vast variety of endstations. "Using this technology," says McCrossin, manager of the patent effort at Almaden, "we can dynamically optimize the image format without having to pretranslate all the images."

The impact of indexing

One little-appreciated aspect of digital library search technology is the creation of metadata - indexing of the data being stored. Metadata allows people to search the content of stored data rather than, for example, titles alone, as in a traditional library. Search engines (see "Simplifying Search Engines") must be able not only to search, but also to index new data as soon as it is entered. Commercial search engines can index 100 to 200 megabytes of text data per hour. A team at Watson, led by Alan Marwick, has developed a prototype that uses pattern-matching of text and statistical techniques to index up to a gigabyte of data hourly.
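As a rough illustration of what indexing content (rather than titles) involves, the sketch below builds a tiny inverted index that makes each document searchable as soon as it is added. It is not the Watson prototype, whose pattern-matching and statistical techniques the article does not detail; the class name, method names and sample documents are all hypothetical.

```python
import re
from collections import defaultdict


class InvertedIndex:
    """Toy full-text index: maps each word to the IDs of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Index the document body itself, not just its title.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            self.postings[word].add(doc_id)

    def search(self, query):
        # Return the IDs of documents that contain every query word.
        words = re.findall(r"[a-z0-9]+", query.lower())
        if not words:
            return set()
        result = set(self.postings.get(words[0], set()))
        for word in words[1:]:
            result &= self.postings.get(word, set())
        return result


index = InvertedIndex()
index.add("patent-1", "A method for compressing scanned page images")
index.add("journal-7", "Compressing protein sequence databases")
hits = index.search("compressing images")  # only "patent-1" contains both words
```

For a sense of scale: at the 100 to 200 megabytes per hour quoted above, indexing one gigabyte (taken as 1,000 megabytes) would take roughly 5 to 10 hours, against about one hour at the prototype's peak rate.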

Functions such as indexing, metadata and video create unique storage demands. "As you go from text to images to audio to video, the amount of storage goes up by an order of magnitude every time," explains Ellozy.

Advanced Solutions

"Digital Library," says Chiu, "has become the key integration point for a host of innovative technologies. In the course of developing our EtM-DL systems, we are pushing the envelope of leading-edge technology, learning about customer requirements and bringing advanced solutions to market. Following a very successful year in 1995, our goal is to continue to lead with innovative industry solutions in 1996."


Stephen S. Hall is a freelance science writer based in Brooklyn, New York. His latest book is entitled Mapping the Next Millennium.


Simplifying Search Engines

Each application of digital libraries depends crucially on the mix of its component technologies. The aim of next-generation search engines, for example, is to give users a set of tools that help them find what they need in a collection without any specialized training. Alan Marwick, who heads a research group at the Thomas J. Watson Research Center that specializes in advanced text search technologies, says that another overall aim is ease of use - ideally pointing and clicking one's way to desired information.

To that end, Watson researchers have developed more interpretative search engines. Many IBM Digital Library projects, for example, use SearchManager(TM), developed by the IBM software development group in Böblingen, Germany. Recently, a module developed at Watson, called Guru, has been incorporated into the package. Conceived by Yoelle Maarek (now at the Haifa Research Laboratory), Guru has moved beyond traditional keyword searches to queries in natural language.

Next-generation search engines also identify variations on the same proper name using Nominator, a program developed by Yael Ravin, Nina Wacholder and Misook Choi at Watson. And they use what Marwick calls a "context thesaurus," developed by Roy Byrd at Watson, to find associations in texts that would be useful to a searcher. Users often have difficulty finding appropriate search words to describe their information needs. Rob Barrett, at the Almaden Research Center, developed the Relevance and Trash (RAT) system to alleviate this problem. With RAT, the computer can suggest ways to improve searches after the user marks a few retrieved documents as "relevant" or "trash."
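The article does not say how RAT turns "relevant" and "trash" marks into suggestions. One simple way such feedback could work, offered here only as an assumed sketch, is to propose words that occur in more relevant documents than trash ones. The function name, scoring rule and sample documents are hypothetical.

```python
import re
from collections import Counter


def suggest_terms(relevant_docs, trash_docs, top_n=3):
    """Suggest query terms that appear in more 'relevant' than 'trash' documents."""

    def doc_frequency(docs):
        counts = Counter()
        for text in docs:
            # Count each word once per document.
            counts.update(set(re.findall(r"[a-z]+", text.lower())))
        return counts

    relevant = doc_frequency(relevant_docs)
    trash = doc_frequency(trash_docs)
    scores = {word: relevant[word] - trash[word] for word in relevant}
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return [w for w in ranked if scores[w] > 0][:top_n]


suggestions = suggest_terms(
    relevant_docs=["jaguar habitat rainforest", "jaguar rainforest prey"],
    trash_docs=["jaguar car dealership", "jaguar car prices"],
)
```

Here the original search word "jaguar" appears equally often in both piles, so it is not suggested; words that separate the animal from the car rank highest.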

Search engines are not limited to text. One pilot digital library developed at Watson pairs a text search engine with QBIC (Query by Image Content), an engine developed at Almaden to search databases for images. And IBM Research is working with DreamWorks SKG, Viacom and several other media companies to develop the capacity for video archive searches.


Capture Technology: The Digital Camera

IBM's Digital Library initiative has consistently shown that customers' needs beget technical innovation. "We started working on a digital camera and high-quality color imaging as the capture technology for the Andrew Wyeth project, and we've continued to enhance it," explains Fred Mintzer, at the Thomas J. Watson Research Center.

The camera uses a CCD sensor chip akin to the sensitive light-gathering devices in telescopes. That chip, designed at Watson and manufactured by IBM, "allows us to capture images with high resolution and high signal-to-noise ratio," says Mintzer. Special color filters used with the digital camera achieve high-quality color reproduction. "The filters were designed to simulate the color response of human vision," says Mintzer. "Consequently, our color reproduction is much better than film or other digital cameras that you see in the marketplace."

A Vatican Library team is using two of the cameras to scan about 160 images per day en route to digitizing additional holdings from the Vatican's collection; about 30,000 images have been digitized thus far. Other cameras are digitizing holdings at the Lutherhalle Wittenberg museum in Germany, and at the National Gallery of Art, the Library of Congress's Federal Theater Project and the Smithsonian Institution, in Washington, D.C.


Digital Library Project At Tokyo Research Laboratory

Researchers at the Tokyo Research Laboratory have been engaged in a digital library joint project with the National Museum of Ethnology since 1986. The aim, the creation of a digital museum, has resulted in several prototypes that demonstrate the feasibility of advanced multimedia technologies, such as color-image database construction, query by image content (QBIC), image retrieval by colors and shapes, video indexing and multimedia museum education. Recently, the team began developing what will be the first global digital museum, which will link together over the Internet some of the foremost institutions in the world.

TRL researchers have also developed a search tool called "Information Outlining," which visualizes information retrieval processes so that users can zero in on the target data. The prototype system contains a year's worth of newspaper articles as a sample database. Once the user enters a keyword, the number of articles containing it, along with the date of publication and part of the headline, is displayed in the upper portion of the screen.

Various kinds of information about the retrieved data are displayed on the lower portion. For example, if the search word is "Internet," the lower portion might show a map of Japan indicating how many articles are relevant to each prefecture. Users can select a given region by clicking on it with a pointing device. In addition to geographical viewers, the prototype offers chronological and topical viewers, which show the time periods and topics covered in the retrieved articles.
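The counts behind such geographical and chronological viewers can be sketched as a keyword match followed by tallies per prefecture and per month. This is only an assumed illustration, not the TRL prototype; the record fields, the sample articles and the `outline` function are hypothetical.

```python
from collections import Counter

# Hypothetical article records; the field names are assumptions.
articles = [
    {"headline": "Internet cafes open in Tokyo", "date": "1995-03-02", "prefecture": "Tokyo"},
    {"headline": "Osaka schools join the Internet", "date": "1995-03-15", "prefecture": "Osaka"},
    {"headline": "Tokyo firms test Internet banking", "date": "1995-07-09", "prefecture": "Tokyo"},
    {"headline": "Rice harvest forecast", "date": "1995-07-20", "prefecture": "Niigata"},
]


def outline(articles, keyword):
    """Return matching articles plus per-prefecture and per-month hit counts."""
    hits = [a for a in articles if keyword.lower() in a["headline"].lower()]
    by_prefecture = Counter(a["prefecture"] for a in hits)  # data for a map view
    by_month = Counter(a["date"][:7] for a in hits)         # data for a timeline view
    return hits, by_prefecture, by_month


hits, by_prefecture, by_month = outline(articles, "Internet")
```

Selecting a region on the map would then amount to filtering `hits` by the chosen prefecture.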


Visible and Invisible Fingerprinting

A key component of almost any digital library system is what is known as "rights management." This protects the owners of material from copyright infringement. "Security is essential to people entrusting their documents to a library," says Henry Gladney at the Almaden Research Center. "No security, no digital library."

For the original Vatican Library project, Fred Mintzer's group at the Thomas J. Watson Research Center developed the "visible-image watermark." This technology alters the intensity of thousands of pixels in each image, leaving an obvious, but nonobtrusive, seal of ownership on rare documents (in this case, the seal of the Vatican Library). "We believe that, to remove it, one would have to use a pixel-level editor on thousands of pixels - a very time-consuming task at best," says Mintzer. "Although this is not a total protection, it is strong encouragement to do the right thing."
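The idea of altering pixel intensities in the shape of a seal can be sketched as follows. The article does not give the actual algorithm, so this is an assumption-laden toy: the grayscale grid, the 0/1 mask format, the `strength` parameter and the function name are all hypothetical.

```python
def apply_visible_watermark(image, mask, strength=0.25):
    """Brighten the pixels selected by `mask` so a seal shows over the image.

    `image` is a grid of grayscale values (0-255); `mask` is a same-sized grid
    of 0/1 flags tracing the seal. Both formats are assumptions for this sketch.
    """
    marked = []
    for image_row, mask_row in zip(image, mask):
        row = []
        for pixel, in_seal in zip(image_row, mask_row):
            if in_seal:
                # Move the pixel part of the way toward white.
                pixel = round(pixel + (255 - pixel) * strength)
            row.append(pixel)
        marked.append(row)
    return marked


image = [[100, 100, 100], [100, 100, 100], [100, 100, 100]]
seal = [[0, 1, 0], [1, 1, 1], [0, 1, 0]]  # a tiny cross-shaped "seal"
marked = apply_visible_watermark(image, seal)
```

Because every pixel under the seal is changed, removing the mark means repairing each of those pixels, which is the deterrent Mintzer describes.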

Jeffrey Lotspiech and Cynthia Dwork, researchers at Almaden, spearheaded projects that spawned a number of technical innovations in rights management, including the use of visible and invisible "fingerprinting" to mark text in the ISI's Current Contents project. The visible fingerprint involves superimposing a two-dimensional code, similar to a supermarket bar code, on the bottom of each page of text printed out by a user. The 1,000-bit fingerprint encodes the date, time, identity of the user, and other data. If illegal copies are distributed, they can be traced back to the source.
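To make the idea of a fixed-size fingerprint concrete, the sketch below packs a user identity and a date and time into 1,000 bits and reads them back. The article gives only the size and the kinds of data encoded; the field layout, separator, padding and function names here are assumptions, and rendering the bits as a printed two-dimensional code is left out.

```python
FINGERPRINT_BITS = 1000  # size quoted in the article; the layout below is assumed


def encode_fingerprint(user_id, timestamp):
    """Pack a printout record into a fixed-length list of 0/1 bits."""
    payload = f"{timestamp}|{user_id}".encode("utf-8")
    bits = [(byte >> shift) & 1 for byte in payload for shift in range(7, -1, -1)]
    if len(bits) > FINGERPRINT_BITS:
        raise ValueError("record does not fit in the fingerprint")
    # Pad with zeros up to the fixed fingerprint size.
    return bits + [0] * (FINGERPRINT_BITS - len(bits))


def decode_fingerprint(bits):
    """Recover (user_id, timestamp) from a fingerprint produced above."""
    data = bytearray()
    for i in range(0, len(bits) - 7, 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit
        data.append(byte)
    text = bytes(data).rstrip(b"\x00").decode("utf-8")
    timestamp, user_id = text.split("|", 1)
    return user_id, timestamp


bits = encode_fingerprint("reader-0042", "1996-02-14T09:30")
user_id, timestamp = decode_fingerprint(bits)
```

A real printed code would presumably also need error correction to survive photocopying, which this sketch omits.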

An invisible fingerprint can be buried in the data, Lotspiech says, by flipping six bits within a digitized image. "None of these things is perfect," Lotspiech admits. "But you would have to take overt action to defeat this method of tracking."
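One way flipping a handful of bits could identify a recipient is sketched below: six pixel positions are derived from the user's ID, and the lowest bit of each is flipped. The article says only that six bits are flipped; the position scheme, the use of SHA-256, the reliance on an unmarked original for checking, and all names here are assumptions.

```python
import hashlib


def fingerprint_positions(user_id, num_pixels, count=6):
    """Derive `count` distinct pixel positions from the user ID (assumed scheme)."""
    positions = []
    counter = 0
    while len(positions) < count:
        digest = hashlib.sha256(f"{user_id}:{counter}".encode("utf-8")).digest()
        position = int.from_bytes(digest[:4], "big") % num_pixels
        if position not in positions:
            positions.append(position)
        counter += 1
    return positions


def embed_fingerprint(pixels, user_id):
    """Return a copy of `pixels` with the lowest bit flipped at six positions."""
    marked = list(pixels)
    for position in fingerprint_positions(user_id, len(pixels)):
        marked[position] ^= 1
    return marked


def matches_user(original, suspect, user_id):
    """Check whether the flipped positions in `suspect` are this user's."""
    flipped = [i for i, (a, b) in enumerate(zip(original, suspect)) if a != b]
    return sorted(flipped) == sorted(fingerprint_positions(user_id, len(original)))


original = [128] * 10_000  # stand-in for a scanned page image
copy_for_alice = embed_fingerprint(original, "alice")
is_alice = matches_user(original, copy_for_alice, "alice")
```

Each changed pixel differs from the original by a single intensity level, so the mark is not visible, but any edit or recompression that disturbs those pixels would defeat this naive check.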

In a different approach, Marc Kaplan at Watson has developed an encryption package known as Cryptolope, which basically contracts a user to honor copyrighted material and not copy it, before activating the software that decrypts the requested information. "Cryptolope solves the problem of distributing copyrighted information on the Web," says Jeffrey Crigler, vice president, infoMarket, who cooked up the idea with Kaplan.

InfoMarket, a recently announced IBM service, allows content providers to package their data in a secure container, Cryptolope, for sale on the Web.


