IBM has played a major role in assembling digital libraries - vast
databases of text, images, video and audio - and adapting them to serve
academic, research and commercial needs.
In Brief:
Announced in March 1995, IBM's Digital Library project incorporates
technologies from several Research groups. The project aims to provide
digitized text, graphics, video and audio data online to a wide range of
clients in the academic, research and commercial sectors. Research teams are
now adapting their technologies to specific customers' needs.
In antiquity, the greatest repository of knowledge in the civilized world
was the famed library of Alexandria. At the end of the 4th century, when it
was destroyed, the library contained virtually everything known at the time.
Over the ensuing 16 centuries, the amount of information held in libraries
and other repositories has increased by many orders of magnitude. However,
thanks to several complementary technologies that fall under the umbrella
term "digital libraries," something resembling an electronic Alexandria -
or, more properly, many specialized Alexandrias - is beginning to take
shape.
For several years, researchers at IBM have worked closely with customers to
design what they call Early-to-Market Digital Library (EtM-DL) systems,
which make accessible vast stores of digitized information. Users can
access everything from yesterday's newspaper and last year's scientific
journals to rare medieval manuscripts and 20 years of United States
patents. "The vision of Digital Library is one in which we'll allow all
this information to be captured, digitized and made broadly available,"
says Willy Chiu of the Santa Teresa Laboratory, who directs IBM's Digital
Library efforts.
What are digital libraries?
Just as conventional libraries are more than simple collections of books,
digital libraries are more than collections of digitized data. "Digital
libraries," explains Hamed Ellozy, senior manager for digital libraries at
the Thomas J. Watson Research Center, "provide a sophisticated set of tools
to capture information and to find and access it on one's desktop. That
accessibility will make a fundamental change in the way people deal with
information."
IBM announced its entry into the Digital Library arena in March 1995. "All
the key technologies were in place and had evolved out of IBM Research
projects," says Jon Prial, manager of IBM Digital Library market
development. "They included the ability to do searching of text and images;
and, in the future, will include video searching. Research provided a
robust architecture for managing all this information and tools for
protecting intellectual property rights." (See "Visible and Invisible
Fingerprinting.")
IBM's Digital Library is a wide-ranging effort because systems must match
and marry a daunting array of technologies. Like a traditional library, it
needs a core collection of information, along with ways to acquire, access
and update it (creating or capturing content in digitized form); a sensibly
organized repository for all this collected information (storage); a way to
index the entire inventory, so that users can quickly find what they seek
(search and access); a means of delivering the document to everyone
"visiting" the library electronically (distribution and user interfaces);
and a security system that protects the owners of the information from
copyright infringement, assures the integrity of the information (rights
management) and allows owners to profit from their content.
Many of those technologies emerged from special projects that the Research
division undertook to solve customers' unique problems, even before the
Digital Library announcement. Perhaps the earliest in-house pioneer was software
architect Henry Gladney of the Almaden Research Center, who began
investigating the potential of digital libraries in 1986. One possibility,
he says, was "handling the paper involved in the central operations of
bureaucracies, such as tax agencies."
By 1989, Gladney was working with state officials in California to design a
prototype called Integrated Records Management (IRM). It incorporated a
Gladney-designed storage component called Document Storage Subsystem
(DocSS) to handle commercial tax accounts. IRM became a prototype digital
library, widely used by state and local governments, and DocSS has come to
form part of
VisualInfo(TM), which constitutes one cornerstone of current Digital
Library technology.
Typifying Research's approach to digital libraries, customers have played a
major role in defining the requirements for, as well as creating, the
components of a digital library. According to Robert Morris, senior manager
of data systems technology at Almaden, one serendipitous result of this
work has been the creation of a new model by which IBM conducts research in
the marketplace.
Technology for digital capture
The first step in creating a digital library is capturing the data
digitally - especially if the original form is analog, such as printed text
or a work of art. In 1987, an IBM team in Madrid, Spain, helped the Spanish
government digitize the fragile papers that chronicled Columbus's voyages
to the New World.
One year later, the staff of artist Andrew Wyeth asked IBM if researchers
could design a system to digitize and store images of the artist's
paintings. A team at Watson developed a digital camera and imaging
software. Then, working from color transparencies, they created a private
digital library of approximately 10,000 Wyeth works, now known as the
Brandywine project (see "Capture Technology: The Digital
Camera," below).
That camera technology played an essential role in a more ambitious,
ongoing project with the Vatican Library that began in 1993. As a first
phase, researchers scanned more than 20,000 manuscript pages into the
Vatican's digital archive. Currently, 10 select scholars throughout the
world can examine the manuscripts from desktop computers, and a sampling of
images is available on the IBM home page. "The broader goal," says Watson's
Fred Mintzer, "is to make these materials available to a worldwide
scholarly community. The Internet is a natural access medium for such a
collection."
A university solution
At the opposite extreme from the 10 Vatican scholars, in needs as well as
numbers, are students who might use a digital library in an educational
setting. Since 1994, a team of researchers from Almaden and Watson, headed
by Francis Parr of Watson, has worked with the Florida Center for Library
Automation to design a digital library system to serve the entire
University of Florida academic community. The heart of the system is
VisualInfo, adapted to distribute information on an IBM RISC
System/6000® Scalable POWERparallel System® SP2(TM) supercomputer.
A major challenge was to design a universal gateway to the VisualInfo
system - that is, an architecture that can accommodate queries from 200,000
students on 11 different campuses using every conceivable personal
computer. The team constructed an interface based on the World Wide Web,
which IBM presented to the university regents in September 1995.
Parr predicts that the university's digital library will contain several
hundred gigabytes of content by midyear, when limited supervised access at
library workstations is due to start. The Watson group, meanwhile, has
developed and is incorporating an advanced Web interface, known as Portal,
into the digital library. The Portal interface will permit easy access, via
text and image searches, to documents stored in VisualInfo.
Commercial applications
Meanwhile, a multidepartmental team at Almaden, led by Morris, Jim
McCrossin and Norm Pass, started a content distribution project originally
called Data Factory, and now a part of Digital Library. The effort was
based on an idea by Stephen Boyer, a member of the PC Server Group who is
based at Almaden, and carried out with support from the IBM PC Company. In
July 1994, the team began to design an EtM-DL system for the
Philadelphia-based Institute for Scientific Information (ISI), which
publishes indexes and abstracts of articles for scientific, technical and
medical literature. Almaden's EtM-DL system delivers an electronic version of
ISI's weekly Current Contents for the life sciences that includes an online
search capability and access to complete articles.
"Current Contents is the bibliographic data, or metadata, for 1,350
journals in the life sciences," explains project manager Laura Anderson.
"Libraries subscribe to a subset of those journals, and get the scanned
journal page images online, as well." Indeed, a key to the design is that a
user can search the metadata and retrieve desired articles for browsing and
printing.
The Almaden team developed a scalable, multitiered distribution
architecture for ISI, with a main server located at ISI and "campus
servers" at pilot locations that include universities, public libraries and
industrial laboratories. Lotus Notes® was used to support the campus
servers and clients, and Dick Dievendorff led the way for ISI programmers
to build a Lotus Notes application for Current Contents.
The core of the system was an extensive DB2(TM) bibliographic database
designed by David Choy, who also enabled ADSM (ADSTAR® Distributed
Storage Manager) to work coherently with DB2. ADSM can store a large amount
of data in such a way that more frequently used information can be accessed
at higher speeds.
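The idea of keeping frequently used information on faster storage can be
sketched in a few lines of Python. This is only an illustration of tiered
placement in general, not ADSM's actual design or interface: the class name
TieredStore, the two-tier layout and the promote-on-read policy are all
assumptions made for the example.

```python
from collections import OrderedDict


class TieredStore:
    """Toy two-tier store: a small fast tier in front of a large slow tier."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # recently read items, oldest first
        self.slow = {}             # everything else

    def put(self, key, value):
        # New items start in the slow tier; they earn a fast slot when read.
        self.fast.pop(key, None)
        self.slow[key] = value

    def get(self, key):
        """Return (value, tier it was found in), promoting it to the fast tier."""
        if key in self.fast:
            self.fast.move_to_end(key)  # mark as most recently used
            return self.fast[key], "fast"
        value = self.slow.pop(key)      # raises KeyError for an unknown key
        self.fast[key] = value
        if len(self.fast) > self.fast_capacity:
            # Demote the least recently used item back to the slow tier.
            old_key, old_value = self.fast.popitem(last=False)
            self.slow[old_key] = old_value
        return value, "slow"
```

In this sketch an item read twice in a row is served from the fast tier the
second time, while items that go unread drift back to the slow tier as others
displace them.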
Picking up patents online
In another project whose origins preceded any formal digital library
effort, Almaden has developed a system for making 20 years of patent
information, including illustrations, available online for searching,
viewing, printing and delivery through a Web browser, VM(TM) or Lotus
Notes, at all major IBM sites. Working with the British firm Derwent Patent
Information Co., the Almaden group has made accessible more than 1.7
million U.S. patents, dating back to 1975.
The idea arose out of work by Boyer in the late 1980s. "It started with an
executive at Novo Nordisk, the largest drug company in Denmark, who
expressed the desire for the ability to mount a patent search from his
desktop," says Boyer.
A key hurdle in the electronic delivery of patents is the need to use the
scanned image form, which contains 100 times more bits than the text-coded
form. Researchers at Almaden - notably Ray Holland, Mark Jackson and Kin
Wong - worked out an image technology for swift delivery from a library of
more than 2,000 CD-ROMs. And Tom Griffin has developed technology that
makes images compatible across a vast variety of endstations. "Using this
technology," says McCrossin, manager of the patent effort at Almaden, "we
can dynamically optimize the image format without having to pretranslate
all the images."
The impact of indexing
One little-appreciated aspect of digital library search technology is the
creation of metadata - indexing of the data being stored. Metadata allows
people to search the content of stored data rather than, for example, titles
alone, as in a traditional library. Search engines (see "Simplifying Search
Engines") must be able not only to
search, but also to index new data as soon as it is entered. Commercial
search engines can index 100 to 200 megabytes of text data per hour. A team
at Watson, led by Alan Marwick, has developed a prototype that uses
pattern-matching of text and statistical techniques to index up to a
gigabyte of data hourly.
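A minimal sketch of what indexing content means in practice: an inverted
index that maps each word to the documents containing it and accepts new
documents without a rebuild. It is a generic textbook structure, not the
Watson prototype (whose pattern-matching and statistical techniques are not
described here); the names InvertedIndex, add and search are hypothetical.

```python
import re
from collections import defaultdict


class InvertedIndex:
    """Maps each word to the set of document IDs whose text contains it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Index a document as soon as it is entered; no rebuild is needed.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            self.postings[word].add(doc_id)

    def search(self, query):
        """Return the IDs of documents containing every word in the query."""
        words = re.findall(r"[a-z0-9]+", query.lower())
        if not words:
            return set()
        result = set(self.postings.get(words[0], set()))
        for word in words[1:]:
            result &= self.postings.get(word, set())
        return result
```

Because the index covers the full text, a query can match words that never
appear in a title.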
Functions such as indexing, metadata and video create unique storage
demands. "As you go from text to images to audio to video, the amount of
storage goes up by an order of magnitude every time," explains Ellozy.
Advanced Solutions
"Digital Library," says Chiu, "has become the key integration point for a
host of innovative technologies. In the course of developing our EtM-DL
systems, we are pushing the envelope of leading-edge technology, learning
about customer requirements and bringing advanced solutions to market.
Following a very successful year in 1995, our goal is to continue to lead
with innovative industry solutions in 1996."
Stephen S. Hall is a freelance science writer based in Brooklyn, New York.
His latest book is entitled Mapping the Next Millennium.
Simplifying Search Engines
Each application of digital libraries depends crucially on the mix of its
component technologies. The aim of next-generation search engines, for
example, is to give users a set of tools that help them find what they need
in a collection without any specialized training. Alan Marwick, who heads a
research group at the Thomas J. Watson Research Center that specializes in
advanced text search technologies, says that another overall aim is ease of
use - ideally pointing and clicking one's way to desired information.
To that end, Watson researchers have developed more interpretative search
engines. Many IBM Digital Library projects, for example, use
SearchManager(TM), developed by the IBM software development group in
Böblingen, Germany. Recently, a module developed at Watson, called
Guru, has been incorporated into the package. Conceived by Yoelle Maarek
(now at the Haifa Research Laboratory), Guru has moved beyond traditional
keyword searches to queries in natural language.
Next-generation search engines also identify variations on the same proper
name using Nominator, a program developed by Yael Ravin, Nina Wacholder and
Misook Choi at Watson. And they use what Marwick calls a "context
thesaurus," developed by Roy Byrd at Watson, to find associations in texts
that would be useful to a searcher. Users often have difficulty finding
appropriate search words to describe their information needs. Rob Barrett,
at the Almaden Research Center, developed the Relevance and Trash (RAT)
system to alleviate this problem. With RAT, the computer can suggest ways
to improve searches after the user marks a few retrieved documents as
"relevant" or "trash."
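The feedback loop described above can be illustrated with a simple
term-suggestion routine. This is a sketch of relevance feedback in general,
not RAT's actual method, which the article does not detail; the function
name suggest_terms and the scoring rule (documents marked relevant minus
documents marked trash) are assumptions.

```python
import re
from collections import Counter


def suggest_terms(relevant, trash, query, limit=3):
    """Suggest extra search words from documents the user has marked.

    Each word scores +1 for every relevant document that contains it and
    -1 for every trash document that contains it. Words already in the
    query, and words with a score of zero or less, are never suggested.
    """
    def words(text):
        return set(re.findall(r"[a-z]+", text.lower()))

    score = Counter()
    for doc in relevant:
        score.update(words(doc))
    for doc in trash:
        score.subtract(words(doc))

    already = words(query)
    candidates = [(s, w) for w, s in score.items() if s > 0 and w not in already]
    # Highest score first; ties are broken alphabetically for stable output.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [w for _, w in candidates[:limit]]
```

A user searching for "jaguar" who marks rain-forest articles as relevant and
car-dealership articles as trash would, under this rule, be offered words
that are common in the first group and rare in the second.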
Search engines are not limited to text. One pilot digital library developed
at Watson pairs a text search engine with QBIC (Query by Image Content), an
engine developed at Almaden to search databases for images. And IBM
Research is working with DreamWorks SKG, Viacom and several other media
companies to develop the capacity for video archive searches.
Capture Technology: The Digital Camera
IBM's Digital Library initiative has consistently shown that customers'
needs beget technical innovation. "We started working on a digital camera
and high-quality color imaging as the capture technology for the Andrew
Wyeth project, and we've continued to enhance it," explains Fred Mintzer,
at the Thomas J. Watson Research Center.
The camera uses a CCD sensor chip akin to the sensitive light-gathering
devices in telescopes. That chip, designed at Watson and manufactured by
IBM, "allows us to capture images with high resolution and high
signal-to-noise ratio," says Mintzer.
Special color filters used with the digital camera achieve high-quality
color reproduction. "The filters were designed to simulate the color
response of human vision," says Mintzer. "Consequently, our color
reproduction is much better than film or other digital cameras that you see
in the marketplace."
A Vatican Library team is using two of the cameras to scan about 160 images
per day en route to digitizing additional holdings from the Vatican's
collection; about 30,000 images have been digitized thus far. Other cameras
are digitizing holdings at the Lutherhalle Wittenberg museum in Germany,
and at the National Gallery of Art, the Library of Congress's Federal
Theater Project and the Smithsonian Institution, in Washington, D.C.
Digital Library Project at Tokyo Research Laboratory
Researchers at the Tokyo Research Laboratory have been engaged in a digital
library joint project with the National Museum of Ethnology since 1986. The
aim, the creation of a digital museum, has resulted in several prototypes
that demonstrate the feasibility of advanced multimedia technologies, such
as color-image database construction, query by image content (QBIC), image
retrieval by colors and shapes, video indexing and multimedia museum
education. Recently, the team began developing what will be the first
global digital museum, which will link together over the Internet some of
the foremost institutions in the world.
TRL researchers have also developed a search tool called "Information
Outlining," which visualizes information retrieval processes so that users
can zero in on the target data. The prototype system contains a year's
worth of newspaper articles as a sample database. Once the user enters a
keyword, the number of articles containing it is displayed in the upper
portion of the screen, along with each article's date of publication and
part of its headline.
Various kinds of information about the retrieved data are displayed on the
lower portion. For example, if the search word is "Internet," the lower
portion might show a map of Japan indicating how many articles are relevant
to each prefecture. Users can select a given region by clicking on it with
a pointing device. In addition to geographical viewers, the prototype
offers chronological and topical viewers, which show the time periods and
topics covered in the retrieved articles.
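The counts behind the geographical and chronological viewers can be sketched
as follows. The record layout (the fields headline, date and prefecture) and
the function name outline are hypothetical, and for brevity the sketch
matches the keyword against headlines only rather than full article text; it
is not the Tokyo prototype's code.

```python
from collections import Counter


def outline(articles, keyword):
    """Summarize the articles whose headline contains the keyword.

    Each article is a dict with the assumed fields 'headline', 'date'
    (formatted 'YYYY-MM-DD') and 'prefecture'. Returns the matching
    articles plus counts per prefecture and per month, the raw numbers
    a geographical or chronological viewer would draw.
    """
    needle = keyword.lower()
    hits = [a for a in articles if needle in a["headline"].lower()]
    by_prefecture = Counter(a["prefecture"] for a in hits)
    by_month = Counter(a["date"][:7] for a in hits)
    return hits, by_prefecture, by_month
```

Selecting a region on the map would then amount to filtering the hits by one
prefecture and recomputing the other counts.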
Visible and Invisible Fingerprinting
A key component of almost any digital library system is what is known as
"rights management." This protects the owners of material from copyright
infringement. "Security is essential to people entrusting their documents
to a library," says Henry Gladney at the Almaden Research Center. "No
security, no digital library."
For the original Vatican Library project, Fred Mintzer's group at the
Thomas J. Watson Research Center developed the "visible-image watermark."
This technology alters the intensity of thousands of pixels in each image,
leaving an obvious, but nonobtrusive, seal of ownership on rare documents
(in this case, the seal of the Vatican Library). "We believe that, to
remove it, one would have to use a pixel-level editor on thousands of
pixels - a very time-consuming task at best," says Mintzer. "Although this
is not a total protection, it is strong encouragement to do the right
thing."
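One simple way to picture a visible watermark is to darken every pixel that
falls under the seal. The sketch below does only that; it is an assumption
about the general approach, not the Watson group's algorithm, and the
function name, the grayscale list-of-rows image format and the strength
parameter are all invented for illustration.

```python
def apply_visible_watermark(image, mask, strength=0.3):
    """Darken every image pixel that lies under the seal.

    image:    rows of grayscale intensities, 0 (black) to 255 (white).
    mask:     rows of 0/1 flags with the same shape; 1 marks the seal.
    strength: fraction by which marked pixels are darkened (assumed value).
    Returns a new image; the input is left unchanged.
    """
    if len(image) != len(mask) or any(
        len(row) != len(mrow) for row, mrow in zip(image, mask)
    ):
        raise ValueError("image and mask must have the same shape")
    return [
        [round(p * (1 - strength)) if m else p for p, m in zip(row, mrow)]
        for row, mrow in zip(image, mask)
    ]
```

Even in this toy form, undoing the mark means restoring each altered pixel
individually, which is the deterrent Mintzer describes.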
Jeffrey Lotspiech and Cynthia Dwork, researchers at Almaden, spearheaded
projects that spawned a number of technical innovations in rights
management, including the use of visible and invisible "fingerprinting" to
mark text in the ISI's Current Contents project.
The visible fingerprint involves superimposing a two-dimensional code,
similar to a supermarket bar code, on the bottom of each page of text
printed out by a user. The 1,000-bit fingerprint encodes the date, time,
identity of the user, and other data. If illegal copies are distributed,
they can be traced back to the source.
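To make the payload concrete, here is a sketch that packs a user identity
and timestamp into a fixed 1,000-bit string and reads them back. Only the
1,000-bit size comes from the article; the field layout, the "|" separator,
the zero-byte terminator and the function names are assumptions, and drawing
the bits as a two-dimensional code is left out entirely.

```python
PAYLOAD_BITS = 1000  # payload size quoted in the article


def encode_fingerprint(user_id, timestamp):
    """Pack 'user_id|timestamp' into a fixed-length string of 0s and 1s.

    user_id must not contain '|', and both fields must be ASCII.
    """
    data = f"{user_id}|{timestamp}".encode("ascii")
    bits = "".join(f"{byte:08b}" for byte in data)
    if len(bits) + 8 > PAYLOAD_BITS:
        raise ValueError("fields too long for the payload")
    # Zero padding fills the rest; the first zero byte ends the text.
    return bits.ljust(PAYLOAD_BITS, "0")


def decode_fingerprint(bits):
    """Recover (user_id, timestamp) from a payload made by encode_fingerprint."""
    data = bytearray()
    for i in range(0, len(bits) - 7, 8):
        byte = int(bits[i:i + 8], 2)
        if byte == 0:
            break
        data.append(byte)
    user_id, timestamp = data.decode("ascii").split("|", 1)
    return user_id, timestamp
```

A page printed by a given user at a given time would carry that payload, so a
stray copy could be decoded back to the printout it came from.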
An invisible fingerprint can be buried in the data, Lotspiech says, by
flipping six bits within a digitized image. "None of these things is
perfect," Lotspiech admits. "But you would have to take overt action to
defeat this method of tracking."
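The bit-flipping idea can be sketched in Python. The count of six comes from
the article; everything else is an assumption for illustration, including
choosing the positions with a pseudorandom generator seeded by a per-user
key, flipping only the lowest bit of each chosen pixel, and detecting the
mark by comparison with the unmarked original. It is not Lotspiech's actual
scheme.

```python
import random


def flip_positions(key, n_pixels, count=6):
    """Pick `count` distinct pixel indexes, reproducibly, from a key."""
    return sorted(random.Random(key).sample(range(n_pixels), count))


def embed_fingerprint(pixels, key, count=6):
    """Return a copy of `pixels` with the lowest bit flipped at keyed spots."""
    marked = list(pixels)
    for i in flip_positions(key, len(pixels), count):
        marked[i] ^= 1  # changes the intensity by exactly one level
    return marked


def matches_fingerprint(original, suspect, key, count=6):
    """True if `suspect` differs from `original` exactly at the keyed spots."""
    changed = [i for i, (a, b) in enumerate(zip(original, suspect)) if a != b]
    return changed == flip_positions(key, len(original), count)
```

A copy whose differences from the original line up with one user's positions
points to that user, and anyone trying to erase the mark would first have to
find six altered pixels among all the others.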
In a different approach, Marc Kaplan at Watson has developed an encryption
package known as Cryptolope, which requires a user to agree to honor the
copyright and not copy the material before it activates the software that
decrypts the requested information. "Cryptolope solves the problem of
distributing copyrighted information on the Web," says Jeffrey Crigler,
vice president, infoMarket, who cooked up the idea with Kaplan.
InfoMarket, a recently announced IBM service, allows content providers to
package their data in a secure container, called a Cryptolope, for sale on
the Web.