IBM's FLASH algorithm not only finds hidden similarities in newly sequenced DNA or proteins but also matches molecular structures and identifies fingerprints in seconds. Some day it might even help Deep Blue to win a chess
championship
In Brief:
One of the crucial challenges for computer vision programs is recognizing
items that are similar but not identical to objects in a database. To carry out
extremely fast similarity matching, IBM researchers created the FLASH algorithm.
They have recently applied it to a wide range of uses that have the goal of
matching a query object with similar objects from a database. Applications of
FLASH include matching new DNA and protein sequences with existing sequences
whose functions are known, matching molecular shapes and structures, and
identifying fingerprints for civil applications.
Scientists call them minutiae: the marks of distinction in the whorls of
ridges and valleys that make up your fingerprints. They are the points where the
ridges stop or start or split from one into two, and they are there for life.
Like fingerprints themselves, they are different on each hand and on each
finger, and not even identical twins share the same set. So if you want to help
a computer find two fingerprints that match among millions or billions that don't,
you might teach it to look for the minutiae, and ignore the rest.
This is what scientists at IBM's Thomas J. Watson Research Center have done,
in order to create a system that can compare a fingerprint with millions of
others in a database and find a match, if one exists, within seconds. The
fingerprint identification program is an offspring of an algorithm known as
FLASH (for Fast Look-up Algorithm for Structural Homology).
Starting as a computer vision program developed by Andrea Califano and
Rakesh Mohan, it is now used to help facilitate the design of chemicals and
drugs and for matching the structures of biological molecules or the sequences
of new genes and proteins - a task that is a key to the success of the
multibillion dollar Human Genome Project. FLASH's eventual uses, Califano says,
may include any searching applications in which the goal is to match two objects
in a database that may be similar but not necessarily identical.
Similar but not identical
In computer vision, explains Califano, who is now manager of Watson's
computational biology and pattern matching group, the problem is known as
model-based object recognition. "The fundamental thing is that computers
are not very good at identifying similarity," he explains. "They are
good at quickly determining identity: 'a equals b.' But with 'a
is similar to b,' they have real problems." While one may imagine that
the goal is to find identical matches in fingerprinting, for example,
differences in the way a fingerprint is taken means that each time it will be
subtly different. It is essential, therefore, to be able to determine that two
similar fingerprints, although not identical in every regard, are, in fact, the
same fingerprint.
When Califano and his colleagues began work on FLASH in the early 1990s, he
says, the majority of methods used just one way to see if a query object was
similar to another object in a database: "Start at the beginning of the
database and compare the object to every other object, one at a time."
FLASH is what's known as a probabilistic indexed algorithm. That is, it
looks for matches only in those places where it's likely to find them and
ignores the rest of the database. It does this by generating a table of indices
- that is, ways of representing both the database objects and the query
object that allow the database object to be quickly considered or rejected as a
candidate for the match. "You then go and check the candidates,"
says Califano. "It restricts your search to a few candidates rather than
thousands or millions."
Research staff member Bob Germain compares FLASH's operation to a
dictionary. "When you look up a word in a dictionary," he explains, "you
don't start at the As and flip through, saying: "Is this "zebra"?
No, and then go to next word with the same question and so on until you
finally get to "zebra"? "Instead you have a set of thumb tags in the
dictionary with the first letter or maybe the first couple of letters, and you
can flip immediately to the Z section where you know you're going to find
the word. That's analogous to the indexing strategy that we use. That's one of
the key characteristics of FLASH."
This strategy is not only fast, its performance degrades very slowly as the
size of the database increases, a feature that allows the approach to be scaled
up to handle very large databases.
After developing the program to compare two-dimensional images for computer
vision, Califano realized "that the same approach might be applied to
almost anything." Possibility number one was the Human Genome Project,
which is churning out new sequences of DNA at the rate of about a million
base-pairs a day. The result is strings of letters-A, G, C and T,
designating the four nucleotides of DNA - that together make up genes and,
when translated into matching strings of amino acids, proteins.
Knowing the sequences tells biologists little or nothing about the function.
So they have compiled huge databases of the sequences of known genes and the
structures and sequences of known proteins. These permit researchers to take a
new DNA sequence or a new protein and look for matching sequences whose
functions are already known. "It was taking several minutes, if not more,
to do a scan of databases with conventional techniques," says Califano. "I
wanted to reduce that to just a few seconds."
In 1992, Califano and research staff member Isidore Rigoutsos started work
on the sequence matching of DNA and proteins. This is a problem of similarity
and not identity, explains Rigoutsos, because sequences can have considerable
mutations along the way and yet still be similar enough to be biologically
related. "Depending on the extent of matches between your new sequence and
the sequences you already know," he says, "you can speculate about the
potential function of your new sequence."
Structure is function
With the sequence-matching algorithm down pat, Rigoutsos and Califano moved
on to develop an algorithm that would actually match the shapes of molecules.
Biology has a saying that structure is function: the shape a protein takes, for
example, is at least as critical to its function as the specific sequence of its
amino acids. Thus, it's as important to find proteins or molecules that can take
on the same shape as it is to find those with the same sequence.
The gist of the problem, says Califano, was to match the shape of molecules
if they were "floppy" -which means that any one molecule might be
found in an infinite variety of similar but not exact orientations. The
solution, says Rigoutsos, was to index molecules by invariants in their
structures - that is, to find which groups of molecules are rigid, and the
directions in which the "floppy" rotatable bonds would emerge out of
those rigid groups. "Basically, for every rigid group," Rigoutsos
continues, "we have a representation of where the group expects the rigid
bonds to be."
The IBM researchers are now collaborating with a major pharmaceutical
company, which hopes to use the system to design new drugs or to find new drug
candidates. Although the details are proprietary, Rigoutsos says that the
program can be used to find molecules that have the same shape as others whose
biological function is known, or whose biological function scientists might want
to imitate. "You may be able to produce an expected shape that will match a
specific receptor pocket," he says. "You might want to ask: do I have
anything in my repository of molecules to produce the shape I'm interested in
reproducing?"
The secrets of minutiae
As for fingerprints, that work started in the summer of 1995. Its aim: to
create a system for fast, efficient fingerprint checking for applications in
civil law.
"How do you guarantee someone has only a single identity in your
system, if you're giving out passports or welfare checks?" asks
Germain. "You want to make sure somebody doesn't have 12 passports under 12
different names or isn't collecting 12 different welfare checks under different
names. You want to check that this person is only in your system once."
Traditional systems were designed for use in law enforcement as an aid to
trained personnel. The systems are good at producing lists of possible suspects
by searching a fingerprint database, which an officer can then check. But civil
applications must be fully automated and capable of accurately determining
whether a given fingerprint is already in the database. Moreover, one must be
able to get an answer more quickly and for much larger databases than is
necessary in law enforcement.
The solution is in the minutiae. But the trick lies in extracting the right
information. The fingerprint extraction program determines the coordinates of
all the relevant minutiae and the direction that a given ridge happens to be
pointing when it comes to a minutia. It then turns that information into
invariants that can be indexed and saved for comparison.
For example, the algorithm will take combinations of three minutiae and turn
them into triangles on the fingerprint, which then serve as indices. "You
can use the lengths of sides of the triangle, for instance, as invariants,"
Germain explains. "By adding other types of invariant information, one can
increase the descriptivity of an index."
Eventually the program will generate a few thousand indices for any given
fingerprint, to be stored and labeled in what's known as a hash table. Each
label points back to the fingerprint that generated it. When a new fingerprint
comes along, the algorithm creates indices for that fingerprint and then looks
in the relevant indices for matches. "This is where we're only
searching a small fraction of the data," says Germain. "There might be
a billion possible index values we allow, but you only look at a few thousand,
like looking only at the As and Bs in the dictionary." As before, the
search time grows linearly with database size, but the rate of increase is
extremely small.
To test the algorithm, Germain purchased a test database of fingerprints.
Running on an 8-node RS/6000(TM) scalable parallel supercomputer, the program
is already running an order of magnitude faster than anything in the market
today. The search time grows slowly with the size of the database because users
are always searching roughly the same number of indices, no matter how big the
database. "The things we're thinking about," he says, "are
searching databases with a million or 10 million people."
An IBM solution utilizing FLASH for fingerprinting is being readied by a
team at the Hursley Development Laboratory in the United Kingdom, who are
working with business, marketing and services specialists and much larger
fingerprint databases to design systems that meet customers' business
needs.
Developing new applications
Meanwhile, IBM researchers are starting to explore new uses for FLASH. One
possibility is to use a variant of FLASH to improve the search capability of
Deep Blue, IBM's chessplaying supercomputer. The program would allow Deep Blue
to scan quickly through a large database of games, Germain says, "to find
positions similar or close to the one Deep Blue is trying to evaluate, so it
could take advantage of the games in the database to improve the accuracy of its
scoring functions."
FLASH can also be used to search more conventional databases: looking for
matches in medical and business records, for example. A company that wants to
convert records stored in various types of databases to a single database
without spending months on the task could use FLASH to do the job. "99
percent of the work can be done automatically," Califano says. "Wherever
there is fuzziness in the applications, you can write a small amount of
domain-specific work and in just a matter of weeks have something running on a
PC or SP that will do what you need. Then you can spend as much time as you want
fine-tuning the specifics."
Gary Taubes is a freelance writer based in Boston. His latest book is
entitledBad Science: The Short Life and Weird Times of Cold Fusion.
[ IBM Research home page ]