IBM®
Skip to main content
    Country/region change    Terms of use
 
 
 
    Home    Products    Services & solutions    Support & downloads    My account    
IBM Research

Think Research


 


Featured Concept
New strategy for the matching game

By Gary Taubes

IBM's FLASH algorithm not only finds hidden similarities in newly sequenced DNA or proteins but also matches molecular structures and identifies fingerprints in seconds. Some day it might even help Deep Blue to win a chess championship

In Brief:

One of the crucial challenges for computer vision programs is recognizing items that are similar but not identical to objects in a database. To carry out extremely fast similarity matching, IBM researchers created the FLASH algorithm. They have recently applied it to a wide range of uses that have the goal of matching a query object with similar objects from a database. Applications of FLASH include matching new DNA and protein sequences with existing sequences whose functions are known, matching molecular shapes and structures, and identifying fingerprints for civil applications.


Scientists call them minutiae: the marks of distinction in the whorls of ridges and valleys that make up your fingerprints. They are the points where the ridges stop or start or split from one into two, and they are there for life. Like fingerprints themselves, they are different on each hand and on each finger, and not even identical twins share the same set. So if you want to help a computer find two fingerprints that match among millions or billions that don't, you might teach it to look for the minutiae, and ignore the rest.

This is what scientists at IBM's Thomas J. Watson Research Center have done, in order to create a system that can compare a fingerprint with millions of others in a database and find a match, if one exists, within seconds. The fingerprint identification program is an offspring of an algorithm known as FLASH (for Fast Look-up Algorithm for Structural Homology).

Starting as a computer vision program developed by Andrea Califano and Rakesh Mohan, it is now used to help facilitate the design of chemicals and drugs and for matching the structures of biological molecules or the sequences of new genes and proteins - a task that is a key to the success of the multibillion dollar Human Genome Project. FLASH's eventual uses, Califano says, may include any searching applications in which the goal is to match two objects in a database that may be similar but not necessarily identical.

Similar but not identical

In computer vision, explains Califano, who is now manager of Watson's computational biology and pattern matching group, the problem is known as model-based object recognition. "The fundamental thing is that computers are not very good at identifying similarity," he explains. "They are good at quickly determining identity: 'a equals b.' But with 'a is similar to b,' they have real problems." While one may imagine that the goal is to find identical matches in fingerprinting, for example, differences in the way a fingerprint is taken means that each time it will be subtly different. It is essential, therefore, to be able to determine that two similar fingerprints, although not identical in every regard, are, in fact, the same fingerprint.

When Califano and his colleagues began work on FLASH in the early 1990s, he says, the majority of methods used just one way to see if a query object was similar to another object in a database: "Start at the beginning of the database and compare the object to every other object, one at a time."

FLASH is what's known as a probabilistic indexed algorithm. That is, it looks for matches only in those places where it's likely to find them and ignores the rest of the database. It does this by generating a table of indices - that is, ways of representing both the database objects and the query object that allow the database object to be quickly considered or rejected as a candidate for the match. "You then go and check the candidates," says Califano. "It restricts your search to a few candidates rather than thousands or millions."

Research staff member Bob Germain compares FLASH's operation to a dictionary. "When you look up a word in a dictionary," he explains, "you don't start at the As and flip through, saying: "Is this "zebra"? No, and then go to next word with the same question and so on until you finally get to "zebra"? "Instead you have a set of thumb tags in the dictionary with the first letter or maybe the first couple of letters, and you can flip immediately to the Z section where you know you're going to find the word. That's analogous to the indexing strategy that we use. That's one of the key characteristics of FLASH."

This strategy is not only fast, its performance degrades very slowly as the size of the database increases, a feature that allows the approach to be scaled up to handle very large databases.

After developing the program to compare two-dimensional images for computer vision, Califano realized "that the same approach might be applied to almost anything." Possibility number one was the Human Genome Project, which is churning out new sequences of DNA at the rate of about a million base-pairs a day. The result is strings of letters-A, G, C and T, designating the four nucleotides of DNA - that together make up genes and, when translated into matching strings of amino acids, proteins.

Knowing the sequences tells biologists little or nothing about the function. So they have compiled huge databases of the sequences of known genes and the structures and sequences of known proteins. These permit researchers to take a new DNA sequence or a new protein and look for matching sequences whose functions are already known. "It was taking several minutes, if not more, to do a scan of databases with conventional techniques," says Califano. "I wanted to reduce that to just a few seconds."

In 1992, Califano and research staff member Isidore Rigoutsos started work on the sequence matching of DNA and proteins. This is a problem of similarity and not identity, explains Rigoutsos, because sequences can have considerable mutations along the way and yet still be similar enough to be biologically related. "Depending on the extent of matches between your new sequence and the sequences you already know," he says, "you can speculate about the potential function of your new sequence."

Structure is function

With the sequence-matching algorithm down pat, Rigoutsos and Califano moved on to develop an algorithm that would actually match the shapes of molecules. Biology has a saying that structure is function: the shape a protein takes, for example, is at least as critical to its function as the specific sequence of its amino acids. Thus, it's as important to find proteins or molecules that can take on the same shape as it is to find those with the same sequence.

The gist of the problem, says Califano, was to match the shape of molecules if they were "floppy" -which means that any one molecule might be found in an infinite variety of similar but not exact orientations. The solution, says Rigoutsos, was to index molecules by invariants in their structures - that is, to find which groups of molecules are rigid, and the directions in which the "floppy" rotatable bonds would emerge out of those rigid groups. "Basically, for every rigid group," Rigoutsos continues, "we have a representation of where the group expects the rigid bonds to be."

The IBM researchers are now collaborating with a major pharmaceutical company, which hopes to use the system to design new drugs or to find new drug candidates. Although the details are proprietary, Rigoutsos says that the program can be used to find molecules that have the same shape as others whose biological function is known, or whose biological function scientists might want to imitate. "You may be able to produce an expected shape that will match a specific receptor pocket," he says. "You might want to ask: do I have anything in my repository of molecules to produce the shape I'm interested in reproducing?"

The secrets of minutiae

As for fingerprints, that work started in the summer of 1995. Its aim: to create a system for fast, efficient fingerprint checking for applications in civil law.

"How do you guarantee someone has only a single identity in your system, if you're giving out passports or welfare checks?" asks Germain. "You want to make sure somebody doesn't have 12 passports under 12 different names or isn't collecting 12 different welfare checks under different names. You want to check that this person is only in your system once."

Traditional systems were designed for use in law enforcement as an aid to trained personnel. The systems are good at producing lists of possible suspects by searching a fingerprint database, which an officer can then check. But civil applications must be fully automated and capable of accurately determining whether a given fingerprint is already in the database. Moreover, one must be able to get an answer more quickly and for much larger databases than is necessary in law enforcement.

The solution is in the minutiae. But the trick lies in extracting the right information. The fingerprint extraction program determines the coordinates of all the relevant minutiae and the direction that a given ridge happens to be pointing when it comes to a minutia. It then turns that information into invariants that can be indexed and saved for comparison.

For example, the algorithm will take combinations of three minutiae and turn them into triangles on the fingerprint, which then serve as indices. "You can use the lengths of sides of the triangle, for instance, as invariants," Germain explains. "By adding other types of invariant information, one can increase the descriptivity of an index."

Eventually the program will generate a few thousand indices for any given fingerprint, to be stored and labeled in what's known as a hash table. Each label points back to the fingerprint that generated it. When a new fingerprint comes along, the algorithm creates indices for that fingerprint and then looks in the relevant indices for matches. "This is where we're only searching a small fraction of the data," says Germain. "There might be a billion possible index values we allow, but you only look at a few thousand, like looking only at the As and Bs in the dictionary." As before, the search time grows linearly with database size, but the rate of increase is extremely small.

To test the algorithm, Germain purchased a test database of fingerprints. Running on an 8-node RS/6000(TM) scalable parallel supercomputer, the program is already running an order of magnitude faster than anything in the market today. The search time grows slowly with the size of the database because users are always searching roughly the same number of indices, no matter how big the database. "The things we're thinking about," he says, "are searching databases with a million or 10 million people."

An IBM solution utilizing FLASH for fingerprinting is being readied by a team at the Hursley Development Laboratory in the United Kingdom, who are working with business, marketing and services specialists and much larger fingerprint databases to design systems that meet customers' business needs.

Developing new applications

Meanwhile, IBM researchers are starting to explore new uses for FLASH. One possibility is to use a variant of FLASH to improve the search capability of Deep Blue, IBM's chessplaying supercomputer. The program would allow Deep Blue to scan quickly through a large database of games, Germain says, "to find positions similar or close to the one Deep Blue is trying to evaluate, so it could take advantage of the games in the database to improve the accuracy of its scoring functions."

FLASH can also be used to search more conventional databases: looking for matches in medical and business records, for example. A company that wants to convert records stored in various types of databases to a single database without spending months on the task could use FLASH to do the job. "99 percent of the work can be done automatically," Califano says. "Wherever there is fuzziness in the applications, you can write a small amount of domain-specific work and in just a matter of weeks have something running on a PC or SP that will do what you need. Then you can spend as much time as you want fine-tuning the specifics."


Gary Taubes is a freelance writer based in Boston. His latest book is entitledBad Science: The Short Life and Weird Times of Cold Fusion.


[ IBM Research home page ]



    About IBMPrivacyContact