IBM Research

Think Research


 


Featured Concept
Decoding Proteins
COVER STORY: DARING TO THINK DEEP

By Bruce Schechter

In a few years, biologists will complete the momentous task of reading the entire human genome, the sequence of more than 3 billion symbols -- chemical bases -- that determine our biological natures. "That is when the real work will begin," says Isidore Rigoutsos, manager of the bioinformatics and pattern discovery group of IBM's Computational Biology Center. The genome is merely a recipe, a code, and understanding it will be one of the biggest scientific challenges of the coming millennium, one to which the algorithms and hardware of Deep Computing are particularly well suited.

The genetic code is written in an alphabet of four symbols. It is a program that directs the construction of proteins, the truly important molecules of life. And that has significant implications for the pharmaceutical industry. "When new drugs are developed, what they are targeting, with few exceptions, are proteins," says Barry Robson, IBM Distinguished Engineer and strategic adviser to the Computational Biology Center.

The critical fact about proteins is their shapes. Their nooks and crannies fit into one another like keys into locks, controlling the whole range of cellular processes. Every protein consists of some combination of the 20 different kinds of amino acids. But identifying the purpose of each protein sequence is a formidable task.

To gain this knowledge, researchers take advantage of the fact that evolution is parsimonious, using the same structures over and over. By looking for proteins whose amino-acid sequence is homologous to that of a protein whose structure is already known, scientists can make educated guesses about the unknown structure. Over the past decades, the genetic code of many organisms has become available, and many of the genes and the proteins they code for have been identified. In several cases, the structures and functions of these proteins have also been determined. Scientists have been sifting this growing database for patterns that may hold the key to protein structure. One of the latest and most promising techniques for finding patterns came about as a result of an accident.

"I fell off my bicycle and broke my back," says Rigoutsos. "Because I was in bed for three months, I had a lot of time to read." His review of the literature revealed that people had been trying to solve the problem of finding recurrent patterns in the structures of proteins or DNA by attempting to align sequences with one another. If several sequences matched around a location, scientists would take this as evidence for a pattern. Rigoutsos wondered if he could turn the process around and find patterns directly and then use the patterns to align sequences. He and Aris Floratos, another member of IBM's bioinformatics and pattern discovery group, devised a powerful algorithm they dubbed Teiresias (after the blind seer of Greek mythology) that did just that.

Teiresias finds patterns while making very few assumptions about what it is looking for. It has found uses outside biology in such areas as identifying attacks on computer systems and analyzing literary style. Using Teiresias, Rigoutsos and Floratos have compiled a "Bio-Dictionary" that may contain the key to understanding the language of the genes.
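
To make the idea of finding patterns directly, without first aligning sequences, a little more concrete, here is a minimal sketch in Python. It is not the Teiresias algorithm, whose details the article does not give; it is a much simpler stand-in that counts every fixed-length substring and keeps those shared by several sequences. The function name, the parameters k and min_support, and the toy sequences are all invented for illustration.

```python
from collections import defaultdict


def recurrent_patterns(sequences, k=3, min_support=2):
    """Map each length-k substring to the set of sequence indices containing it,
    keeping only substrings found in at least min_support distinct sequences.

    Illustrative only: a naive substring count, not Teiresias itself.
    """
    seen = defaultdict(set)
    for idx, seq in enumerate(sequences):
        for start in range(len(seq) - k + 1):
            seen[seq[start:start + k]].add(idx)
    return {p: ids for p, ids in seen.items() if len(ids) >= min_support}


# Toy "protein" strings made up for this example (one letter per amino acid).
toy = ["MKVLAAGIT", "GGVLAAPQ", "TTMKVLW"]
patterns = recurrent_patterns(toy, k=4, min_support=2)
print(patterns)
```

Once such shared patterns are known, the positions where they occur could serve as anchors for lining the sequences up, which is the reversal of the usual order that Rigoutsos describes.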

"Take a copy of the Wall Street Journal and remove all the spaces," Rigoutsos suggests to illustrate how the Bio-Dictionary was assembled. "You know the paragraphs, you know the symbols. The task is to find the words, but you do not know how many words are in the vocabulary, and you don't know the word boundaries. We have done the same thing for proteins." The "words" they have discovered constitute the basic vocabulary of proteins. Like human words, they link together according to rules to form sentences -- that is, proteins. The IBM researchers have begun to decipher the words in their Bio-Dictionary, to interpret what structural and functional features they represent. "The analogy to natural language appears to be deep," Rigoutsos explains. In fact, the techniques he and his colleagues are using resemble one that was applied by Michael Ventris in 1953 to decode the proto-Aegean script known as Linear B.

One of the biggest riddles Deep Computing could answer is how a strand of amino acids folds into a protein. "Nobody has yet simulated that process," Robson says. "It's a deep, fundamental problem. Until it is solved, you can't design interesting new proteins from scratch. More important still, if we can crack this problem from first principles, we can design new polymers and materials, and ultimately create molecular-scale devices."

Researchers at IBM are pursuing several complementary approaches to solving the protein-folding problem. Robson and Andrea Califano, program director of the Computational Biology Center -- together with researchers from many areas within IBM -- are combining pattern recognition methods with calculations of forces and energies between atoms.

A team led by Califano hunts for patterns in protein sequences by using an algorithm called Splash. The algorithm finds not only patterns that are identical but also those that are merely similar. "This is important in protein folding," Califano explains, "because many amino acids can be replaced by homologous ones without changing the three-dimensional structure of the proteins."
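
The difference between identical and merely similar matches can be sketched in a few lines of Python. This is not Splash, whose matching rules the article does not describe; it is a toy matcher that treats two amino acids as interchangeable when they fall in the same group. The groups below are a rough, assumed classification by chemical character chosen for illustration, and the names GROUPS, similar and find_similar are hypothetical.

```python
# Assumed, simplified similarity groups (one-letter amino-acid codes).
GROUPS = ["ILVM", "FWY", "KRH", "DE", "ST", "NQ"]
CLASS_OF = {aa: group for group in GROUPS for aa in group}


def similar(a, b):
    """True if the two residues are identical or belong to the same group."""
    return a == b or (a in CLASS_OF and CLASS_OF.get(a) == CLASS_OF.get(b))


def find_similar(pattern, sequence):
    """Return every start index where the pattern matches up to similarity."""
    k = len(pattern)
    return [
        i
        for i in range(len(sequence) - k + 1)
        if all(similar(p, s) for p, s in zip(pattern, sequence[i:i + k]))
    ]


# Made-up sequence: "LKD" occurs exactly once, and twice more in similar form.
hits = find_similar("LKD", "AVRELKDIHE")
print(hits)
```

An exact search would report only the single literal occurrence; allowing substitutions within a group also picks up "VRE" and "IHE" in this toy sequence, which is the kind of looser match Califano refers to.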

For his part, Robson is concentrating on the most fundamental approach, which attempts to calculate a protein's shape from first principles. The laws of physics dictate that a protein molecule must fold itself in such a way as to minimize its total free energy. The problem is that the total number of possible ways is astronomical. Consider a small protein consisting of 100 amino acids. Each bond between the amino acids can form in three or more ways, so that the entire space of all possible shapes contains more than 3^100, or 5 x 10^47, possibilities. "Searching this space is like searching for the best move in a chess game from among the vast number of possibilities," Robson says. The calculations will ultimately use IBM's most powerful machines, as well as special chips developed in Japan.
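
The size of that search space is easy to check. The short Python sketch below computes three to the 100th power exactly and then estimates how long a brute-force enumeration would take. The rate of 10^15 shapes per second is an arbitrary assumption made for the sake of the arithmetic, not a figure from the article or from any IBM machine.

```python
# Three choices for each of 100 bonds, as in the example above.
conformations = 3 ** 100

# Express the exact integer in scientific notation.
mantissa, exponent = f"{conformations:.2e}".split("e")
print(f"3^100 is about {mantissa} x 10^{int(exponent)}")

# Hypothetical evaluation rate, assumed purely for illustration.
SHAPES_PER_SECOND = 1e15
SECONDS_PER_YEAR = 365.25 * 24 * 3600

years = conformations / SHAPES_PER_SECOND / SECONDS_PER_YEAR
print(f"Exhaustive search at that rate: roughly {years:.1e} years")
```

Even at that assumed rate, the enumeration would run for vastly longer than the age of the universe, which is why exhaustive search is not an option.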

In the end, the researchers might be able to refine their algorithms enough to predict the folding of any protein structure, not just natural ones. This not only holds the promise of engineering new drugs. It could also allow the creation of unique, self-assembling molecular structures that could realize the dream of building molecular-scaled machines.

The original idea of nanotechnologists was to build nanoscale robots called "assemblers" that would construct molecular machinery. "Well, nature doesn't work by making these robots," Robson points out. Instead it specifies the linear sequence of amino acids in a protein and lets the laws of physics do the building for free. "The problem is how to do that on a general basis," Califano says. "If I want to build a hammer, how do I make my protein fold into a hammer? You can't solve that problem unless you understand how proteins fold." He and his colleagues are betting that the hammer of Deep Computing will enable scientists to do just that.

In fact, Deep Computing offers not just a hammer but an entire toolbox of techniques, technologies and philosophies. By joining the raw computing power and algorithmic virtuosity that were once the province of high-end scientific computing with the vast oceans of data typical of business computing, IBM scientists are forging a new discipline capable of solving real-world problems in all their complexity and depth.



Bruce Schechter, who has just completed a Knight Science Journalism Fellowship at MIT, is the author of My Brain Is Open: The Mathematical Journeys of Paul Erdös.

