Functional Genomics & Systems Biology Group

Our group investigates integrative approaches that combine functional and systems-level knowledge with more traditional genomic code and annotation information. Other projects include network inference, machine learning, topological analysis and simulation of biological pathways.


Announcing the DREAM project:
We are co-organizing an effort to facilitate work on Reverse Engineering cellular networks. We call this effort the DREAM Project (DREAM: Dialogue on Reverse Engineering Analysis and Methods).

Just released:

Minutes of the last DREAM Conference hosted by the NY Academy of Sciences




Background

The amount of biological information, and the rate at which it accumulates, is increasing rapidly, as demonstrated by the fact that there are now several hundred biological databases accessible from the World Wide Web. Each of these databases is devoted to a particular category in the hierarchy of biological organization (Figure 1). For example, NCBI's GenBank catalogs DNA data, the Swiss Institute of Bioinformatics' SWISS-PROT/TrEMBL is a repository of protein sequences, and Kyoto University's KEGG is a compilation of biochemical and genetic circuits.

The list of genes and proteins of an organism, however, constitutes only the ground level in a pyramid of biological complexity, as shown in Fig. 1 below. This pyramid is a metaphor for the hierarchy of structures out of which biological function (such as metabolism or replication) arises. Elements at each level in this hierarchy interact with each other to produce a higher level of organization, thus climbing up one step in the pyramid of bio-complexity. It follows that knowing the entire genome and the entire proteome (which is roughly where we stand now) is insufficient to understand the subtleties of biological function. In order to get a handle on function we need to understand how the building blocks at a given level of organization interact with each other: how proteins interact with both genes and proteins to produce molecular circuits; how these circuits interact with each other to allow for cellular function; how cells interact to produce tissues; how tissues form organs; and finally how organs work together to create a living being.

Figure 1 - This pyramid represents the many levels of organization in biology. Systems biology is mostly concerned with genes and proteins as they function in metabolic and signalling networks. Systems biology is also concerned with higher levels of organization of cells, tissues, organisms and up to systems of organisms.


An understanding of the behavior of biological systems at each level of their organization can only be achieved by careful study of the complex dynamical interactions between the components of these systems. For this understanding to be quantitative, it is necessary to develop structurally, biochemically and biophysically detailed mathematical models. Once developed, these models can be simulated, analyzed, and visualized through application of modern engineering and computational approaches, such as those pioneered by the E-Cell Project and the Center for Nonlinear Dynamics in Physiology and Medicine.

Why should IBM Research be interested in simulating life processes? This question can be answered by analogy with the problem of local weather prediction. Local weather is the result of large scale fluid motion and cloud microphysics. Although the equations governing these phenomena have been known for a long time, weather prediction continues to be a difficult problem even today. The reason for this is, in part, that weather is governed by non-linear partial differential equations, whose solutions depend subtly on the initial and boundary conditions. These solutions are very often counterintuitive, giving rise, e.g., to the phenomena of convection and fluid turbulence. IBM has recognized the basic-science importance and the potential business value of improved weather forecasting through funding of the Deep Thunder project, one of a few selected endeavors undertaken by the IBM Deep Computing Institute. Along the same lines, biological systems are governed by stochastic, non-linear partial differential equations, and as such, their solutions can only be produced by means of massive computation. As a matter of fact, scientists do not yet know the precise set of equations or the boundary conditions that govern most biological processes. The basic science challenge is thus there for us to conquer. The potential business value of biological-system modeling can be understood in that biology is becoming a major factor in medicine and health, which together encompass a market of over a trillion dollars.


Overview of our projects
[Figure 2 panel labels: Data collection; Data mining and analysis; Reverse engineering; Biological simulation]
Figure 2 - The diagram above illustrates a cycle of steps in systems biology research. Starting at the left is the data collection step. A two-dimensional mass spectrometry proteomics data set is shown for illustration. From here, various types of analysis are performed to extract relevant biological information. Shown are patterns discovered in gene expression data in cancer and non-cancer patients. The step on the right, labelled reverse engineering, refers to determining the relationships amongst biological entities. The network corresponds to gene regulation in E. coli bacteria, in which the nodes represent genes and the edges correspond to causal regulatory connections (i.e., an upstream gene can activate or repress downstream genes). The bottom step is a biological simulation in which the causal relationships have additional detail. Here, the relationships between entities are given by mathematical equations. The development of a simulation project generally raises new questions or the need for model refinement, so that new data need to be collected. This step completes the cycle.

The diagram above illustrates a cycle of steps in systems biology research. We work in each of these areas because each is important for understanding biological systems. These steps are now described in more detail:

Data collection

Starting at the left side, biological data is collected. A two-dimensional mass spectrometry proteomics data set is illustrated in Fig. 2. In principle, however, our group is interested in any kind of experimental data, as no one type of data fully characterizes a system. Our group has also worked extensively on gene chips, a high throughput data source that assesses gene transcript levels across hundreds or thousands of genes. The group is also interested in other emerging high throughput technologies, such as ChIP-chip.

Data mining and analysis

The emergence of high throughput data presents both opportunities and challenges. While gene expression chips can assay the transcript levels across the whole genome, difficulties can arise in isolating the relevant biological information amongst thousands of fluctuating signals. These fluctuations can arise because of either biological variability or noise inherent to the measuring technology. One of the main areas of research for our group is to understand and separate the different sources of noise in different high throughput measurement technologies. For more information, please see our Gene Expression Analysis Project Page.
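One common way to tell the two sources of fluctuation apart is a replicate design, in which each biological sample is measured several times. The sketch below is only an illustration under that assumption and is not a description of the group's own method; the function name variance_components and all the numbers are hypothetical.

```python
from statistics import mean, variance

def variance_components(replicates):
    """Estimate biological and technical variance for one gene.

    `replicates` is a list of lists: each inner list holds repeated
    (technical) measurements of the same biological sample. Assumes a
    balanced design, i.e. every sample has the same number of replicates.
    """
    n_rep = len(replicates[0])
    # Spread among repeated measurements of one sample reflects measurement noise.
    technical = mean([variance(r) for r in replicates])
    # Spread among sample means mixes biological variability with averaged noise,
    # so the averaged technical part is subtracted (and the result clipped at zero).
    sample_means = [mean(r) for r in replicates]
    biological = max(variance(sample_means) - technical / n_rep, 0.0)
    return biological, technical

# Hypothetical expression values: 3 biological samples x 3 technical replicates.
example = [[10.0, 10.2, 9.8], [12.0, 12.2, 11.8], [8.0, 8.2, 7.8]]
biological_var, technical_var = variance_components(example)
print(biological_var, technical_var)
```

In this made-up data set the samples differ from each other far more than repeated measurements of the same sample do, so most of the observed fluctuation would be attributed to biological variability.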

Imagine for a moment a somewhat analogous problem of understanding macroeconomics based on the thousands of stock prices that can be tracked. When trying to assess the state of the economy, one can look at statistical measures that track one or more stocks. Looking at a single stock will give limited information. Looking at multiple stock prices will give more information, but there is no a priori way to assess which weighted combination of stock prices will be most revealing. An obvious strategy would be to look for stocks that track together in phase as important indicators. Note, however, that such patterns do not alone indicate causality, i.e. the stocks may track together because one stock influences the others, or the whole group may be under the influence of unseen forces. Finally, stock prices are limited and may only reveal some aspects of macroeconomics, which also encompasses factors such as labor, currency markets, and transportation issues that are not directly indicated by stock prices alone.

For gene expression, we use pattern discovery to look for groups of genes that show similar expression levels across different sets of conditions or across different groups of subjects. A free downloadable software package called Genes@Work allows the user to do pattern discovery in gene expression data. Genes@Work also contains a module with machine learning techniques, which can be used to find classifiers that best discriminate between two conditions such as disease vs. healthy states. Knowing a subset of genes whose expression level best discriminates between different phenotypes could lead to the development of diagnostic tests.
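To make the idea of genes that discriminate between two conditions concrete, here is a minimal sketch that ranks genes by a simple signal-to-noise score. It is not the algorithm used by Genes@Work; the score, the function names, the gene names and the expression values are all assumptions chosen for illustration.

```python
from statistics import mean, stdev

def discrimination_score(group_a, group_b, eps=1e-9):
    """Difference of group means divided by the summed standard deviations.

    `eps` only guards against division by zero when both groups are constant.
    """
    return (mean(group_a) - mean(group_b)) / (stdev(group_a) + stdev(group_b) + eps)

def rank_genes(expression, labels, positive="disease"):
    """Order genes from most to least discriminating between the two labels.

    `expression` maps a gene name to one value per sample;
    `labels` gives the condition of each sample in the same order.
    """
    scores = {}
    for gene, values in expression.items():
        in_group = [v for v, lab in zip(values, labels) if lab == positive]
        out_group = [v for v, lab in zip(values, labels) if lab != positive]
        scores[gene] = discrimination_score(in_group, out_group)
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)

# Hypothetical data: three disease samples followed by three healthy samples.
labels = ["disease"] * 3 + ["healthy"] * 3
expression = {
    "geneA": [8.1, 7.9, 8.3, 2.0, 2.2, 1.9],  # clearly higher in disease
    "geneB": [5.0, 5.2, 4.9, 5.1, 5.0, 4.8],  # essentially unchanged
    "geneC": [3.0, 6.0, 4.5, 4.0, 5.5, 3.5],  # noisy, no clear difference
}
ranking = rank_genes(expression, labels)
print(ranking)
```

A real analysis would also need to correct for testing thousands of genes at once and to validate any candidate classifier on samples that were not used to select the genes.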

While finding important patterns in biological data can be revealing, important gaps in knowledge can remain. As described in the stock analogy, finding correlations or patterns in data does not imply causality. The observation that expression levels of multiple genes track together does not necessarily mean that one gene directly influences the others in the group. Alternatively, the whole group of genes may be under the control of unseen factors. Moreover, much of the biology in cells involves changes that do not occur at the level of gene expression. As such, systems biology must be able to deal with limitations of the data at hand. These features are explored further in the next sections.
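The point about unseen factors can be illustrated with a small simulation. In the sketch below, two hypothetical genes never influence each other; both simply follow a hidden regulator plus independent noise. All names, noise levels and the random seed are assumptions made for the example.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

rng = random.Random(0)  # fixed seed so the run is repeatable
hidden = [rng.gauss(0.0, 1.0) for _ in range(200)]   # unseen regulator
gene1 = [h + rng.gauss(0.0, 0.3) for h in hidden]    # driven by the regulator
gene2 = [h + rng.gauss(0.0, 0.3) for h in hidden]    # also driven by it

# The two genes look tightly coupled...
raw_correlation = pearson(gene1, gene2)

# ...but once the regulator's contribution is removed, little is left.
residual1 = [g - h for g, h in zip(gene1, hidden)]
residual2 = [g - h for g, h in zip(gene2, hidden)]
adjusted_correlation = pearson(residual1, residual2)

print(raw_correlation, adjusted_correlation)
```

In practice the hidden factor is, by definition, not measured, so this adjustment is usually unavailable; the sketch only shows why a strong correlation between two genes is compatible with no direct interaction between them.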

Reverse engineering

A fundamental notion in systems biology is that life derives from the dynamic interactions that occur in living organisms. Hence knowing all the genes and proteins that participate in these interactions will not be enough, and one must also understand how these entities interact to perform complex cellular processes. As shown in Fig. 2, work in our group involves several projects to better understand the network of interactions that occur in biological systems. A widely held goal of systems biology is to collect data on a biological system, and then use the data to reconstruct or reverse engineer the biological interactions that produced the data. This task is often referred to as network inference.

The network on the right side of Fig. 2 shows a gene regulatory network in E. coli bacteria in which the nodes represent genes and the edges correspond to causal regulatory connections between those genes. This illustrates an important difference between the reverse engineering step and the data mining step. In the data mining step, one is looking for correlations in the data (i.e. genes whose expression is trending up or down in synchrony). In reverse engineering, one wants to use data to try to determine which genes act as leaders and which genes are downstream targets. This can be a very difficult problem, especially as random noise or hidden influences tend to confound the analysis techniques.
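One crude heuristic for guessing which of two genes leads, when time series are available, is to compare time-lagged correlations in both directions. The sketch below uses simulated data and is only meant to illustrate the leader versus target distinction; it is not the group's inference method, and the names, lag, noise level and seed are assumptions.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def lagged_correlation(source, target, lag=1):
    """Correlation between `source` at time t and `target` at time t + lag."""
    return pearson(source[:-lag], target[lag:])

rng = random.Random(1)  # fixed seed so the run is repeatable
n_points = 300
leader = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
# The follower repeats the leader's previous value plus measurement noise;
# the first time point has no predecessor, so it is noise only.
follower = [rng.gauss(0.0, 0.3)] + [
    leader[t - 1] + rng.gauss(0.0, 0.3) for t in range(1, n_points)
]

forward = lagged_correlation(leader, follower)   # leader now vs. follower next
backward = lagged_correlation(follower, leader)  # follower now vs. leader next
print(forward, backward)
```

The asymmetry between the two values suggests a direction here only because the sampling interval matches the lag and nothing else drives the two series. With coarser sampling, stronger noise or a hidden common regulator, the same heuristic can easily point the wrong way, which is part of what makes the problem difficult.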

Another line of research involves topology-based analysis of networks. In this work, complicated networks (such as the one on the right of Fig. 2) are decomposed into smaller subgraphs that appear repeatedly in the larger graph. Here the goal is to find the smaller, repeating "modules" that can be composed together to form larger networks (analogous to repeated logic gates that can be composed into a microprocessor). The repeating modules may constitute functionally related elements that are conserved through evolution. One line of reasoning in systems biology is that large and complicated signalling networks can be better understood as functional modules.
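As a small illustration of looking for a repeated subgraph, the sketch below counts one well-known three-node pattern, the feed-forward loop (X regulates Y, Y regulates Z, and X also regulates Z directly), in a directed graph given as a list of edges. The toy network and node names are hypothetical, and the brute-force search is only suitable for small graphs.

```python
from itertools import permutations

def count_feed_forward_loops(edges):
    """Count node triples (x, y, z) with edges x->y, y->z and x->z.

    `edges` is an iterable of (regulator, target) pairs.
    """
    edge_set = set(edges)
    nodes = {node for edge in edge_set for node in edge}
    count = 0
    # Try every ordered triple of distinct nodes; fine for toy networks,
    # far too slow for genome-scale ones.
    for x, y, z in permutations(nodes, 3):
        if (x, y) in edge_set and (y, z) in edge_set and (x, z) in edge_set:
            count += 1
    return count

# Hypothetical regulatory network with two feed-forward loops:
# (g1, g2, g3) and (g3, g4, g5).
toy_network = [
    ("g1", "g2"), ("g2", "g3"), ("g1", "g3"),
    ("g3", "g4"), ("g4", "g5"), ("g3", "g5"),
    ("g5", "g6"),
]
n_loops = count_feed_forward_loops(toy_network)
print(n_loops)
```

Deciding whether a pattern is a meaningful module usually also requires comparing its count against randomized networks with similar connectivity, which this sketch does not attempt.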

Biological simulation

After one understands the causal relationships in a biological system, the next level of description is to develop simulations. In this step, one must formulate mathematical equations to explicitly describe the relationships between the entities in the biological system. The process is often difficult for a number of reasons. First, the experimental data is often too incomplete to define accurate mathematical formulations. Second, systems cannot be modeled from first principles. With our supercomputers, we can now simulate some large biological molecules such as rhodopsin using molecular dynamics (see the rhodopsin simulation from the Blue Gene Project Page). However, these molecular dynamics-based approaches are far too computationally expensive to scale up to the thousands or millions of molecules and proteins that exist in even the simplest of biological systems. Hence, for the foreseeable future, we must work with fairly abstract and parsimonious representations of biological models.
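As an example of what an abstract and parsimonious representation can look like, the sketch below describes the expression of a single gene with two ordinary differential equations, dm/dt = k_m - d_m*m for the mRNA level and dp/dt = k_p*m - d_p*p for the protein level, and integrates them with the forward Euler method. The equations are a standard textbook simplification rather than a model from this group, and all rate constants are hypothetical.

```python
def simulate_expression(k_m, d_m, k_p, d_p, t_end=50.0, dt=0.01):
    """Integrate a two-variable gene expression model with forward Euler.

    k_m: mRNA synthesis rate      d_m: mRNA degradation rate
    k_p: translation rate         d_p: protein degradation rate
    Returns the mRNA and protein levels at time t_end, starting from zero.
    """
    m, p = 0.0, 0.0
    for _ in range(int(round(t_end / dt))):
        dm = k_m - d_m * m        # constant transcription, first-order decay
        dp = k_p * m - d_p * p    # translation proportional to mRNA level
        m += dt * dm
        p += dt * dp
    return m, p

# Hypothetical rate constants in arbitrary units.
m_final, p_final = simulate_expression(k_m=2.0, d_m=1.0, k_p=5.0, d_p=0.5)
print(m_final, p_final)
```

For these equations the long-time levels can also be found by hand, m = k_m/d_m and p = k_p*k_m/(d_m*d_p), which provides a check on the numerical result. Models of real systems rarely offer such a check, and their parameters usually have to be estimated from incomplete data.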

Our group has pursued simulation projects along two main lines of research. The first is the P53 system, which suppresses tumors by arresting the cell cycle when chromosomal DNA is damaged, allowing specialized enzymes to repair the DNA breaks. When the damage is too severe to be repaired, the P53 system induces apoptosis. The critical role of P53 in suppressing tumors is underscored by the observation that over 50% of all cancers involve a disruption of this system. In response to DNA damage, the protein levels of P53 and its antagonist MDM2 oscillate roughly out of phase with respect to each other. To see more about this modeling effort, please see the Cancer Modeling Project Page.
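A negative feedback loop of this kind can produce oscillations even in a very simple model. The sketch below is a deliberately minimal linear toy, not the model described on the Cancer Modeling Project Page: P53 is assumed to be produced at a constant rate and removed at a rate proportional to the MDM2 level, while MDM2 is produced in proportion to P53 and decays on its own. The equations, parameter values and function names are all assumptions.

```python
def simulate_feedback(beta=1.0, alpha=1.0, kappa=1.0, delta=0.2,
                      p0=0.3, m0=1.0, t_end=40.0, dt=0.01):
    """Forward Euler integration of a toy P53/MDM2 negative feedback loop.

    dp/dt = beta - alpha * m       (P53: constant production, MDM2-driven removal)
    dm/dt = kappa * p - delta * m  (MDM2: induced by P53, first-order decay)
    Returns the full time courses of p and m as lists.
    """
    p, m = p0, m0
    p_trace, m_trace = [p], [m]
    for _ in range(int(round(t_end / dt))):
        dp = beta - alpha * m
        dm = kappa * p - delta * m
        p += dt * dp
        m += dt * dm
        p_trace.append(p)
        m_trace.append(m)
    return p_trace, m_trace

def count_crossings(trace, level):
    """Number of times the trace passes through `level`."""
    return sum(1 for a, b in zip(trace, trace[1:]) if (a - level) * (b - level) < 0)

p_trace, m_trace = simulate_feedback()
# Steady state of the toy model: m* = beta / alpha, p* = delta * beta / (alpha * kappa).
p_steady = 0.2 * 1.0 / (1.0 * 1.0)
p_crossings = count_crossings(p_trace, p_steady)
print(p_crossings, min(p_trace), max(p_trace))
```

With these parameters the P53 level repeatedly overshoots and undershoots its steady state while the oscillation slowly dies away, and the MDM2 level peaks at different times than P53. Sustained oscillations generally call for additional ingredients, such as time delays or nonlinear terms, that this toy omits.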

The second main area of modeling is cardiac muscle. In this work, we seek to understand fundamental properties of cardiac muscle. Here the difficulties arise because subtle changes at the molecular scale can manifest as critical changes in behavior at the whole heart level. Moreover, drugs act at the level of molecules and proteins, so being able to accurately predict drug effects at the whole heart level will require models that bridge many orders of magnitude in spatial and temporal scale. The goals of this work are to provide better models that can aid in understanding disease processes and potential therapeutics. Despite some major advances, heart and cardiovascular diseases remain leading killers in the developed world. To see more about this modeling effort, please see the Multiscale Cardiac Modeling Project Page.

Researchers:
To carry out these projects, the Functional Genomics and Systems Biology Group has assembled a diverse set of researchers located mainly at IBM T.J. Watson Research Center. We also collaborate with other groups within the Computational Biology Center and IBM Research as well as Healthcare and Life Sciences. In addition, we have a number of academic and industrial collaborators.

Here is a quick tour of our group:

We also work closely with a number of researchers in other groups in IBM Research.

Projects:
We are currently working on several projects, including:

Last updated 1 Jun 2007