Overview

In Broad Strokes


The work of the Bioinformatics & Pattern Discovery Group focuses on a number of theoretical and applied problems that are of relevance to computational molecular biology.

As a group, we focus on the following areas:
  • we develop algorithms for mining data in the absence of prior knowledge about the nature of the data. As such, our algorithms are generic and consequently applicable to a very large number of the problems one encounters in the Life Sciences. The development of data mining algorithms has in fact been a long-standing effort of the group;

  • we generate practical solutions that are based on our algorithms, in order to showcase the great value that our methods hold for solving many problems from the Life Sciences, including ones of critical importance. In fact, the solutions that we have targeted have been moving increasingly closer to the problems that interest the end practitioner (e.g. a bench scientist in a pharmaceutical setting, a hands-on academic researcher, etc.);

  • we develop methodologies that are of relevance to molecular dynamics and use simulations to study and understand important biological systems;
  • we generate and freely make available content that we derive from biological repositories.


  • RNA interference (RNAi)


    Probably one of the best-discussed examples of an early observation of this phenomenon pertains to Rich Jorgensen’s effort in the late 1980s to engineer deep purple petunias by introducing extra copies of chalcone synthase, a key enzyme in anthocyanin biosynthesis, in the form of double-stranded RNA. This resulted in engineered petunias that turned out white or patterned. In the late ’90s, scientists realized that what was responsible for the results of the petunia experiments was a mechanism now known as post-transcriptional gene silencing (PTGS). The mechanism decreased expression of both the introduced and the endogenous copy of the chalcone synthase gene, thus leading to white or patterned petunias. Since Jorgensen’s serendipitous discovery, scientists have learned a lot about PTGS and RNAi. Initially, RNAi was thought to serve as a defense mechanism that organisms devised to protect themselves from the activity of viruses and transposable elements that have invaded their genomes. Nowadays, however, RNAi is believed to be a very important element in running the biological processes that we have come to know from our decades-long studies of cells.

    Through our work, we try to analyze and address questions such as "how many microRNAs are encoded by a given genome?" and "given a microRNA, how many and which are its targets?". To this end, we have developed "rna22", a pattern-based method for addressing these questions. Our computational analyses to date suggest several hypotheses that paint a picture of cell regulation that is substantially different from what is currently believed. First, there may be as many as a few tens of thousands of endogenously encoded microRNAs (and their respective precursors) in the human genome. Second, as many as 90% of the known protein-coding human genes may be targets of one or more microRNAs. And third, a microRNA may target as many as a few thousand genes.

    "Junk" DNA


    In recent work, we described a large-scale computational analysis of the human genome that was aimed at revealing the underlying connections between coding and non-coding DNA. We discovered a very large number of short, very well conserved blocks that we termed pyknons. Pyknons were originally discovered in the intergenic and intronic regions of the human genome and were shown to have additional copies in the 5'UTRs, CDSs and 3'UTRs of almost all known protein-coding human genes. Our studies also showed that the pyknons are connected to biological processes and to RNA interference. Notably, this work predicted the existence of piRNAs that were later reported experimentally by three different groups. We continue our work in order to better understand this very extensive layer of cell process regulation.

    Systems Biology, Medical Informatics, and Much More


    In recent years, there has been increased use of the term "Systems Biology" in the open literature. To a certain extent, this term refers to a renewed thrust in pursuing topics in the context of the century-old field of Physiology, and we could attempt to summarize it as "an effort to study and understand biological systems by bringing together theoretical, experimental and computational approaches."

    Systems biology is expressly cross-disciplinary in nature and its domain of study spans a hierarchy of organismal organization levels, with each level comprising units diverse in nature (e.g. genes, proteins, pathways, organelles, etc). Assuming a comprehensive list of these units, systems biology seeks to characterize the static and dynamic behavior of these units as well as the complex inter- and intra-level relationships in which these units participate. The eventual reward is the building of a "holistic view" of the organism under study that is expected in turn to enhance our knowledge of the organism's static and dynamic behavior.

    The Bioinformatics and Pattern Discovery group's work is connected to the wider scientific community's systems biology initiative through activities that focus on precisely these three aspects. We develop methods and tools that permit the discovery of a parts list for each level of the hierarchy (e.g. genes, pathway components, miRNA precursors, gene permutations, antimicrobial peptides, etc.), the characterization of the static and dynamic behavior of these parts (e.g. protein annotation, sites where promoters bind, horizontal gene transfers, etc.) and, finally, the discovery of the relationships in which these parts figure (e.g. co-regulated genes, targets of mature miRNAs, etc.).

    In addition to the above, in the Bioinformatics and Pattern Discovery group we have been spending a lot of resources on developing methods and tools for the problem of "Medical Informatics." Medical Informatics can be defined as the collection, maintenance and processing of medical data in order to discover meaningful and actionable bits of information that can increase our understanding of nature, and can be translated into improved health benefits for the patient. Given the group's expertise, the vast majority of our activity in this area revolves, naturally, around the mining aspect of Medical Informatics.

    Current Activities


    The group's areas of ongoing activity include the following (in no particular order):
  • metagenomics;

  • analysis of the "junk" DNA of eukaryotic organisms;

  • iron metabolism;

  • mouse embryonic stem cell differentiation;

  • cancer from the standpoint of cell process regulation;

  • algorithms for pattern and association discovery in event streams;

  • algorithm parallelization for shared-memory and message-passing architectures;

  • multiple sequence alignment;

  • tools for the analysis of gene expression data;

  • irredundant motif discovery;

  • data compression;

  • medical information systems for decision support;

  • automated protein annotation directly from sequence;

  • gene discovery in prokaryotic genomes;

  • prediction of the fine-structure of transmembrane helices;

  • horizontal gene transfer;

  • the discovery of phylogenetic domain specific signatures;

  • study of the family of the herpesviruses;

  • comparative and evolutionary genomics;

  • RNA interference in eukaryotes and viruses;

  • discovery of topological motifs in graphs;

  • rational engineering of antimicrobial peptides;

  • design of filters for unsolicited electronic mail (spam).



  • Content


    In addition to creating methods and tools, we continuously produce metadata (i.e. 'content') from public databases of biological sequences and make it available through our web site. By following the relevant link on the left of the page, you can access a manually curated database on herpesvirus 5 (cytomegalovirus), a very large repository of automatic annotations of complete genomes, and more.