Contents
- Overview
- Inputting formatted data
- Visualizing Gene Expression Data
- Data preprocessing, filtering and scaling
- Pattern discovery
- Hierarchical clustering
- Classification with Patterns and Support Vector Machines
1. Overview
Genes@Work was designed to provide a detailed and comprehensive picture of the gene expression state. This picture is presented in the form of "patterns" over a particular set of microarray samples. A pattern is a collection of genes with similar expression levels over a set of experiments. Each microarray is known a priori to belong to one of two "phenotypes" under study. One of these phenotypes is used as the "control" phenotype, and the other is termed the "target". The patterns are searched over the target phenotype.
Patterns can be further applied to selecting relevant genes and building predictive models. Genes@Work provides an integrated set of tools for exploring and building predictive model hypotheses by using supervised learning techniques such as Support Vector Machines.
The basic elements of the Genes@Work user interface are shown in the figures below. Inside the top level "Genes@Work window", functionality is split into two fundamental windows, the "Pattern Discovery window" and the "Classification window". The former provides access to the pattern discovery algorithm and pattern visualization; the latter provides access to the supervised learning algorithms.
The typical sequence of events when using Genes@Work is as follows:
- Input formatted data: data must be in one of two tab-separated formats. Two files are required, one with the gene expression data and another describing the phenotypes.
- Visualize gene array data: gene expression is visualized through a variety of color plots, scatter plots, pattern plots and hierarchical clustering dendrograms.
- Preprocess to limit the data set: in most cases, many genes are expected to show no activity for the phenotypes under study, and such genes may be filtered out of the analysis.
- Discover patterns: run the pattern discovery algorithm and tune its parameters.
- Classify based on patterns: generate predictive hypotheses and apply them to previously unseen data.
- Output plots or results: the list of genes involved in the patterns of interest is output to a file.
2. Inputting formatted data
In general, two input data files are required to use Genes@Work. The first file contains the gene expression data set, and the second file describes the corresponding phenotype for each microarray in the data set. These files can use either of the two supported formats. Both formats are tab-delimited (a single tab separates each data point) and are distinguished by the file name extension, either *.affy or *.cdna. In both formats, the expression values for a particular gene are stored in a single record (row), and the expression values for a particular microarray or sample are stored as a single column. The tab-delimited format allows data files to be prepared with commonplace spreadsheet software. The easiest way to understand the formats is to look at the examples that follow.
The *.affy format:
The first column contains descriptive text about each gene, and the second column contains a unique identifier name for each gene (sometimes called the accession). The following columns contain gene expression data, with two columns allocated to each microarray. The first column of the pair contains the expression value, a real number, and the second contains the Affymetrix® call code.
The first row, starting with the third column, contains a name for each microarray sample. The second row is ignored and is usually left blank. The following rows contain the expression data, with one row per gene.
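The *.affy layout described above can be read with a few lines of Python. This is only a sketch of the described layout, not the Genes@Work loader: the function name `parse_affy`, the sample data, and the call codes "P", "A" and "M" used in the example are illustrative assumptions.

```python
import csv
import io


def parse_affy(text):
    """Parse *.affy-style text into (sample names, per-gene records)."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header = rows[0]
    # Sample names sit above the first column of each (value, call) pair,
    # starting with the third column.
    samples = [header[i] for i in range(2, len(header), 2) if header[i]]
    genes = {}
    for row in rows[2:]:  # rows[1] is the ignored second row
        if len(row) < 2 or not row[1]:
            continue  # skip blank or incomplete rows
        values, calls = [], []
        for n in range(len(samples)):
            values.append(float(row[2 + 2 * n]))  # expression value
            calls.append(row[3 + 2 * n])          # call code
        genes[row[1]] = {"description": row[0], "values": values, "calls": calls}
    return samples, genes


# Hypothetical two-gene, two-chip data set (not taken from the manual).
EXAMPLE = (
    "\t\tchip1\t\tchip2\t\n"
    "\n"
    "hypothetical gene A\tG001\t120.5\tP\t98.0\tP\n"
    "hypothetical gene B\tG002\t15.2\tA\t22.1\tM\n"
)

samples, genes = parse_affy(EXAMPLE)
```

Keying the records by the accession column relies on the statement that this identifier is unique for each gene.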
When using the *.affy format for the gene expression data, the phenotypes must be described in a similar format, shown below.
The first column contains descriptive text about each phenotype. The second column contains a unique name for each phenotype. In the example shown, the target phenotype is named "CLL" and the control phenotype is named "Normal".
The third and subsequent columns describe the phenotype corresponding to each microarray. A value of "1" indicates that the microarray corresponds to the phenotype named in the second column. A value of "0" indicates that the microarray does not correspond to that phenotype. A value of "-1" indicates that the microarray should be ignored when considering this particular phenotype. Observe that one column is left blank between samples, so that each sample sits at the same column position as in the corresponding gene expression data file.
The first row starting with the third column contains the name of each microarray sample. This name must be identical to the name used in the gene expression data file.
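The 1 / 0 / -1 coding above can be decoded as follows. This is a hedged sketch: the function name `parse_affy_phenotypes`, the dictionary keys "in", "out" and "ignored", and the sample data are assumptions made for illustration. Only the column layout and the meaning of the three codes come from the description above.

```python
def parse_affy_phenotypes(text):
    """Map each phenotype name to the samples coded 1, 0 and -1."""
    lines = text.splitlines()
    header = lines[0].split("\t")
    # Sample names start at the third column; blank cells between them
    # mirror the call-code columns of the expression file.
    columns = [(i, name) for i, name in enumerate(header) if i >= 2 and name]
    buckets = {"1": "in", "0": "out", "-1": "ignored"}
    phenotypes = {}
    for line in lines[1:]:
        cells = line.split("\t")
        if len(cells) < 2 or not cells[1]:
            continue  # skip blank or incomplete rows
        entry = {"in": [], "out": [], "ignored": []}
        for i, sample in columns:
            code = cells[i].strip() if i < len(cells) else ""
            if code not in buckets:
                raise ValueError(f"unexpected code {code!r} for sample {sample}")
            entry[buckets[code]].append(sample)
        phenotypes[cells[1]] = entry
    return phenotypes


# Hypothetical phenotype file for three chips (names are illustrative).
EXAMPLE = (
    "\t\tchip1\t\tchip2\t\tchip3\n"
    "leukemia samples\tCLL\t1\t\t0\t\t-1\n"
    "healthy donors\tNormal\t0\t\t1\t\t-1\n"
)

phenotypes = parse_affy_phenotypes(EXAMPLE)
```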
The *.cdna format:
The first column contains a unique identifier name for each gene (sometimes called the accession), and the second column contains descriptive text about each gene. The following columns contain gene expression data, with one column allocated to each microarray.
The first row starting with the third column contains a name for each microarray sample. The following rows contain the expression data with one row per gene.
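Because the *.cdna layout is a plain tab-delimited table, it can also be generated by a script instead of a spreadsheet. The sketch below writes such a table as text. The function name `format_cdna` and the example values are hypothetical, and leaving the first two header cells empty is an assumption, since the description only specifies the header from the third column onward.

```python
def format_cdna(samples, genes):
    """Build *.cdna-style text.

    samples: list of microarray sample names
    genes:   list of (accession, description, values) tuples
    """
    # Header: two leading cells (assumed empty), then one name per sample.
    lines = ["\t".join(["", ""] + list(samples))]
    for accession, description, values in genes:
        if len(values) != len(samples):
            raise ValueError(f"{accession}: expected {len(samples)} values")
        lines.append("\t".join([accession, description] + [str(v) for v in values]))
    return "\n".join(lines) + "\n"


# Hypothetical data: two genes measured on two microarrays.
text = format_cdna(
    ["s1", "s2"],
    [("G001", "hypothetical gene A", [0.8, -1.2]),
     ("G002", "hypothetical gene B", [2.5, 0.1])],
)
```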
When using the *.cdna format for the gene expression data, the phenotypes must be described in a similar format, shown below.
The first column contains a unique name for each phenotype, and the second column contains descriptive text about each phenotype. The following columns describe the phenotypes corresponding to each microarray, using the same coding described for the *.affy format.
The first row starting with the third column contains the name of each microarray sample. This name must be identical to the name used in the gene expression data file.
The *.cdna format is more convenient when there is no need to include the Affymetrix® call information. Observe that *.cdna and *.affy are simply file formats and can be used with data of any origin, but only the *.affy format allows working with Affymetrix® calls.
Loading formatted data:
The gene expression data file must be loaded first from the "File/Open chip data ..." menu option on the Genes@Work window main menu. Then, the phenotype description file must be loaded from the "File/Open phenotype data ..." menu option.
3. Visualizing Gene Expression Data
Immediately after the data and phenotypes are loaded, it is possible to visualize the gene expression data. The Biochip window presents the expression of all genes on a single chip, encoded in color. Each little colored square corresponds to a single gene. Higher levels of expression are coded as red tones, while lower levels are coded as green tones (a blue-red color scale can also be selected from the "Plot" menu). The slider bar at the top of this diagram allows navigation through all the microarrays in the data set.
Single genes can be visualized by clicking on the corresponding little square in the Biochip window. An auxiliary window pops up with scatter plots of the gene expression and of the estimated cumulative density function over the control microarrays.
Data can be re-scaled for better visualization by using the options in the "Plot" menu. These scaling options are for visualization only and have no effect on the input to the pattern discovery algorithm or on its results. The available scalings include the logarithm of the expression and the control set normalized expression.
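To make the two scalings concrete, here is a minimal sketch for a single gene. It assumes that the control set normalized expression is the fraction of control samples whose expression is at or below the given value (an empirical cumulative distribution), which is consistent with the [0,1] range mentioned elsewhere in this document but is not confirmed as the exact Genes@Work formula. The `floor` parameter is a hypothetical guard against taking the logarithm of non-positive values.

```python
import math
from bisect import bisect_right


def log_scale(values, floor=1.0):
    """Natural log of each value, clamping values below `floor` (assumed guard)."""
    return [math.log(max(v, floor)) for v in values]


def control_normalize(values, control_values):
    """Map each value to the fraction of control values <= it (range [0, 1])."""
    ordered = sorted(control_values)
    n = len(ordered)
    return [bisect_right(ordered, v) / n for v in values]


# Hypothetical expression values of one gene.
control = [10.0, 20.0, 30.0, 40.0]
target = [5.0, 15.0, 25.0, 35.0]
normalized = control_normalize(target, control)
```

With this definition, a target value below every control value maps to 0 and a value at or above every control value maps to 1.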
4. Data preprocessing, filtering and scaling
Expression data may be pre-processed before running the pattern discovery. For example, filtering could be used to ignore genes that are not present as judged by the Affymetrix® call. Alternatively, filtering could be used to ignore genes that show insignificant change in expression across samples. Additionally, data can be scaled by taking logarithms above a given threshold.
To access the pre-processing options, go to the "Options/Filter genes ..." menu. The following "Pattern Discovery Source Properties" window should appear.
The Filtering parameters are:
Present less than: the minimum required percentage of Affymetrix® calls indicating presence of the gene over all chips. Genes that show a lower percentage of "Present" calls are excluded.
Std. deviation less than: the minimum required standard deviation over all samples. Genes that show a lower standard deviation are excluded.
Fisher Score: also known as the signal-to-noise ratio, this scores each gene by its ability to discriminate between the two phenotypes independently of other genes. The score can be used to rank all the genes and to keep a given number of the top-ranked ones. Strictly speaking, this is not a filtering method but a feature (gene) selection method, since it requires that a phenotype be defined.
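The three criteria above can be sketched for a single gene as follows. Several details are assumptions rather than documented Genes@Work behavior: the call code "P" for "Present", the use of the population standard deviation, and the Fisher score written as the squared difference of class means divided by the sum of class variances, which is one common textbook definition.

```python
from statistics import mean, pstdev, pvariance


def percent_present(calls, present_code="P"):
    """Percentage of chips whose call equals the (assumed) 'Present' code."""
    return 100.0 * sum(c == present_code for c in calls) / len(calls)


def passes_filters(values, calls, min_present_pct, min_std):
    """Keep a gene only if it meets both minimum requirements."""
    return (percent_present(calls) >= min_present_pct
            and pstdev(values) >= min_std)


def fisher_score(target_values, control_values):
    """(mean difference)^2 / (sum of variances); one common definition."""
    numerator = (mean(target_values) - mean(control_values)) ** 2
    denominator = pvariance(target_values) + pvariance(control_values)
    if denominator == 0:
        return float("inf") if numerator > 0 else 0.0
    return numerator / denominator


# Hypothetical gene measured on four chips.
keep = passes_filters([1.0, 2.0, 3.0, 4.0], ["P", "P", "A", "M"],
                      min_present_pct=40.0, min_std=1.0)
```

Ranking genes by `fisher_score` in descending order and keeping the first N corresponds to the "keep a given number of them at the top" selection described above.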
Once the filtering parameters are set, press the "Run Filter/Selection" button to apply the filtering. Genes@Work creates a new data set that includes only those genes that pass the filter. The name of the new data set reflects the filtering operations applied.
5. Pattern discovery
The Pattern Discovery window usually appears in the foreground by default, but it can always be brought to the front by selecting the "Window/Pattern Discovery" menu. The diagram on the left represents the gene expression intensities for one microarray. Red tones represent higher intensities and green tones represent lower ones, relative to the total gene expression average across all microarrays. On the right of the Pattern Discovery window are the Pattern Discovery parameters and the Pattern Table.
Before starting pattern discovery, both a gene expression data file and a phenotype description file must be loaded. The data formats and the loading process are described above.
The Pattern Discovery parameters are:
Phenotype: indicates the group of microarray samples over which patterns are searched. Each phenotype is identified by a name in the phenotype description file. Select the desired phenotype name to study from this drop-down combo box.
Delta: indicates the maximum deviation in normalized expression units for a gene to be included in a pattern. This constraint is also referred to as the "Delta condition". Since normalized gene expressions are in the interval [0,1], Delta must be a value in the interval [0,1]. Typical values go from 0.05 to 0.20.
Min Support: indicates a search strategy for patterns. All patterns are searched that have at least the given minimum support. The support of a pattern is defined as the number of microarray samples (in the phenotype) over which all genes in the pattern satisfy the Delta condition.
Max Pattern Count: indicates a search strategy for patterns alternative to Min Support. This strategy seeks only the patterns with the highest support and up to the number of patterns desired. It is recommended to start running with this option, instead of Min Support, because in most cases, if the minimum support value used is too small, too many patterns are discovered and the program may run for too long or use excessive memory.
Threshold: determines whether a pattern is reported, based on how often such a pattern would occur by random chance. Specifically, the threshold is a limit on the number of patterns of a given size that could be expected from random gene data, assuming uniform, independent and identically distributed normalized expressions. The expected number depends on the pattern's support and number of genes. If the number of expected random patterns of the same size as a given pattern exceeds this threshold, that pattern is not reported. Naturally, large patterns, with large support or many genes or both, are less likely to be discarded by this threshold.
Genes: indicates the minimum number of genes the reported patterns must contain.
Independent Patterns: when this check box is enabled, the program lists only the patterns that are not conditionally dependent on each other; that is, the number of patterns expected to appear at random because of the presence of another pattern is less than the Threshold parameter.
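The Delta condition and the notion of support can be illustrated with a small check. The sketch assumes that "maximum deviation" means that, for each gene, the spread (maximum minus minimum) of its normalized expression over the supporting samples is at most Delta. That reading is an assumption, and the brute-force search shown here is only for illustration on tiny inputs, not the Genes@Work discovery algorithm.

```python
from itertools import combinations


def satisfies_delta(normalized, genes, samples, delta):
    """True if every gene's values over `samples` span at most `delta`."""
    for gene in genes:
        values = [normalized[gene][s] for s in samples]
        if max(values) - min(values) > delta:
            return False
    return True


def max_support(normalized, genes, samples, delta):
    """Largest sample subset on which all genes satisfy the Delta condition.

    Exhaustive search over subsets: exponential cost, tiny inputs only.
    """
    for size in range(len(samples), 0, -1):
        for subset in combinations(samples, size):
            if satisfies_delta(normalized, genes, subset, delta):
                return size, list(subset)
    return 0, []


# Hypothetical normalized expressions in [0, 1] for two genes, three samples.
normalized = {
    "g1": {"s1": 0.10, "s2": 0.15, "s3": 0.80},
    "g2": {"s1": 0.50, "s2": 0.52, "s3": 0.55},
}
support, supporting_samples = max_support(
    normalized, ["g1", "g2"], ["s1", "s2", "s3"], delta=0.1)
```

In this example, sample s3 breaks the condition for g1, so the two genes form a pattern with support 2 over s1 and s2.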
To start the Pattern Discovery, hit the "Run PD" button (the button disables itself while the program is running). When results are ready, the "Run PD" button enables itself again. The patterns found are listed in a table to the right, the "Pattern Table", where there is one row per pattern. The patterns are numbered and characterized by their support, number of genes and P-value. Patterns can be ordered by support, number of genes or P-value by using the "Sort" menu in the main Genes@Work window.
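The Threshold parameter described above compares each pattern against what random data would produce. A rough way to estimate that number under the stated null model (uniform, independent normalized expressions) is sketched below. It uses the standard result that k independent Uniform[0,1] values all fall within a window of width delta with probability k·delta^(k-1) − (k-1)·delta^k, and it counts every gene set and sample set of the given size. This is a textbook calculation offered for intuition; the exact statistic computed by Genes@Work may differ, and the function names are hypothetical.

```python
from math import comb


def prob_gene_within_delta(k, delta):
    """P(range of k iid Uniform[0,1] values <= delta), for 0 <= delta <= 1."""
    return k * delta ** (k - 1) - (k - 1) * delta ** k


def expected_random_patterns(n_genes, n_samples, j, k, delta):
    """Expected count of (j genes, k samples) sets meeting the Delta condition."""
    p = prob_gene_within_delta(k, delta)
    # Genes are independent under the null model, so probabilities multiply.
    return comb(n_genes, j) * comb(n_samples, k) * p ** j


# Hypothetical sizes: 1000 genes, 20 target samples, patterns of
# 5 genes with support 10, Delta = 0.1.
expected = expected_random_patterns(1000, 20, j=5, k=10, delta=0.1)
```

The expectation shrinks rapidly as either the support k or the number of genes j grows, which matches the remark that large patterns are less likely to be discarded by the threshold.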
When patterns are selected from the table, and the "BioChip" tab to the left is active, the genes in the pattern are highlighted in the microarray diagram. However, a more detailed view of each pattern can be obtained by switching to the "Eisen Plot" tab.
Since patterns are not searched on the original gene expression values, but on normalized gene expression values, it is best to visualize the patterns by using the "Plot/Control set normalized [0,1]" menu option from the main Genes@Work window.
The Eisen plot of a pattern shows the genes in the pattern as rows and the samples as columns. The samples are divided into three groups, separated by vertical yellow lines and shown from left to right: samples in the phenotype where the pattern appears, samples in the phenotype where the pattern does not appear, and samples used as the control. When a single pattern is visualized and the "Plot/Control set normalized [0,1]" menu option is selected, it should be visually clear that the gene expression, as given by the color scale, satisfies the constraint specified by the Delta parameter for all genes in the pattern over the samples where the pattern appears.
There are two Eisen plot diagrams for patterns. The top one plots only the pattern currently selected in the Pattern Table on the right. The second one can plot a set of patterns (a proper subset or all of them). Patterns can be added to and removed from the second diagram by using the "Patterns" menu. The added patterns are combined by taking the union of all pattern genes and all pattern phenotype samples. This produces a new single "super-pattern" that in general violates the Delta condition, but may be useful to quickly obtain a list of the relevant genes in the phenotype under study. To save all the genes that appear in the patterns of interest into a file, use the "Pattern/Save Patterns ..." menu option. It is also possible to save all the patterns in the Pattern Table with the "Options/Save Patterns ..." option from the Genes@Work window main menu.
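The "super-pattern" construction is a plain set union, as the sketch below shows. The dictionary layout with "genes" and "samples" keys and the example identifiers are assumptions made for illustration.

```python
def merge_patterns(patterns):
    """Union of the genes and of the supporting samples of several patterns."""
    genes, samples = set(), set()
    for pattern in patterns:
        genes |= set(pattern["genes"])
        samples |= set(pattern["samples"])
    # Sorted lists give a stable order for display or for writing to a file.
    return {"genes": sorted(genes), "samples": sorted(samples)}


# Two hypothetical patterns that share one gene and one sample.
super_pattern = merge_patterns([
    {"genes": ["G001", "G002"], "samples": ["s1", "s2"]},
    {"genes": ["G002", "G003"], "samples": ["s2", "s3"]},
])
```

Because a gene contributed by one pattern need not satisfy the Delta condition over the samples contributed by another, the merged result is generally not itself a valid pattern.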
6. Hierarchical clustering
Once a group of genes has been identified by merging some patterns, it is possible to visualize these genes ordered by similarity through a "dendrogram" diagram. To access this option, select the "Patterns/Clustering ..." menu. A clustering window should appear.
To generate the dendrograms, hit the "Run HC" button.
Hierarchical clustering seeks to group genes or microarrays based on the similarity of their gene expression profiles. The resulting diagram is similar to the pattern diagram, with the genes displayed as rows and the microarrays as columns. The ordering of the genes and of the microarrays is given by their respective dendrograms. The dendrograms are built by the hierarchical clustering algorithm in a bottom-up fashion, where the closest (most similar) groups are joined at every step. The results of hierarchical clustering depend on how the data is normalized, and four normalization options are provided. The gene expression colors in the diagram correspond to the chosen normalization.
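The bottom-up procedure can be sketched in a few lines. The distance measure and the rule for comparing groups are not specified in this document, so the sketch assumes Euclidean distance and average linkage. The function names are hypothetical, and the quadratic search is kept simple rather than efficient.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(profiles, c1, c2):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(profiles[i], profiles[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hierarchical_clustering(profiles):
    """Return (leaf_order, merges); merges lists (cluster_a, cluster_b, distance)."""
    if not profiles:
        return [], []
    clusters = [[i] for i in range(len(profiles))]  # start with singletons
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(profiles, clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters[0], merges


# Hypothetical expression profiles of four genes over two samples.
order, merges = hierarchical_clustering(
    [[0.0, 0.0], [5.0, 5.0], [0.1, 0.0], [5.0, 5.5]])
```

The returned leaf order places similar profiles next to each other, which is how the rows (or columns) of the clustered diagram are arranged, and the recorded merge distances correspond to the heights at which branches of a dendrogram join.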
To run clustering on all the genes, use the "Options/Clustering ..." menu from the Genes@Work main window.
7. Classification with Patterns and Support Vector Machines
Supervised learning features for classification are provided in the "Classification" window. To bring this window to the foreground, select the "Window/Classification" menu from the Genes@Work main window.
The user interface for Classification is designed to be flexible and to intuitively describe the process of supervised machine learning. The user builds a small graph of "Task" blocks depending on the type of experiment to be performed. Each task block may have little input "Ports" on the left side and little output "Ports" on the right side. The ports are labeled with letters like "D", "h", "L". An input port can be connected to an output port only if they have the same label letter. The label "D" stands for "Data set" and means a data set is being passed between blocks, "h" stands for "hypothesis" and "L" for "Learning machine".
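The port-matching rule can be modeled with a small data structure. This is an illustrative sketch, not the Genes@Work implementation: the `TaskBlock` class and `can_connect` function are hypothetical, and the specific ports given to each block are inferred from the train/test example in this section rather than documented.

```python
from dataclasses import dataclass

# Port label meanings as given in the text.
PORT_LABELS = {"D": "Data set", "h": "hypothesis", "L": "Learning machine"}


@dataclass
class TaskBlock:
    name: str
    inputs: tuple = ()   # labels of the input ports (left side)
    outputs: tuple = ()  # labels of the output ports (right side)


def can_connect(source, out_label, target, in_label):
    """An output port may feed an input port only if the labels match."""
    return (out_label in source.outputs
            and in_label in target.inputs
            and out_label == in_label)


# Assumed port layouts, inferred from the described train/test experiment.
data_source = TaskBlock("Data Source", outputs=("D",))
learner = TaskBlock("Learning Machine", inputs=("D",), outputs=("h",))
predictor = TaskBlock("Predictor", inputs=("D", "h"))

ok = can_connect(data_source, "D", learner, "D")   # data set into learner
bad = can_connect(learner, "h", predictor, "D")    # hypothesis into a data port
```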
The blocks are created by selecting the menus under the "Tasks" menu bar in the "Classification" window. The blocks are connected by dragging with the mouse from an output port to the desired input port.
For example, the following graph describes an experiment where a train data set and a test data set are available. The train data set is fed from a "Data Source" block to a "Learning Machine" block, which generates a hypothesis that can be applied to previously unseen test data. The test data is fed from a second "Data Source" to a "Predictor" block, together with the hypothesis.
In the next example graph, a cross validation experiment is described. A single data source is fed into the "Cross Validation" task, which breaks it into successive disjoint training and test data sets. The "Cross Validation" block uses the "Learning Machine" block to generate a hypothesis on each training subset and to apply it to the corresponding test subset.
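The splitting into disjoint train and test sets can be sketched as a k-fold scheme. The document does not say which splitting scheme the "Cross Validation" block uses, so k-fold with a round-robin assignment of samples to folds is an assumption, and the function name is hypothetical.

```python
def k_fold_splits(samples, k):
    """Yield (train, test) lists; each sample is in exactly one test set."""
    if not 2 <= k <= len(samples):
        raise ValueError("k must be between 2 and the number of samples")
    # Round-robin assignment: sample i goes to fold i mod k.
    folds = [samples[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test


# Hypothetical sample names.
splits = list(k_fold_splits(["a", "b", "c", "d", "e", "f"], k=3))
```

For each split, a learning machine would be trained on `train` and the resulting hypothesis applied to `test`, so every sample is predicted exactly once by a hypothesis that never saw it during training.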
After the required task blocks are created, they must be configured. This is done by double-clicking on a block, after which a "Properties" window should appear. The parameters shown in each "Properties" window depend on the type of block.
Data Source Properties:
Use the "Properties for Data Source" window to configure a "Data Source" task block. Hit the "Load data set file ..." button to load a new data set file, or select one of the already loaded data sets from the drop-down combo box. Whenever the data set is changed, a phenotype description file must be reloaded. The current phenotype, among those defined in the phenotype description file, may be changed from the "Select phenotype" drop-down combo box.
Notice that additional tabs are available for filtering, scaling and feature selection; see the preprocessing options described previously.
Learning Machine Properties:
Use the "Properties for Learning Machine" window to configure a "Learning Machine" task block. Select the desired learning algorithm from the "Type" drop-down combo box. "SVM" stands for Support Vector Machines, "PD" stands for Pattern Discovery based classification and "PD/SVM" is an experimental hybrid between PD and SVM. The "Setup" tab shows parameters specific to the type of learning algorithm selected.
Hit the "Run Training" button to apply the learning algorithm to the current input data. When the training ends, the "Results" tab is updated to show the prediction performance of the generated hypothesis on the training set. When the learning machine is run under the control of a "Cross Validation" task block, make sure the desired learning machine parameters are committed by hitting the "Save Settings" button before starting the cross validation run.
Hypothesis Properties:
Use the "Properties for Hypothesis" window to configure a "Hypothesis" task block. The purpose of this window is to support hypothesis persistence, that is, saving promising hypotheses to disk so that they may later be reused in voting mechanisms with other hypotheses. This is work in progress and was incomplete at the time of this writing.
Predictor Properties:
Use the "Properties for Predictor" window to configure a "Predictor" task block.
Cross Validation Properties:
Use the "Properties for Cross Validation" window to configure a "Cross Validation" task block.
Written by J. Lepre. Please send comments to genatwrk@us.ibm.com
Affymetrix® is a U.S. trademark owned by Affymetrix, Inc.