Department of Molecular Biophysics and Biochemistry, Yale University, PO Box 208114, New Haven, CT 06520, USA
*To whom correspondence should be addressed. Tel: +1 203 432 6105; Fax: +1 360 838 7861; Email: mark.gerstein@yale.edu
Received February 6, 2002; Revised and Accepted August 8, 2002
ABSTRACT |
---|
TOP ABSTRACT INTRODUCTION FEATURE OVERVIEW: TREEVIEWER FEATURE OVERVIEW: PATHWAYPAINTER MODULES GENE EXPLORING: PRACTICAL... CONCLUSION: GENECENSUS, A... REFERENCES |
---|
INTRODUCTION |
---|
In general, it is relatively difficult to integrate disparate information sources into one comprehensive database; it is difficult to determine which data sets will be useful under different circumstances. We present some useful examples in the Gene Exploring section on how one can extract biologically relevant and novel information from our database. Nevertheless, these demonstrations, using specific features within GeneCensus, do not provide a reason for the inclusion or exclusion of any specific features in the database.
FEATURE OVERVIEW: TREEVIEWER |
---|
Ribosome
These trees are built through comparing the similarity of the ribosomal RNA. This traditional method (12) for phylogenetic analysis is based on the small subunit ribosomal RNA (SSU rRNA). For comparison, the trees based on the large subunit (LSU) are also provided.
Folds
These trees are built based on the presence or absence of folds in different organisms, as determined by Hegyi et al. (13). In addition, we compare trees based upon the subdivision of folds into classes (all alpha, all beta, alpha + beta and alpha/beta). A further comparison in this category differentiates between the distance-based and parsimony techniques for tree building.
Superfamilies
Superfamilies are less broad structural groupings than folds, and because of their greater number, they have been found to be more differentiating, producing trees similar to the traditional phylogeny. The data were collected using a similar approach to Hegyi and Gerstein (14).
COGs
We also compare the genomes by the occurrence of orthologous genes, as defined by COGs, Clusters of Orthologous Groups (1). Trees were built for the three major types of COGs (i.e. metabolism, cellular processes, information storage and processing), as well as for the smaller functional categories. We represent these categories using single letters in the user interface.
Composition
These trees are built on the simple composition of amino acids and dinucleotides. The trees marked raw are based on the absolute numbers of amino acids and dinucleotides; these values form a vector for each genome, and distances are calculated between the vectors. For the other trees, the counts are normalized by the total number, producing percentages, which were used to generate a distance matrix for tree construction.
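As an illustration of the composition trees described above, the following minimal sketch builds an amino acid composition vector per genome (raw counts or percentages) and a pairwise distance matrix. The Euclidean metric, the function names and the toy sequences are assumptions made for this example; the text does not state which distance measure was used.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, fixed order


def composition_vector(sequences, normalize=True):
    """Amino acid composition of a set of protein sequences.

    normalize=False gives absolute counts (the "raw" trees);
    normalize=True gives percentages of the total residue count.
    """
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if normalize and total > 0:
        return [100.0 * counts[a] / total for a in AMINO_ACIDS]
    return [float(counts[a]) for a in AMINO_ACIDS]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(genomes, normalize=True):
    """Return sorted genome names and the symmetric distance matrix."""
    names = sorted(genomes)
    vectors = {n: composition_vector(genomes[n], normalize) for n in names}
    matrix = [[euclidean(vectors[a], vectors[b]) for b in names] for a in names]
    return names, matrix


if __name__ == "__main__":
    toy = {"genome1": ["AAAL"], "genome2": ["AALL"]}  # hypothetical data
    names, matrix = distance_matrix(toy)
    print(names, matrix)
```

The resulting matrix could then be passed to any distance-based tree-building program; dinucleotide composition would follow the same pattern with a 16-element vector.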
ORFs
This set of trees is composed of trees built on the sequence similarity of homologous genes. The genes chosen for this comparison were present in the genomes only once; thus paralogous genes were not a factor.
Enzymes
The sequence similarity of individual enzymes in the three central metabolic pathways (the TCA cycle, glycolysis and the pentose phosphate pathway) was used to construct these trees.
FEATURE OVERVIEW: PATHWAYPAINTER |
---|
Flux analysis
Flux, a measure of the rate at which metabolites are processed to become output, is calculated at a steady state (15). Determination of flux provides critical information for rational pathway modification and metabolic engineering (16,17). While there are many published maps of pathways illustrating the processing of basic metabolites, they provide little in terms of describing pathway fluxes under diverse conditions.
We obtained raw absolute flux values for three organisms (S.cerevisiae, B.subtilis, E.coli) (18–20); these are reported as absolute fluxes on the website. For two organisms (H.influenzae and H.pylori), we calculated theoretical relative flux values using stoichiometric analysis. We describe this calculation here. Our first step involves reconstructing the map of the central metabolic pathway in the two organisms using information from the KEGG metabolic database (21). It is known that H.influenzae (22) and H.pylori (23) have incomplete TCA cycles. We decomposed the reconstructed pathway into elementary modes using the METATOOL software (24). Each elementary mode consists of a minimal set of enzymes that could operate at steady state with all irreversible reactions proceeding in the appropriate direction, further reduced to omit extraneous metabolites not necessary for the net reaction (25). One should note that more than one elementary mode can connect two chemical species. To choose the elementary mode that represents the most efficient route of chemical conversion, we optimized a combined objective function that maximizes ATP production and minimizes glucose use. We obtained the ratios of the end products of glucose metabolism from earlier studies. For example, H.influenzae produces succinate and acetate as the end products of glucose metabolism in the ratio of 4.3:1 (22). These ratios act as constraints in the optimization process. We used the LINDO software to carry out this optimization.
Finally, we merged the results for all five organisms and normalized the flux values to make them comparable. Normalization of values is done with respect to glucose intake (i.e. the entry of glucose in the metabolic pathway is considered to represent 100% relative flux). We computed the relative flux that inputs into various pathway routes as a fraction of this initial amount. Therefore, even though the actual pathway flux can vary from one organism to another, normalized fluxes are comparable in a relative sense. We report these final normalized values on the website.
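The normalization to glucose intake described above can be sketched as follows. This is a minimal illustration only: the function name, the `glucose_uptake` key and the numbers in the usage example are hypothetical and are not taken from the GeneCensus data.

```python
def normalize_fluxes(absolute_fluxes, glucose_key="glucose_uptake"):
    """Express every flux as a percentage of the glucose intake flux.

    absolute_fluxes: dict mapping a reaction label to its absolute flux.
    glucose_key: label of the glucose entry reaction (assumed name).
    """
    uptake = absolute_fluxes[glucose_key]
    if uptake <= 0:
        raise ValueError("glucose intake flux must be positive")
    return {rxn: 100.0 * value / uptake for rxn, value in absolute_fluxes.items()}


if __name__ == "__main__":
    # Hypothetical absolute fluxes in arbitrary units.
    example = {"glucose_uptake": 2.0, "glycolysis_step": 1.0, "ppp_step": 0.5}
    print(normalize_fluxes(example))
```

Because each organism is scaled by its own glucose intake, the resulting values can be compared across organisms even when the absolute fluxes were measured in different units.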
Expression levels
PathwayPainter encompasses information from a variety of gene expression experiments, corresponding to data collected with both DNA and cDNA arrays. In particular, we have collected various microarray data sets from the Stanford Microarray Database (26) and extracted expression data for each enzyme in the following three pathways: (i) TCA cycle, (ii) glycolysis and (iii) pentose phosphate. While the determination of gene expression levels using high-throughput experimentation is a growing field, we present here mainly data sets derived from experiments on yeast, the most common organism for expression analysis to date. The dynamic nature of GeneCensus will allow us to provide additional microbial expression data sets as they become available.
In the current version of GeneCensus we focus on six individual experiments: (i) cell-cycle experimentation of yeast cells synchronized by alpha factor arrest (27); (ii) a second cell-cycle experiment where yeast cells were similarly synchronized via the arrest of a cdc15 temperature-sensitive mutant (27); (iii) a yeast diauxic shift experiment measuring the temporal program of gene expression following a metabolic shift from fermentation to respiration (28); (iv) an assay measuring the change in yeast expression during sporulation (29); (v) an experiment capturing the cellular response of E.coli following exposure to UV radiation (30); and (vi) a profile of gene expression in the germ line of Caenorhabditis elegans (31).
Enzyme View
The Enzyme View of PathwayPainter details the absolute and normalized flux levels for the enzyme in each genome. Additionally, it shows the sequence similarity, as percentage identity (PID), between the organisms for the specific enzyme whose flux is being measured. Below the PID table, another table outlines the expression values of that specific enzyme in yeast and E.coli under multiple conditions. Finally, links are provided to the TreeViewer, where the user can view trees based on that particular enzyme.
In relation to gene expression, for each enzyme, we report the following. (i) Raw unscaled values R as available from the various sites as either copies per cell, log2(ratio of mRNA levels), or normalized transcript level divided by mean value, depending on the specific data set. We denote by P(i, t) the expression of gene i at time t. The calculated ratios are thus R(i, t) = P(i, t)/P(i, r), in which P(i, r) is the expression in the reference state. (ii) Multi-experiment scaled expression value M derived from multiple experiments (32), which provides a standard of comparison for expression data. This is derived from scaling together various microarray and SAGE data sets and is on an absolute scale in copies per cell (32). (iii) Average expression ratio change C over the length of the profile, which is calculated by $C = \langle R(i, t) \rangle_t$, the mean of R(i, t) over the time points t. This measures the degree of variability in expression in a particular experiment for a given enzyme. (iv) Expression ratio fluctuation E, calculated as the standard deviation of the expression ratios over the profile, i.e. $E = \sqrt{\langle (R(i, t) - C)^2 \rangle_t}$. Note that enzymes consistently expressed under most conditions will show minimal standard deviations that are closely correlated between experiments.
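The quantities R, C and E defined above can be computed for a single enzyme profile as in the sketch below. The function names and sample numbers are hypothetical, and the use of the population standard deviation (dividing by the number of time points) is an assumption; the text only says that a standard deviation is used. For data sets already reported as ratios or log ratios, the first step would be skipped.

```python
from statistics import mean, pstdev


def expression_ratios(profile, reference):
    """R(i, t) = P(i, t) / P(i, r) for one gene i over all time points t."""
    if reference == 0:
        raise ValueError("reference expression must be non-zero")
    return [p / reference for p in profile]


def average_change(ratios):
    """C: mean of the expression ratios over the profile."""
    return mean(ratios)


def fluctuation(ratios):
    """E: standard deviation of the expression ratios (population form assumed)."""
    return pstdev(ratios)


if __name__ == "__main__":
    # Hypothetical expression levels P(i, t) and reference level P(i, r).
    ratios = expression_ratios([2.0, 4.0, 6.0], reference=2.0)
    print(ratios, average_change(ratios), fluctuation(ratios))
```

An enzyme can have a small C and a large E when its ratios swing above and below the mean, which is why both values are reported.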
MODULES |
---|
ORFViewer
The ORFViewer module provides various resources related to a given ORF. For example, protein structural annotations are graphically represented on a plot for every ORF. These include: (i) PDB PSI-BLAST matches; (ii) regions of low complexity [identified with the SEG program using standard parameters K(1) = 3.4, K(2) = 3.75 and a window size of 45 residues (33)]; (iii) transmembrane segments [defined using the GES hydrophobicity scale (34)]; (iv) linker regions (e.g. low complexity regions); and (v) uncharacterized regions.
We also provide organism-specific information. For example, additional information available for yeast includes gene expression data, subcellular localization (35), protein structural comparisons (36), previously unannotated genes (37), transposon tagging, gene disruptions (38) and data from protein chip experiments (39). Organism-specific information for Mycoplasma genitalium includes a breakdown of structural characterizations of the genome and related NESGC construct database entries (40,41).
The proliferation of multiple interchangeable nomenclature schemes remains an outstanding problem in compiling ORF annotation. For example, YJR155W, AADA_YEAST, AAD10, J2245, P47182 and CAA89688.1 all refer to the identical ORF in the yeast genome. As a solution, we include Smartlink in GeneCensus; Smartlink is a translator that integrates all the disparate systems and maps one nomenclature to another.
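The idea of a nomenclature translator can be illustrated with a small lookup table in which every alias of an ORF points to the same record. This is only a sketch of the general approach and not a description of how Smartlink is implemented; the class name, method names and scheme labels (`systematic`, `gene`, `accession`) are hypothetical, while the identifiers are those quoted in the text.

```python
class AliasTable:
    """Toy identifier translator: all aliases of an ORF share one record."""

    def __init__(self):
        self._index = {}  # upper-cased identifier -> {scheme: identifier}

    def add_record(self, **ids):
        """Register one ORF, given as scheme=identifier keyword pairs."""
        record = dict(ids)
        for identifier in record.values():
            self._index[identifier.upper()] = record

    def translate(self, identifier, scheme):
        """Return the name of `identifier` in `scheme`, or None if unknown."""
        record = self._index.get(identifier.upper())
        if record is None:
            return None
        return record.get(scheme)


if __name__ == "__main__":
    table = AliasTable()
    # Scheme labels are illustrative only.
    table.add_record(systematic="YJR155W", gene="AAD10", accession="P47182")
    print(table.translate("AAD10", "systematic"))
```

Indexing every alias to a shared record means a lookup can start from any naming scheme, at the cost of assuming that identifiers do not collide across schemes.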
OrganismViewer
The OrganismViewer integrates varied information about an organism. For each organism, we provide an amino acid composition viewer (see below for more details), links to the source data, and additional links to external resources. These pages were designed as an open system, allowing users to add relevant resource links. We will continue to integrate more studies and resources for each genome.
The OrganismViewer not only presents the research specific to GeneCensus, but also integrates research from other institutions and acts as a central resource and link manager for each genome. For example, the C.elegans page provides comprehensive analysis of pseudogenes and protein folds; alternatively, the S.cerevisiae page includes studies of the relationship between fold and function, microarray expression data analysis (42), clustering of phenotype patterns (38), subcellular localization of proteins (35), composition of individual ORF features (43) and comparison of the yeast and worm genomes in terms of folds (44). We have also included a statistical analysis of thermophilic and mesophilic genomes (45).
CompositionViewer
In early cryptography, it was discovered that one could break ciphers by analyzing the frequency of letters/symbols and their combinations within a coded text. When the frequency of a letter or letter combination differs from the expected frequency of a randomly inserted letter, it can be interpreted as significant and possibly deciphered. Similarly, a protein sequence can be decoded, and significant amino acid combinations discerned, by examining the entities whose occurrence is higher than it would be if they were inserted randomly (i.e. higher than the expected occurrence of a given amino acid within a sequence). These combinations may be important for concepts such as binding or protein structure (46). Additionally, determining the individual amino acid composition of the ORFs in thermophilic organisms is essential for understanding their stability in extreme environments.
The CompositionViewer, another module within GeneCensus, provides composition information about genomes similar to that used to build and compare trees in the TreeViewer; the CompositionViewer is a valuable tool for analyzing the aforementioned amino acid pair patterns in a genome. Included within this component are (i) calculations of amino acid composition, (ii) secondary structure, and (iii) a tool for dynamically calculating amino acid composition pairs. Pairs, in terms of amino acid composition, are defined as combinations of an amino acid residue with another residue at separation k, i.e. at positions i and i + k. For instance, AL3 corresponds to an alanine residue and a leucine residue at i, i + 3 (AxxL). The significance of over- and under-represented pairs is calculated by comparing the observed occurrence of the pair in a database with a random expectation distribution. The random distribution is calculated as the average over all possible internal permutations of all sequences of the database. The significance corresponds to the probability of observing the same or a larger difference between observed and expected occurrences of a pair in random sequences.
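The pair statistics described above can be approximated with the following sketch. It counts a pair such as AL3 and estimates the random expectation by shuffling residues within each sequence. Note the assumption: the text describes an average over all internal permutations, whereas this example samples a fixed number of random shuffles (a Monte Carlo approximation). Function names and parameters are hypothetical.

```python
import random


def count_pair(sequences, first, second, k):
    """Count `first` at position i together with `second` at position i + k."""
    total = 0
    for seq in sequences:
        for i in range(len(seq) - k):
            if seq[i] == first and seq[i + k] == second:
                total += 1
    return total


def pair_significance(sequences, first, second, k, n_shuffles=200, seed=0):
    """Return (observed, expected, p_value) for one residue pair.

    expected: mean count over shuffled copies of the sequences.
    p_value: fraction of shuffles whose deviation from the expectation is
             at least as large as the observed deviation.
    """
    rng = random.Random(seed)
    observed = count_pair(sequences, first, second, k)
    shuffled_counts = []
    for _ in range(n_shuffles):
        shuffled = []
        for seq in sequences:
            letters = list(seq)
            rng.shuffle(letters)  # internal permutation of one sequence
            shuffled.append("".join(letters))
        shuffled_counts.append(count_pair(shuffled, first, second, k))
    expected = sum(shuffled_counts) / n_shuffles
    deviation = abs(observed - expected)
    extreme = sum(1 for c in shuffled_counts if abs(c - expected) >= deviation)
    return observed, expected, extreme / n_shuffles


if __name__ == "__main__":
    toy_database = ["AGGLAGGLAGGL", "AAALLLGG"]  # hypothetical sequences
    print(pair_significance(toy_database, "A", "L", 3))
```

Shuffling within each sequence preserves its length and composition, so any excess of a pair over the shuffled expectation reflects positional preference rather than residue abundance.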
For membrane data we used the TMSTAT method (43) to analyze frequently occurring combinations of residues of transmembrane domains. The server returns the most significant over- and under-represented singlets and pairs in each database.
This viewer is accessed either through the organism viewer or directly at http://bioinfo.mbb.yale.edu/genome/tmstat/comp.cgi.
GENE EXPLORING: PRACTICAL RESULTS USING GENECENSUS |
---|
In Figure 4, we illustrate the scientific conclusions one can derive from PathwayPainter through comparing the variation in expression, flux and PID of enzymes over many different experiments and organisms. We present some of our findings here. Additionally, we show how these data may be utilized to determine which enzymes can best be used as internal controls for normalizing microarrays. The data presented in the figure include: (i) the average expression change C, which is the average of all the expression ratios from all the time points between two conditions; (ii) expression fluctuation E, which represents the standard deviation of the expression ratios from all experimental time points; (iii) flux variation F, which indicates the standard deviation of flux from the different organisms; and (iv) sequence similarity S, which is the average percentage similarity of orthologous pairs. Notably, the experiments can be compared not only with the average expression change C, but also with expression fluctuation E. Both values are important for a clear understanding of the expression ratio profiles, since enzymes may have very different average expression change and expression fluctuation values. For example, in the cdc15-arrested cell-cycle experiment, both citrate synthase (4.1.3.7) and glycolaldehyde-transferase (2.2.1.1) exhibit low average expression change but very high expression fluctuation.
Experiment comparison. In Figure 4, PathwayPainter is used to compare six expression experiments. There are some notable global differences between these experiments. Both the E.coli and the worm expression sets show higher average expression change C, reflecting the changes in worm development and the effects of UV on E.coli. Conversely, the cell-cycle experiments show smaller average expression changes, reflecting the more constant state of housekeeping genes (e.g. metabolic pathway enzymes) within the cell. As expected in the diauxic shift experiment, the TCA cycle enzymes have high values for average expression changes C and expression fluctuations E; this substantiates previous observations that the change in medium for yeast increases the expression of TCA enzymes (28). In the yeast sporulation experiments, the positive and negative values in the average expression change capture the up- and down-regulation of different enzymes in the system.
Subsystem analysis. Different pathways or subsystems of central metabolism exhibit specific trends and characteristics. Figure 4 highlights one of these characteristics by coloring the arrows according to their flux variation F, with the highest values depicted in blue, median values in green and the lowest in red. From the schematic, as well as the table, one can see that the TCA cycle has all of the highest flux variation F, indicating that the TCA cycle changes the most in metabolite processing. Biologically, this confirms the notion that the TCA cycle functions very differently depending on environmental factors, such as aerobic and anaerobic conditions. The flux variation for glycolytic enzymes is near the average, except for triosephosphate isomerase (5.3.1.1), which provides a shunt in the pathway. Similarly, the expression fluctuation, E, also correlates well with the division into subsystems; most of the highest values (>0.47) belong to the TCA cycle, the lowest values (<0.38) belong to the pentose phosphate pathway and the middle values belong to glycolysis. Expression fluctuation and flux variation cluster similarly to the pathway divisions and correlate well with each other.
Enzyme comparisons. PathwayPainter also allows for multiple comparisons of specific enzymes. Expression variation is particularly evident at branch points and control points (where reactions are essentially irreversible). These points represent those enzymes with the greatest expression fluctuations (both C and E). For example, isocitrate dehydrogenase (1.1.1.42) and citrate synthase (4.1.3.7), which have the two highest average expression changes, are both important control points in central metabolism. We performed additional analyses on other control point proteins and found that their average expression changes were very high as well. For example, phosphofructokinase (2.7.1.11), hexokinase (2.7.1.2) and pyruvate kinase (2.7.1.40) have values of 1.1, 0.6 and 0.6, respectively, representing an almost 100% increase compared to many of the other enzymes. In situations where the cell is perturbed, the response is reflected in the change of expression in the control point enzymes; in the worm development, E.coli response to UV, and yeast sporulation experiments, an increase in average expression change C at branch points (4.1.3.7, 1.1.1.37) is observed.
By comparing expression variability of an enzyme across multiple data sets, we have shown that many of the important metabolic control points are most variable in terms of expression. This variability indicates the intricate regulation of these enzymes.
Percentage identity. We found that sequence similarity S does not correlate strongly with either flux variation F or average expression change C. We conclude that sequence identity is not a predictor of expression or flux values.
Normalization. The expression-variability data in PathwayPainter can also be used to assess whether a group of genes can be applied to the normalization of microarray data between experiments in different organisms and under different experimental conditions. Presently, there are many efforts underway to determine robust methods to normalize microarray data through internal controls (47). For example, attempts have been made to establish normalization based on housekeeping genes, that is, those genes thought to be consistently expressed in the vast majority of conditions. We propose that the detailed study of the expression variability in metabolic enzymes shown by the PathwayPainter module can be useful in determining which enzymes could potentially be used as constants in microarray normalization approaches.
TreeViewer illustration
Informative pan-genomic analyses can be performed with the GeneCensus TreeViewer module. For example, the traditional ribosomal tree groups Gram-positive bacteria into one homogeneous group. However, further analysis using other types of trees subdivides them in informative ways. In particular, a number of the trees available in GeneCensus show that the bacteria B.subtilis and Mycobacterium tuberculosis tend to cluster independently of the other Gram-positive bacteria M.genitalium and Mycoplasma pneumoniae. This is a product of the radically differing sizes and gene compositions spanning the Gram-positive class. Further analysis of Gram-positive bacteria using the composition module shows that these two organisms have a high percentage of guanine as opposed to the other Gram-positive bacteria. Thus, while we may link them together due to the high peptidoglycan content in their cell walls (i.e. resulting in a Gram-positive stain), using the multiple modules in GeneCensus, we show that they differ radically in many other genomic properties.
CONCLUSION: GENECENSUS, A COMPARATIVE GENOMICS DATABASE AND TOOL |
---|
ACKNOWLEDGEMENT |
---|
REFERENCES |
---|