Nucl. Acids. Res. -- Lin et al. 30 (20): 4574

HOME

HELP

FEEDBACK

SUBSCRIPTIONS

GeneCensus: genome comparisons in terms of metabolic pathway activity and protein family sharing

J. Lin, J. Qian, D. Greenbaum, P. Bertone, R. Das, N. Echols, A. Senes, B. Stenger and M. Gerstein^*

Department of Molecular Biophysics and Biochemistry, Yale University, PO Box 208114, New Haven, CT 06520, USA

*To whom correspondence should be addressed. Tel: +1 203 432 6105; Fax: +1 360 838 7861; Email: mark.gerstein@yale.edu

Received February 6, 2002; Revised and Accepted August 8, 2002

ABSTRACT

TOP
ABSTRACT
INTRODUCTION
FEATURE OVERVIEW: TREEVIEWER
FEATURE OVERVIEW: PATHWAYPAINTER
MODULES
GENE EXPLORING: PRACTICAL...
CONCLUSION: GENECENSUS, A...
REFERENCES

We present a prototype of a new database tool, GeneCensus, whichfocuses on comparing genomes globally, in terms of the collectiveproperties of many genes, rather than in terms of the attributesof a single gene (e.g. sequence similarity for a particularortholog). The comparisons are presented in a visual fashionover the web at GeneCensus.org. The system concentrates on twotypes of comparisons: (i) trees based on the sharing of generalizedprotein families between genomes, and (ii) whole pathway analysisin terms of activity levels. For the trees, we have developeda module (TreeViewer) that clusters genomes in terms of thefolds, superfamilies or orthologs—all can be consideredas generalized ‘families’ or ‘protein parts’—theyshare, and compares the resulting trees side-by-side with thosebuilt from sequence similarity of individual genes (e.g. a traditionaltree built on ribosomal similarity). We also include comparisonsto trees built on whole-genome dinucleotide or codon composition.For pathway comparisons, we have implemented a module (PathwayPainter)that graphically depicts, in selected metabolic pathways, thefluxes or expression levels of the associated enzymes (i.e.generalized ‘activities’). One can, consequently,compare organisms (and organism states) in terms of representationsof these systemic quantities. Develop ment of this module involvedcompiling, calculating and standardizing flux and expressioninformation from many different sources. We illustrate pathwayanalysis for enzymes involved in central metabolism. We areable to show that, to some degree, flux and expression fluctuationshave characteristic values in different sections of the centralmetabolism and that control points in this system (e.g. hexokinase,pyruvate kinase, phosphofructokinase, isocitrate dehydrogenaseand citric synthase) tend to be especially variable in fluxand expression. Both the TreeViewer and PathwayPainter modulesconnect to other information sources related to individual-geneor organism properties (e.g. a single-gene structural annotationviewer).

	ABSTRACT

INTRODUCTION

TOP
ABSTRACT
INTRODUCTION
FEATURE OVERVIEW: TREEVIEWER
FEATURE OVERVIEW: PATHWAYPAINTER
MODULES
GENE EXPLORING: PRACTICAL...
CONCLUSION: GENECENSUS, A...
REFERENCES

Advances in sequencing technology have created the opportunityto perform large-scale genome comparisons. Presently, thereare many systems focusing on specific types of comparisons formany genomes [e.g. COG (1), PENDANT (2), or KEGG (3), WIT (4),MUMmer (5)]. Conversely, there are other systems analyzing singlegenomes from many perspectives [e.g. Flybase (6) MIPS (7), orSGD (8), ECOCYC (9)]. We present here a new prototype tool (outlinedgraphically in Fig. 1) that compares multiple genomes throughmultiple and some novel criteria.

	INTRODUCTION

View larger version (42K):
[in this window]
[in a new window]

Figure 1. (Opposite) A pictorial overview of GeneCensus through screenshots. The top image shows the homepage, which, in addition to linking to pages in GeneCensus, also provides links to multiple other bioinformatics resources, including pages on gene expression, protein interactions and pseudogenes. GeneCensus bifurcates into two semi-independent modules, the TreeViewer and PathwayPainter, shown on the second level of the diagram. Information relevant to TreeViewer can be accessed through the secondary modules (as seen on the third tier) such as the OrganismViewer. Similarly, enzyme-specific data, as opposed to genome-specific enzyme data, can be viewed. The fourth and final level of GeneCensus provides more specific information for many of the ORFs in the ORFViewer, as well as some smaller modules with less generalized information. These include information on: (i) transmembrane proteins, (ii) pseudogenes, (iii) thermophile analysis, and (iv) data on folds for both the worm and yeast genome.

Our approach to genome comparison is two-fold. Our first view,TreeViewer, displays genome-wide comparisons through tree buildingbased upon different characteristics of the genomes (10). Thesecharacteristics include broad statistics, such as fold and genecontent and amino acid composition. The trees that we providecan be compared against other information, and dynamically reconfiguredbased on different genomic characteristics. Our second viewer,PathwayPainter, provides the user with an extensive comparisonof genomes in terms of their mRNA expression, flux (11) andpercent identity (PID), in three major metabolic pathways: TCA,glycolysis and pentose phosphate. Both of these views are linkedto additional modules representing more traditional analysisformats. These include modules that examine open reading frames(ORFs), organisms, and various compositions of genomes.

In general, it is relatively difficult to integrate disparateinformation sources into one comprehensive database; it is difficultto determine which data sets will be useful under differentcircumstances. We present some useful examples in the Gene Exploringsection on how one can extract biologically relevant and novelinformation from our database. Nevertheless, these demonstrations,using specific features within GeneCensus, do not provide areason for the inclusion or exclusion of any specific featuresin the database.

FEATURE OVERVIEW: TREEVIEWER

TOP
ABSTRACT
INTRODUCTION
FEATURE OVERVIEW: TREEVIEWER
FEATURE OVERVIEW: PATHWAYPAINTER
MODULES
GENE EXPLORING: PRACTICAL...
CONCLUSION: GENECENSUS, A...
REFERENCES

The comparative TreeViewer (outlined graphically in Fig. 2)is an online interface for displaying previously computed trees,and also acts as a tool for comparing trees built using differentmethods. The organisms included in the tree server provide fordiverse phylogenetic comparisons. They encompass all three kingdomsof life (Eukarya, Bacteria, Archaea), diverse environments (normalto extreme), and a wide range of genome sizes (0.6–97Mb).

	FEATURE OVERVIEW: TREEVIEWER

View larger version (48K):
[in this window]
[in a new window]

Figure 2. An annotated close-up of the TreeViewer module. The figure highlights the important parts of the web page format. The top bar, which is maintained throughout the site, provides a search option, a help file and links to PartsList (36), NESGC (41) and Molecular Motions Database (48). To manipulate the view of the data on the web page we provide a menu bar to select which type of tree to view and a second menu bar to determine in which secondary dimension to view the tree. In addition, there are multiple color-coded links next to each organism—green for metabolic pathways, blue for the organism page and red for other Yale pages associated with that organism. For examples of the multiple views, we present a COG tree viewed through genome composition and a fold tree viewed in comparison to the traditional ribosomal tree.

The architecture of the tree server is two-dimensional. Thefirst dimension, based on the methods in which the trees arebuilt, include: gene occurrence, fold composition, dinucleotidefrequency, COGs, metabolic pathways and traditional ribosomaltrees. The second dimension provides information with whichwe can compare the trees. These characteristics include: taxonomy,fold composition, COGs, AT-content, genome size and superfamilyoccurrence. Thus, one has the ability to select which tree tobuild, and in which context that tree will be compared. In addition,one has the ability to compare many trees with the traditionalribosomal tree; each of these options can be selected via thelight blue user interface elements. The techniques and proceduresfor building these trees are explained in Lin and Gerstein (10).The trees that are currently available can be subdivided intothe following self-explanatory sections.

Ribosome
These trees are built through comparing the similarity of theribosomal RNA. This traditional method (12) for phylogeneticanalysis is based on the small subunit ribosomal RNA (SSU rRNA).For comparison, the trees based on the large subunit (LSU) arealso provided.

Folds
These trees are built based on the presence or absence of foldsin different organisms, as determined by Hegyi et al. (13).In addition, we compare trees based upon the subdivision offolds into classes (all alpha, all beta, alpha + beta and alpha/beta).A further comparison in this category differentiates betweenthe distance-based and parsimony techniques for tree building.

Superfamilies
Superfamilies are less broad structural groupings than folds,and because of their greater number, they have been found tobe more differentiating, producing trees similar to the traditionalphylogeny. The data were collected using a similar approachto Hegyi and Gerstein (14).

COGs
We also compare the genomes based on the occurrence of orthologousgenes based on COGs, clusters of orthologous genes (1). Treeswere built for the three major types of COGs (i.e. metabolism,cellular processes, information storage and processing), aswell as for the smaller functional categories. We representthese categories using single letters in the user interface.

Composition
These trees are built on the simple composition of the aminoacids and dinucleotides. The trees marked raw are based on theabsolute number of amino acids and dinucleotides. These valuesare used to generate a vector and the calculated distance. Forthe other trees, the numbers calculated are normalized by thetotal number, producing percentages, which were used to generatea distance matrix for tree construction.

ORFs
This set of trees is composed of trees built on the sequencesimilarity of homologous genes. The genes chosen for this comparisonwere present in the genomes only once; thus paralogous geneswere not a factor.

Enzymes
The sequence similarity of individual enzymes in the three centralmetabolic pathways (the TCA cycle, glycolysis and the pentosephosphate pathway) was used to construct these trees.

FEATURE OVERVIEW: PATHWAYPAINTER

TOP
ABSTRACT
INTRODUCTION
FEATURE OVERVIEW: TREEVIEWER
FEATURE OVERVIEW: PATHWAYPAINTER
MODULES
GENE EXPLORING: PRACTICAL...
CONCLUSION: GENECENSUS, A...
REFERENCES

PathwayPainter provides a multi-organism representation of threemajor metabolic pathways and their component enzymes. (See Fig.3 for a graphical outline.) It is divided into two views: (i)the Pathway View provides a macro-view—a schematic ofeach metabolic cycle flanked by the flux or various expressionvalues for each of the enzymes; (ii) the Enzyme View providesa micro-view; the data (expression, flux, PID) is presentedwith reference to each individual enzyme.

	FEATURE OVERVIEW: PATHWAYPAINTER

View larger version (33K):
[in this window]
[in a new window]

Figure 3. The second major module, the PathwayPainter. As with the TreeViewer, the top menu bar is maintained. This page is built around the metabolic pathway. We present an overview image of two pathways in the center of the page. Flanking the image are the component enzymes, with a value next to each enzyme. Using the menu on the top of the page, the user can select the desired value for those enzymes in that pathway. These include the flux values for multiple organisms, various expression values for yeast and E.coli, and PID variability. Additionally, each enzyme links to an enzyme-oriented page which displays the data in its entirety for that specific enzyme.

Pathway View
The Pathway View allows the user to compare information foreach enzyme in the form of: (i) flux values (normalized, absoluteand standard deviation); (ii) average and standard deviationof gene expression change (DNA and cDNA arrays) (explained below);(iii) PID (between orthologous enzymes in the pathway for multipleorganisms). Presently, the flux and PID information is availablefor Escherichia coli, Saccharomyces cerevisiae, Bacillus subtilis,Haemophilus influenzae and Helicobacter pylori. Two sets ofinformation can be independently selected to display the dataof choice, and these sets are labeled right column and leftcolumn. An overview map of the pathways is available in thecenter column for reference.

Flux analysis
Flux, a measure of the rate at which metabolites are processedto become output, is calculated at a steady state (15). Determinationof flux provides critical information for rational pathway modificationand metabolic engineering (16,17). While there are many publishedmaps of pathways illustrating the processing of basic metabolites,they provide little in terms of describing pathway fluxes underdiverse conditions.

We obtained raw absolute flux values for three organisms (S.cerevisiae,B.subtilis, E.coli) (18–20) (These are reported as ‘absolute’fluxes on the website). For two organisms (H.influenzae andH.pylori), we calculated theoretical relative flux values usingstoichiometric analysis. We describe this calculation here.Our first step involves reconstructing the map of the centralmetabolic pathway in the two organisms using information fromthe KEGG metabolic database (21). It is known that H.influenzae(22) and H.pylori (23) have incomplete TCA cycles. We decomposedthe reconstructed pathway into elementary modes using the METATOOLsoftware (24). Each elementary mode consists of a minimal setof enzymes that could operate at steady state with all irreversiblereactions proceeding in the appropriate direction and furtherreduced to omit extraneous metabolites not necessary for thenet reaction (25). One should note that there is more than oneelementary mode that can connect two chemical species. In orderto choose the best elementary mode that represents the mostefficient routes of chemical conversion, we optimized the processby using a combined objective function of maximization of ATPand minimization of glucose use. We obtained the ratio of theend products of glucose metabolism produced from earlier studies.For example, H.influenzae produces succinate and acetate asthe end products of glucose metabolism in the ratio of 4.3:1(22). These ratios act as constraints in the optimization process.We used the LINDO software to carry out this optimization.

Finally, we merged the results for all five organisms and normalizedthe flux values to make them comparable. Normalization of valuesis done with respect to glucose intake (i.e. the entry of glucosein the metabolic pathway is considered to represent 100% relativeflux). We computed the relative flux that inputs into variouspathway routes as a fraction of this initial amount. Therefore,even though the actual pathway flux can vary from one organismto another, normalized fluxes are comparable in a relative sense.We report these final normalized values on the website.

Expression levels
PathwayPainter encompasses information from a variety of geneexpression experiments, corresponding to data collected withboth DNA and cDNA arrays. In particular, we have collected variousmicroarray data sets from the Stanford Microarray Database (26)and extracted expression data for each enzyme in the followingthree pathways: (i) TCA cycle, (ii) glycolysis and (iii) pentosephosphate. While the determination of gene expression levelsusing high-throughput experimentation is a growing field, wepresent here mainly data sets derived from yeast experiments,the most common organism for expression analysis to date. Thedynamic nature of GeneCensus will allow us to provide additionalmicrobial expression data sets as they become available.

In the current version of GeneCensus we focus on six individualexperiments: (i) cell-cycle experimentation of yeast cells synchronizedby alpha factor arrest (27); (ii) a second cell-cycle experimentwhere yeast cells were similarly synchronized via the arrestof a cdc15 temperature-sensitive mutant (27); (iii) a yeastdiauxic shift experiment measuring the temporal program of geneexpression following a metabolic shift from fermentation torespiration (28); (iv) an assay measuring the change in yeastexpression during sporulation (29); (v) an experiment capturingthe cellular response of E.coli following exposure to UV radiation(30); and (vi) a profile of gene expression in the germ lineof Caenorhabditis elegans (31).

Enzyme View
The Enzyme View of PathwayPainter details the absolute and normalizedflux levels for the enzyme in each genome. Additionally, itallows for the visualization of sequence similarity betweenthe organisms compared with the specific enzyme for which theflux is being measured. Below the PID table, another table outlinesthe expression values of that specific enzyme in yeast and E.coliunder multiple conditions. Finally, links are provided to theTreeViewer wherein the user can view trees based on that particularenzyme.

In relation to gene expression, for each enzyme, we report thefollowing. (i) Raw unscaled values R as available from the varioussites as either copies per cell, log₂(ratio of mRNA levels),or normalized transcript level divided by mean value, dependingon the specific data set. We represent these as P(i, t), whichrepresent the expression of gene i at time t. The calculatedratios are thus R(i, t) = P(i, t)/P(i, r) in which P(i, r) isthe reference state. (ii) Multi-experiment scaled expressionvalue M derived from multiple experiments (32), which providesa standard of comparison for expression data. This is derivedfrom scaling together various microarray and SAGE data setsand is on an absolute scale in copies per cell (32). (iii) Averageexpression ratio change C over of the length of the profile,which is calculated by C = $<$ R(i, t) $>$ . This measures the degreeof variability in expression in a particular experiment fora given enzyme. (iv) Expression ratio fluctuation E was calculatedusing the standard deviation of expression ratios (i.e. .Note that enzymes consistently expressed under most conditionswill show minimal standard deviations that are closely correlatedbetween experiments.

MODULES

TOP
ABSTRACT
INTRODUCTION
FEATURE OVERVIEW: TREEVIEWER
FEATURE OVERVIEW: PATHWAYPAINTER
MODULES
GENE EXPLORING: PRACTICAL...
CONCLUSION: GENECENSUS, A...
REFERENCES

In addition to the TreeViewer and the PathwayPainter we alsoprovide further subsidiary modules. These are the ORF, Organismand Composition viewers.

	MODULES

ORFViewer
The ORFViewer module provides various resources related to agiven ORF. For example, protein structural annotations are graphicallyrepresented on a plot for every ORF. These include: (i) PDBPSI-BLAST matches; (ii) regions of low complexity [identifiedwith the SEG program using standard parameters K(1) = 3.4, K(2)= 3.75 and a window size of 45 residues (33)]; (iii) transmembranesegments [defined using the GES hydrophobicity scale (34)];(iv) linker regions (e.g. low complexity regions); and (v) uncharacterizedregions.

We also provide organism-specific information. For example,additional information available for yeast includes gene expressiondata, subcellular localization (35), protein structural comparisons(36), previously unannotated genes (37), transposon tagging,gene disruptions (38) and data from protein chip experiments(39). Organism-specific information for Mycoplasma genitaliuminformation includes a breakdown of structural characterizationsof the genome and related NESGC construct database entries (40,41).

The proliferation of multiple interchangeable nomenclature schemesremains an outstanding problem in compiling ORF annotation.For example, YJR155W, AADA_YEAST, AAD10, J2245, P47182 and CAA89688.1all refer to the identical ORF in the yeast genome. As a solution,we include Smartlink in GeneCensus; Smartlink is a translatorthat integrates all the disparate systems and maps one nomenclatureto another.

OrganismViewer
The OrganismViewer integrates varied information about an organism.For each organism, we provide an amino acid composition viewer(see below for more details), links to the source data, andadditional links to external resources. These pages were designedas an open system, allowing users to add relevant resource links.We will continue to integrate more studies and resources foreach genome.

The OrganismViewer not only presents the research specific toGeneCensus, but also integrates research from other institutionsand acts as a central resource and link manager for each genome.For example, the C.elegans page provides comprehensive analysisof pseudogenes and protein folds; alternatively, the S.cerevisiaepage includes studies of the relationship between fold and function,microarray expression data analysis (42), clustering of phenotypepatterns (38), subcellular localization of proteins (35), compositionof individual ORF features (43) and comparison of the yeastand worm genomes in terms of folds (44). We have also includeda statistical analysis of thermophilic and mesophilic genomes(45).

CompositionViewer
In early cryptography, it was discovered that one could breakciphers by analyzing the frequency of letters/symbols and theircombinations within a coded text. When the frequency of a letteror letter combination differs from the expected frequency ofa randomly inserted letter, it can be interpreted as significant—andpossibly deciphered. Similarly, a protein sequence can be decodedand important and significant amino acid combinations discernedby examining the frequency and specific occurrences of entitieswhose occurrence is higher than they would be if inserted randomly(i.e. the expected occurrence of a given amino acid within asequence). These combinations may be important for conceptssuch as binding or protein structure (46). Additionally, determiningthe individual amino acid composition of the ORFs in thermophilicorganisms is essential for understanding their stability inextreme environments.

The CompositionViewer, another module within Gene Census, providescomposition information about genomes similar to that used tobuild and compare trees in the TreeViewer; the CompositionVieweris a valuable tool for analyzing the aforementioned amino acidpair patterns in a genome. Included within this component are(i) calculations of amino acid composition, (ii) secondary structure,and (iii) a tool for dynamically calculating amino acid compositionpairs. Pairs, in terms of amino acid composition, are definedas combinations of an amino acid residue with another residueat separation i, i + k. For instance, AL3 corresponds to analanine residue and a leucine residue at i, i + 3 (AxxL). Thesignificance of over- and under-represented pairs is calculatedby comparing the observed occurrence of the pair in a databasewith a random expectation distribution. The random distributionis calculated as the average of any possible internal permutationsof all sequences of the database. The significance correspondsto the probability of observing the same or a larger differencebetween observed and expected occurrences of a pair in randomsequences.

For membrane data we used the TMSTAT method (43) to analyzefrequently occurring combinations of residues of transmembranedomains. The server returns the most significant over- and under-representedsinglets and pairs in each database.

This viewer is accessed either through the organism viewer ordirectly at http://bioinfo.mbb.yale.edu/genome/tmstat/comp.cgi.

GENE EXPLORING: PRACTICAL RESULTS USING GENECENSUS

TOP
ABSTRACT
INTRODUCTION
FEATURE OVERVIEW: TREEVIEWER
FEATURE OVERVIEW: PATHWAYPAINTER
MODULES
GENE EXPLORING: PRACTICAL...
CONCLUSION: GENECENSUS, A...
REFERENCES

PathwayPainter illustration
Given the difficulty of describing our database without actuallyproviding a manual, we attempt to provide directions for usingthe data, and a illustration of the utility of the system bypresenting some biologically relevant qualitative conclusionsthat can be extracted from our database.

	GENE EXPLORING: PRACTICAL RESULTS USING GENECENSUS

In Figure 4, we illustrate the scientific conclusions one canderive from PathwayPainter through comparing the variation inexpression, flux and PID of enzymes over many different experimentsand organisms. We present some of our findings here. Additionally,we show how these data may be utilized to determine which enzymescan best be used as internal controls for normalizing microarrays.The data presented in the figure include: (i) the average expressionchange C, which is the average of all the expression ratiosfrom all the time points between two conditions; (ii) expressionfluctuation E, which represents the standard deviation of theexpression ratios from all experimental time points; (iii) fluxvariation F, which indicates the standard deviation of fluxfrom the different organisms; and (iv) sequence similarity S,which is the average percentage similarity of orthologous pairs.Notably, the experiments can be compared not only with the averageexpression change C, but also with expression fluctuation E.Both values are important for a clear understanding of the expressionratio profiles, since enzymes may have very different averageexpression change and expression fluctuation values. For example,in the cdc15-arrested cell-cycle experiment, both citric synthase(4.1.3.7) and glycolaldehyde-transferase (2.2.1.1) exhibit lowaverage expression change but very high expression fluctuation.

View larger version (50K):
[in this window]
[in a new window]

Figure 4. A cross-section of the results that can be seen with the PathwayPainter module. Enzymes were chosen from three metabolic pathways: citric acid cycle (blue), glycolysis (green) and pentose phosphate pathway (red); the information presented includes expression, flux and sequence similarity data. We present expression data, the relative expression of a gene in relation to a control, from six experiments: yeast diauxic shift, yeast sporulation, E.coli UV response, C.elegans mutant germline, and two yeast cell-cycle expression sets. We summarized the data by calculating the standard deviation and the average for each enzyme profile in each experiment, as well as combined statistics for all the experiments. Values in the top quartile were shaded black, in the middle two, gray, and in the bottom quartile, white. Sequence similarities of the enzymes were calculated by averaging the percentage sequence identity between orthologous genes. These are shaded in the same fashion as the expression values. For the flux values, we calculated the standard deviation of the flux values for all organisms examined. Values in the top quartile are colored aqua, middle two quartiles, yellow, and bottom quartile, purple. We show a schematic of all three pathways, with enzyme numbers color coded by pathway. The arrows representing the reaction are colored by the degree of flux variation; this seems to correlate closely with the pathways. TCA shows the greatest flux values and the pentose phosphate pathway comprises the lowest. The pink crosses label all the irreversible control points in the metabolic system. The average PID of the enzyme seems to have little correlation with the expression or flux values. Clearly, the figure also shows a relationship between flux and the enzyme’s resources, including placement in the overall pathway structure.

We make five key observations: (i) experiment comparison, (ii)subsystem analysis, (iii) enzyme comparisons (control pointcharacteristics), (iv) PID and (v) normalization.

Experiment comparison. In Figure 4, PathwayPainter is used tocompare six expression experiments. There are some notable globaldifferences between these experiments. Both the E.coli and theworm expression sets show higher average expression change C,reflecting the changes in worm development and the effects ofUV on E.coli. Conversely, the cell-cycle experiments show smalleraverage expression changes, reflecting the more constant stateof housekeeping genes (e.g. metabolic pathway enzymes) withinthe cell. As expected in the diauxic shift experiment, the TCAcycle enzymes have high values for average expression changesC and expression fluctuations E; this substantiates previousobservations that the change in medium for yeast increases theexpression of TCA enzymes (28). In the yeast sporulation experiments,the positive and negative values in the average expression changecapture the up- and down-regulation of different enzymes inthe system.

Subsystem analysis. Different pathways or subsystems of centralmetabolism exhibit specific trends and characteristics. Figure4 highlights one of these characteristics by coloring the arrowsaccording to their flux variation F, with the highest valuesdepicted in blue, median values in green and the lowest in red.From the schematic, as well as the table, one can see that theTCA cycle has all of the highest flux variation F, indicatingthat that the TCA cycle changes the most in metabolite processing.Biologically, this confirms the notion that the TCA cycle functionsvery differently depending on environmental factors, such asaerobic and anaerobic conditions. The flux variation for glycolyticenzymes is near the average, except for triphosphate isomerase(5.3.1.1), which provides a shunt in the pathway. Similarly,the expression fluctuation, E, also correlates well with thedivision into subsystems; most of the highest values (>0.47)belong to the TCA cycle, the lowest values (<0.38) belongto the pentose phosphate pathway and the middle values belongto glycolysis. Expression fluctuation and flux variation clusterssimilarly to pathway divisions and correlate well with eachother.

Enzyme comparisons. PathwayPainter also allows for multiplecomparisons of specific enzymes. Expression variation is particularlyevident at branch points and control points (where reactionsare essentially irreversible). These points represent thoseenzymes with the greatest expression fluctuations (both C andE). For example, isocitrate dehydrogenase (1.1.1.42) and citricsynthase (4.1.3.7), which have the two highest average expressionchanges, are both important control points in central metabolism.We performed additional analyses on other control point proteinsand found that their average expression changes were very highas well. For example, phosphofructokinase (2.7.1.11), hexokinase(2.7.1.2) and pyruvate kinase (2.7.1.40) have values of 1.1,0.6 and 0.6, respectively, representing an almost 100% increasecompared to many of the other enzymes. In situations where thecell is perturbed, the response is reflected in the change ofexpression in the control point enzymes; in the worm development,E.coli response to UV, and yeast sporulation experiments, anincrease in average expression change C at branch points (4.1,3.7,1.1.1.37) is observed.

By comparing expression variability of an enzyme across multipledata sets, we have shown that many of the important metaboliccontrol points are most variable in terms of expression. Thisvariability indicates the intricate regulation of these enzymes.

Percentage identity. We found that sequence similarity S doesnot correlate strongly with either flux variation F or averageexpression change C. We conclude that sequence identity is nota predictor of expression or flux values.

Normalization. The expression-variability data in Pathway Paintercan also be used to assess whether a group of genes can be appliedto the normalization of microarray data between experimentsin different organisms and under different experimental conditions.Presently, there are many efforts underway to determine robustmethods to normalize microarray data through internal controls(47). For example, attempts have been made to establish normalizationbased on housekeeping genes, that is, those genes thought tobe consistently expressed in the vast majority of conditions.We propose that the detailed study of the expression variabilityin metabolic enzymes shown by the PathwayPainter module canbe useful in determining which enzymes could potentially beused as constants in microarray normalization approaches.

TreeViewer illustration
Informative pan-genomic analyses can be performed with the GeneCensusTreeViewer module. For example, the traditional ribosomal treegroups Gram-positive bacteria into one homogeneous group. However,further analysis using other types of trees subdivides themin informative ways. In particular, a number of the trees availablein GeneCensus show that the bacteria B.subtilis and Mycobacteriumtuberculosis tend to cluster independently of the other Gram-positivebacteria M.genitalium and Mycoplasma pneumoniae. This is a productof the radically differing size and gene compositions spanningthe Gram-positive class. Further analysis of Gram-positive bacteriausing the composition module shows that these two organismshave a high percentage of guanine as opposed to the other Gram-positivebacteria. Thus, while we may link them together due to the highpeptoglycan content in their cellular walls (i.e. resultingin a Gram-positive stain), using the multiple modules in GeneCensus,we show that they differ radically in many other genomic properties.

CONCLUSION: GENECENSUS, A COMPARATIVE GENOMICS DATABASE AND TOOL

TOP
ABSTRACT
INTRODUCTION
FEATURE OVERVIEW: TREEVIEWER
FEATURE OVERVIEW: PATHWAYPAINTER
MODULES
GENE EXPLORING: PRACTICAL...
CONCLUSION: GENECENSUS, A...
REFERENCES

GeneCensus is part of the new generation of tools that helpresearchers navigate this post-genomic world. Overall, GeneCensusprovides many levels of information. Whether one is interestedin comparing genomes based on whole genome or pathway properties,looking at flux or expression information in specific organisms,or studying specific genes or pathways, the fluid incorporationof many different sources of data in GeneCensus allows researchersto view, or dynamically calculate, their information of interest,as well as investigate related data ranging from the amino acidlevel to multiple organism comparisons. Researchers can thusgain global system views and put their research interest intoperspective within the vast sea of genomic data now available;through GeneCensus they can integrate varied data sources inan effort to actualize genome annotation. Given the high degreeof false negatives and positives in many of the high-throughputgenomic experiments, much of the data cannot, on its own merits,provide information for annotation. Fortunately, the noise thataccompanies all of these experiments is not systematic, andthe integration of the various data sets in, for instance, thephylogenetic analyses, allows users to cross-validate and improvethe accuracy of their results.

	CONCLUSION: GENECENSUS, A COMPARATIVE GENOMICS DATABASE AND TOOL

ACKNOWLEDGEMENT

M.G. acknowledges support from the NIH (P01 GM54160-02).

	ACKNOWLEDGEMENT

REFERENCES

TOP
ABSTRACT
INTRODUCTION
FEATURE OVERVIEW: TREEVIEWER
FEATURE OVERVIEW: PATHWAYPAINTER
MODULES
GENE EXPLORING: PRACTICAL...
CONCLUSION: GENECENSUS, A...
REFERENCES

	REFERENCES

Wheeler,D.L., Church,D.M., Lash,A.E., Leipe,D.D., Madden,T.L., Pontius,J.U., Schuler,G.D., Schriml,L.M., Tatusova,T.A., Wagner,L. et al. (2002) Database resources of the National Center for Biotechnology Information: 2002 update. Nucleic Acids Res., 30, 13–16.[Abstract/Full Text]
Wixon,J. and Kell,D. (2000) The Kyoto encyclopedia of genes and genomes—KEGG. Yeast, 17, 48–55.[Medline]
Frishman,D., Albermann,K., Hani,J., Heumann,K., Metanomski,A., Zollner,A. and Mewes,H.W. (2001) Functional and structural genomics using PEDANT. Bioinformatics, 17, 44–57.[Abstract]
Overbeek,R., Larsen,N., Pusch,G.D., D’Souza,M., Selkov,E.,Jr, Kyrpides,N., Fonstein,M., Maltsev,N. and Selkov,E. (2000) WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction. Nucleic Acids Res., 28, 123–125.[Abstract/Full Text]
Delcher,A.L., Phillippy,A., Carlton,J. and Salzberg,S.L. (2002) Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res., 30, 2478–2483.[Abstract/Full Text]
Gelbart,W.M., Crosby,M., Matthews,B., Rindone,W.P., Chillemi,J., Russo Twombly,S., Emmert,D., Ashburner,M., Drysdale,R.A., Whitfield,E. et al. (1997) FlyBase: a Drosophila database. The FlyBase consortium. Nucleic Acids Res., 25, 63–66.[Abstract/Full Text]
Mewes,H.W., Frishman,D., Guldener,U., Mannhaupt,G., Mayer,K., Mokrejs,M., Morgenstern,B., Munsterkotter,M., Rudd,S. and Weil,B. (2002) MIPS: a database for genomes and protein sequences. Nucleic Acids Res., 30, 31–34.[Abstract/Full Text]
Dwight,S.S., Harris,M.A., Dolinski,K., Ball,C.A., Binkley,G., Christie,K.R., Fisk,D.G., Issel-Tarver,L., Schroeder,M., Sherlock,G. et al. (2002) Saccharomyces Genome Database (SGD) provides secondary gene annotation using the Gene Ontology (GO). Nucleic Acids Res., 30, 69–72.[Abstract/Full Text]
Karp,P.D., Riley,M., Saier,M., Paulsen,I.T., Collado-Vides,J., Paley,S.M., Pellegrini-Toole,A., Bonavides,C. and Gama-Castro,S. (2002) The EcoCyc Database. Nucleic Acids Res., 30, 56–58.[Abstract/Full Text]
Lin,J. and Gerstein,M. (2000) Whole-genome trees based on the occurrence of folds and orthologs: implications for comparing genomes on different levels. Genome Res., 10, 808–818.[Abstract/Full Text]
Holms,H. (2001) Flux analysis: a basic tool of microbial physiology. Adv. Microb. Physiol., 45, 271–340.[Medline]
Woese,C.R. and Fox,G.E. (1977) Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc. Natl Acad. Sci. USA, 74, 5088–5090.[Medline]
Hegyi,H., Lin,J., Greenbaum,D. and Gerstein,M. (2002) Structural genomics analysis: characteristics of atypical, common, and horizontally transferred folds. Proteins, 47, 126–141.[Medline]
Hegyi,H. and Gerstein,M. (2001) Annotation transfer for genomics: measuring functional divergence in multi-domain proteins. Genome Res., 11, 1632–1640.[Abstract/Full Text]
Stephanopoulos,G., Aristidou,A. and Nielsen,J. (1998) Metabolic Engineering: Principles and Methodologies, 1st Edn. Academic Press, San Diego.
Stephanopoulos,G. and Gill,R.T. (2001) After a decade of progress, an expanded role for metabolic engineering. Adv. Biochem. Eng. Biotechnol., 73, 1–8.[Medline]
Schilling,C.H., Schuster,S., Palsson,B.O. and Heinrich,R. (1999) Metabolic pathway analysis: basic concepts and scientific applications in the post-genomic era. Biotechnol. Prog., 15, 296–303.[Medline]
Schilling,C.H., Edwards,J.S. and Palsson,B.O. (1999) Toward metabolic phenomics: analysis of genomic data using flux balances. Biotechnol. Prog., 15, 288–295.[Medline]
Sauer,U., Hatzimanikatis,V., Hohmann,H.P., Manneberg,M., van Loon,A.P. and Bailey,J.E. (1996) Physiology and metabolic fluxes of wild-type and riboflavin-producing Bacillus subtilis. Appl. Environ. Microbiol., 62, 3687–3696.[Abstract/Full Text]
Nissen,T.L., Schulze,U., Nielsen,J. and Villadsen,J. (1997) Flux distributions in anaerobic, glucose-limited continuous cultures of Saccharomyces cerevisiae. Microbiology, 143, 203–218.[Abstract]
Kanehisa,M., Goto,S., Kawashima,S. and Nakaya,A. (2002) The KEGG databases at GenomeNet. Nucleic Acids Res., 30, 42–46.[Abstract/Full Text]
Tuyau,J.E., Sims,W. and Williams,R.A. (1984) The acid end-products of glucose metabolism of oral and other haemophili. J. Gen. Microbiol., 130, 1787–1793.[Medline]
Kelly,D.J. (1998) The physiology and metabolism of the human gastric pathogen Helicobacter pylori. Adv. Microb. Physiol., 40, 137–189.[Medline]
Pfeiffer,T., Sanchez-Valdenebro,I., Nuno,J.C., Montero,F. and Schuster,S. (1999) METATOOL: for studying metabolic networks. Bioinformatics, 15, 251–257.[Abstract/Full Text]
Schuster,S., Dandekar,T. and Fell,D.A. (1999) Detection of elementary flux modes in biochemical networks: a promising tool for pathway analysis and metabolic engineering. Trends Biotechnol., 17, 53–60.[Medline]
Sherlock,G., Hernandez-Boussard,T., Kasarskis,A., Binkley,G., Matese,J.C., Dwight,S.S., Kaloper,M., Weng,S., Jin,H., Ball,C.A. et al. (2001) The Stanford Microarray Database. Nucleic Acids Res., 29, 152–155.[Abstract/Full Text]
Spellman,P.T., Sherlock,G., Zhang,M.Q., Iyer,V.R., Anders,K., Eisen,M.B., Brown,P.O., Botstein,D. and Futcher,B. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, 3273–3297.[Abstract/Full Text]
DeRisi,J.L., Iyer,V.R. and Brown,P.O. (1997) Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278, 680–686.[Abstract/Full Text]
Chu,S., DeRisi,J., Eisen,M., Mulholland,J., Botstein,D., Brown,P.O. and Herskowitz,I. (1998) The transcriptional program of sporulation in budding yeast. Science, 282, 699–705.[Abstract/Full Text]
Courcelle,J., Khodursky,A., Peter,B., Brown,P.O. and Hanawalt,P.C. (2001) Comparative gene expression profiles following UV exposure in wild-type and SOS-deficient Escherichia coli. Genetics, 158, 41–64.[Abstract/Full Text]
Reinke,V., Smith,H.E., Nance,J., Wang,J., Van Doren,C., Begley,R., Jones,S.J., Davis,E.B., Scherer,S., Ward,S. et al. (2000) A global profile of germline gene expression in C. elegans. Mol. Cell, 6, 605–616.[Medline]
Greenbaum,D., Jansen,R. and Gerstein,M. (2002) Analysis of mRNA expression and protein abundance data: an approach for the comparison of the enrichment of features in the cellular population of proteins and transcripts. Bioinformatics, 18, 585–596.[Abstract/Full Text]
Wootton,J.C. and Federhen,S. (1996) Analysis of compositionally biased regions in sequence databases. Methods Enzymol., 266, 554–571.[Medline]
Engelman,D.M., Steitz,T.A. and Goldman,A. (1986) Identifying nonpolar transbilayer helices in amino acid sequences of membrane proteins. Annu. Rev. Biophys. Biophys. Chem., 15, 321–353.[Medline]
Drawid,A. and Gerstein,M. (2000) A Bayesian system integrating expression data with sequence patterns for localizing proteins: comprehensive application to the yeast genome. J. Mol. Biol., 301, 1059–1075.[Medline]
Qian,J., Stenger,B., Wilson,C.A., Lin,J., Jansen,R., Teichmann,S.A., Park,J., Krebs,W.G., Yu,H., Alexandrov,V. et al. (2001) PartsList: a web-based system for dynamically ranking protein folds based on disparate attributes, including whole-genome expression and interaction information. Nucleic Acids Res., 29, 1750–1764.[Abstract/Full Text]
Kumar,A., Harrison,P.M., Cheung,K.H., Lan,N., Echols,N., Bertone,P., Miller,P., Gerstein,M.B. and Snyder,M. (2002) An integrated approach for finding overlooked genes in yeast. Nat. Biotechnol., 20, 58–63.[Medline]
Ross-Macdonald,P., Coelho,P.S., Roemer,T., Agarwal,S., Kumar,A., Jansen,R., Cheung,K.H., Sheehan,A., Symoniatis,D., Umansky,L. et al. (1999) Large-scale analysis of the yeast genome by transposon tagging and gene disruption. Nature, 402, 413–418.[Medline]
Zhu,H., Klemic,J.F., Chang,S., Bertone,P., Casamayor,A., Klemic,K.G., Smith,D., Gerstein,M., Reed,M.A. and Snyder,M. (2000) Analysis of yeast protein kinases using protein chips. Nature Genet., 26, 283–289.[Medline]
Balasubramanian,S., Schneider,T., Gerstein,M. and Regan,L. (2000) Proteomics of Mycoplasma genitalium: identification and characterization of unannotated and atypical proteins in a small model genome. Nucleic Acids Res., 28, 3075–3082.[Abstract/Full Text]
Bertone,P., Kluger,Y., Lan,N., Zheng,D., Christendat,D., Yee,A., Edwards,A.M., Arrowsmith,C.H., Montelione,G.T. and Gerstein,M. (2001) SPINE: an integrated tracking database and data mining approach for identifying feasible targets in high-throughput structural proteomics. Nucleic Acids Res., 29, 2884–2898.[Abstract/Full Text]
Gerstein,M. and Jansen,R. (2000) The current excitement in bioinformatics-analysis of whole-genome expression data: how does it relate to protein structure and function? Curr. Opin. Struct. Biol., 10, 574–584.[Medline]
Senes,A., Gerstein,M. and Engelman,D.M. (2000) Statistical analysis of amino acid patterns in transmembrane helices: the GxxxG motif occurs frequently and in association with beta-branched residues at neighboring positions. J. Mol. Biol., 296, 921–936.[Medline]
Gerstein,M., Lin,J. and Hegyi,H. (2000) Protein folds in the worm genome. Pac. Symp. Biocomput., 30–41.
Das,R. and Gerstein,M. (2000) The stability of thermophilic proteins: a study based on comprehensive genome comparison. Funct. Integr. Genomics, 1, 76–88.[Medline]
Bussemaker,H., Li,H. and Siggia,E. (2000) Building a dictionary for genomes: identification of presumptive regulatory sites by statistical analysis. Proc. Natl Acad. Sci. USA, 97, 10096–10100.[Abstract/Full Text]
Yang,M.C., Ruan,Q.G., Yang,J.J., Eckenrode,S., Wu,S., McIndoe,R.A. and She,J.X. (2001) A statistical method for flagging weak spots improves normalization and ratio estimates in microarrays. Physiol. Genomics, 7, 45–53.[Abstract/Full Text]
Krebs,W.G. and Gerstein,M. (2000) The morph server: a standardized system for analyzing and visualizing macromolecular motions in a database framework. Nucleic Acids Res., 28, 1665–1675.[Abstract/Full Text]

Abstract of this Article

Reprint (PDF) Version of this Article