Department of Molecular Biophysics and Biochemistry, Yale University, PO Box 208114, New Haven, CT 06520, USA, 1Department of Biochemistry and Molecular Biology, University College London, Darwin Building, Gower St, London WC1E 6BT, UK and 2European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
Received November 15, 2000; Revised and Accepted February 27, 2001.
ABSTRACT |
---|
TOP ABSTRACT INTRODUCTION ATTRIBUTES THAT CAN BE... RANKING ALL THE FOLDS... POWER-LAW BEHAVIOR OF MANY... TRADITIONAL SINGLE-STRUCTURE... DISCUSSION REFERENCES |
---|
INTRODUCTION |
---|
In a general sense, how should one approach the analysis of molecular parts? A simple analogy to mechanical parts may be useful in this regard. Given the parts from a number of devices (e.g. a car, a bicycle, and a plane) one might like to know which ones are shared by all and which are unique (say, wings for a plane). Furthermore, one might want to know which are common, generic parts and which are more specialized. Finally, one might like to organize the parts by a number of standardized attributes (e.g. the most flexible parts, the parts with the most functions, and the biggest parts). PartsList aims to provide answers to simple questions such as these for the domain of protein folds.
Properties related to protein folds can be divided into those that are intrinsic versus extrinsic. Intrinsic information concerns an individual fold itself, e.g. its sequence, 3D structure and function, while extrinsic information relates to a fold in the context of all other folds, e.g. its occurrence in many genomes and expression level in relation to that for other folds. Web-based search tools already provide intrinsic information about protein structures in the form of reports about individual structures. Valuable examples include the PDB Structure Explorer (5), PDBsum (6) and the MMDB (7). However, current resources lack the ability to fully present extrinsic information.
Likewise, while there are many databases storing information related to individual organisms (e.g. SGD, MIPS and FlyBase; 8–10), comparative genomics (PEDANT and COGs; 9,11), gene expression (GEO, the Gene Expression Omnibus at the NCBI, and ExpressDB; 12) and protein–protein interactions (DIP and BIND; 13,14), none of these integrates gene sequences, protein interactions, expression levels and other attributes with structure. (However, it should be mentioned that the Sacc3D module of SGD and PEDANT do tabulate the occurrence of folds in genomes.)
PartsList is arranged somewhat differently from most other biological resources. In a usual database (e.g. GenBank; 15) the number of entries increases as the database develops, while each entry has a fairly fixed number of attributes to describe it. In contrast, PartsList is envisioned to have a relatively stable number of entries, i.e. the finite list of protein folds, while the attributes that describe each entry are expected to increase considerably. In the current version of PartsList the properties for a protein fold include: amino acid composition, alignment information, fold occurrences in various genomes, statistics related to motions, absolute expression levels of yeast in different experiments, relative expression ratios for yeast, worm and Escherichia coli in various conditions, information on protein–protein interactions (based on whole-genome yeast interaction data and databank surveys) and sensitivity of the genes associated with the fold to inserted transposons.
One reason to build the database is to compare protein folds in a rich context and in a unified way. We achieve this through ranking, which allows users to directly compare very different attributes of a fold in a uniform numerical format. The rankings can be visualized in three ways: a 'profiler' emphasizing the progression of high and low ranks across many pre-selected attributes, a 'rankings comparer' for custom comparisons and a numerical 'rankings correlator'. These views can help users gain insight into the functions of protein folds in the context of the whole genome. Our system makes it easy to answer questions such as: 'What is the most common fold in the worm as compared to E.coli?', 'What is the most highly expressed fold in yeast and how does this compare to the fold that changes most in expression level during the cell cycle?' and 'Which fold has the most protein–protein interactions in the PDB and is it highly ranked in terms of protein motions?'
One of the strengths of the uniform numerical system of ranks in PartsList is that it puts everything into a common framework so that one can see hidden similarities in the occurrence of parts ordered according to many different attributes. In particular, as we describe below, we found that the frequency of many of the attributes falls off according to a power-law distribution (i.e. according to $V^{-b}$, for attribute value V and a constant b), with a few folds having large attribute values and most having small values. For instance, there are only a few folds that occur many times in the yeast genome and most only occur once or twice. Likewise, most folds only have a few functions associated with them, but there are a few 'Swiss-army-knife' folds that are associated with many distinct functions. Similar power-law-like expressions have been found to apply in a variety of other situations relating to proteins, for instance, in the occurrence of oligo-peptide words (16–18), in the frequency of transmembrane helices (19) and sequence families with given size (20), and in the structure of biological networks, with a few nodes having many connections and most having only a few (21,22).
PartsList is built on top of the Structural Classification of Proteins (SCOP) (23) fold classification and acts as an accompanying annotation to this system. SCOP is divided into a hierarchy of five levels: class, fold, superfamily, family and protein. The 'parts' in our system can be either SCOP folds or superfamilies. However, sometimes for ease of expression we will just refer to folds when we really mean folds and/or superfamilies. We currently use 420 folds and 610 superfamilies in PartsList. Each is represented by a single representative domain, which also serves as the key for the corresponding entry.
While we chose to use the SCOP classification, we could equally well have based the system on the other existing fold classifications, e.g. CATH (24), FSSP (25) or VAST (26,27). Moreover, for most attributes, we could also have developed our system around non-structural classifications of protein parts, e.g. Pfam (28), Blocks (29) or SMART (30). However, basing it around actual structural folds has the advantage that each part is more precisely and physically defined.
ATTRIBUTES THAT CAN BE RANKED: INFORMATION IN THE SYSTEM |
---|
We have developed a formalism for expressing each of the attributes, which is described in Table 1. In the table, the term PART refers to either fold or superfamily, depending on which of these is being ranked. Essentially, we have a database of attributes where each attribute is given a standardized description and associated with a precise reference. In the following, we describe some main categories of attributes.
The data were obtained in the following fashion: Once a library of folds has been constructed, representative sequences can be extracted (31). Then one can use these to search genomes by comparing each representative sequence against the genomes using the standard pairwise comparison programs, FASTA (32) and BLAST (33) and well-established thresholds (34).
Alternatively, one can build up profiles by running each representative sequence against PDB with PSI-BLAST and then comparing these profiles against each of the genomes. This latter procedure is more sensitive than pairwise comparison and relatively efficient once the profiles are made up. However, in doing large-scale surveys one has to be conscious of the potential biases introduced due to the profiles being more sensitive for larger families, which often results in the big families getting even bigger.
After the structure assignment, it becomes easy to enumerate how often a fold or structure feature occurs in a given genome or organism. Detailed information can be found in previous reports (H.Hegyi, J.Lin and M.Gerstein, manuscript submitted; 19,35,36). This pools assignments from previous work (37,38).
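The enumeration step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the flat `(genome, orf_id, fold_id)` tuple layout, the function name `fold_occurrence` and all identifiers in the toy data are hypothetical and are not taken from PartsList itself.

```python
from collections import Counter

def fold_occurrence(assignments):
    """Count how many times each fold is assigned within each genome.

    `assignments` is an iterable of (genome, orf_id, fold_id) tuples,
    a hypothetical flat representation of the structure assignments.
    """
    counts = {}
    for genome, _orf, fold in assignments:
        counts.setdefault(genome, Counter())[fold] += 1
    return counts

# Toy data; the genome, ORF and fold identifiers are illustrative only.
assignments = [
    ("Ecol", "orf1", "fold_A"),
    ("Ecol", "orf2", "fold_A"),
    ("Ecol", "orf3", "fold_B"),
    ("Scer", "orf9", "fold_A"),
]
occurrence = fold_occurrence(assignments)

# Folds ordered from most to least common in one genome.
ranked_ecol = occurrence["Ecol"].most_common()
```

Sorting each per-genome counter in this way gives the kind of ordered list on which a rank for each fold could be based.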
Alignment
Number of structures. We did a comprehensive set of structural alignments of structures in the PDB structure databank (39–41). The number of structures and aligned pairs used in these comparisons, which are based around Astral (31), give approximate measures of the occurrence of folds in the PDB. Comparison of these values to those for genome occurrence provides a measure of how biased the composition of the PDB is (42).
Sequence diversity. The scores from the alignments indicate the sequence diversity between the related structures within folds or superfamilies, in terms of percentage sequence identity and a sequence-based P value. P values are useful measures of the statistical significance of the similarity calculation. A P value is the probability that one can obtain the same or a better alignment score from a randomly composed alignment. An alignment with a smaller P value is therefore less likely to have been obtained by chance than one with a larger P value. Large P values close to 1.0 indicate that the similarity is characteristically random and thus insignificant.
Structural diversity. We also give analogous measures of the diversity of the structures with a given fold, allowing one to rank folds by their degree of variability. We tabulate untrimmed and trimmed RMS, along with the structural P value. RMS, the root-mean-squared deviation in Cα positions, has been the traditional statistic that gauges the divergence between two related structures. Smaller RMS scores indicate more closely related structures. However, sometimes a few ill-fitting atoms may significantly increase the RMS of structures known to be similar. To compensate for this we also report a trimmed RMS for a conserved core structure, which is based on the better fitting half of the aligned Cα atoms, and a structural P value, which compensates for other effects such as structure size. For details, see Wilson et al. (39).
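To make the difference between the untrimmed and trimmed RMS concrete, the following Python sketch computes both from a list of per-atom deviations. It rests on a simplifying assumption that is not from the text: the structures are taken as already superimposed, and the trimmed value is obtained by simply keeping the better-fitting half of the atoms without refitting. The procedure of Wilson et al. (39) may differ in these details, and the function names are hypothetical.

```python
import math

def rms(deviations):
    """Root-mean-square of the distances between aligned atom pairs."""
    return math.sqrt(sum(d * d for d in deviations) / len(deviations))

def trimmed_rms(deviations):
    """RMS over the better-fitting half of the aligned atoms.

    Assumption: atoms are selected by sorting the deviations of a fixed
    superposition; no re-superposition of the core is attempted here.
    """
    kept = sorted(deviations)[: max(1, len(deviations) // 2)]
    return rms(kept)

# Illustrative deviations in angstroms; one atom fits very poorly.
deviations = [0.5, 0.6, 0.4, 0.5, 9.0, 0.7]
full = rms(deviations)
core = trimmed_rms(deviations)
```

In this toy case the single outlier dominates the untrimmed value, while the trimmed value reflects the well-fitting core.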
Composition
The amino acid composition of each fold allows us to see which folds are most biased towards particular amino acids. We use various levels of the Astral clustering of the SCOP sequences to arrive at the composition (31).
Expression
Three techniques are frequently used to obtain genome-wide gene expression data. They are Affymetrix oligonucleotide gene chips, Serial Analysis of Gene Expression (SAGE) and cDNA microarrays (43–45). SAGE and, to some degree, gene chips measure the absolute expression levels (in units of mRNA transcripts per cell), while microarrays are used to obtain the expression level changes of a given open reading frame (ORF) as the ratio to a reference state.
A main motivation for expression experiments is often to study protein function and to characterize the functions of unannotated genes. However, this does not preclude relating other attributes of proteins, such as their structure, to expression data. For instance, it may be that highly expressed protein folds share a number of characteristics, such as a particularly stable architecture or a composition biased in a certain way. Relating expression and structure involved matching the PDB structure database against the genome and then summing the expression levels of all ORFs containing the same fold. However, if one is trying to find genes expressed in a particular metabolic state, PartsList is not the right place to look.
Absolute. The absolute expression level data gives a good representation of highly expressed genes. All the experiments currently indexed by PartsList are for yeast. For each experiment, in addition to ranking based on the average expression level for a fold, we also consider the composition in the transcriptome and the enrichment of this value relative to its composition in the genome. Transcriptome composition is the fractional composition of a fold (relative to that for other folds) in the mRNA population. In other words, it is the composition of a fold in the genome weighted by the expression levels of each of the genes. The enrichment is the relative change between the composition of a fold in the genome and the transcriptome. Further details are provided in previous reports (46,47). We report values for experiments from a number of different labs (43,48–50) and a single reference set that merges and scales all the expression sets together.
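The three quantities defined above (genome composition, transcriptome composition and enrichment) can be illustrated with a small Python sketch. Several things here are assumptions rather than details from the text: each gene is mapped to exactly one fold, the enrichment is taken as (transcriptome − genome) / genome, and the function name and toy values are hypothetical. The precise definitions are in the cited reports (46,47).

```python
def compositions(gene_fold, expression):
    """Return {fold: (genome_fraction, transcriptome_fraction, enrichment)}.

    gene_fold:  {gene: fold}, assuming one fold per gene for simplicity.
    expression: {gene: absolute expression level, e.g. transcripts per cell}.
    """
    genome, transcriptome = {}, {}
    for gene, fold in gene_fold.items():
        genome[fold] = genome.get(fold, 0) + 1
        transcriptome[fold] = transcriptome.get(fold, 0.0) + expression[gene]
    n_genes = sum(genome.values())
    total_expr = sum(transcriptome.values())
    result = {}
    for fold in genome:
        g = genome[fold] / n_genes             # composition in the genome
        t = transcriptome[fold] / total_expr   # expression-weighted composition
        result[fold] = (g, t, (t - g) / g)     # assumed enrichment formula
    return result

# Toy data: two folds with equal gene counts but unequal expression.
gene_fold = {"g1": "fold_A", "g2": "fold_A", "g3": "fold_B", "g4": "fold_B"}
expression = {"g1": 1.0, "g2": 1.0, "g3": 10.0, "g4": 8.0}
table = compositions(gene_fold, expression)
```

Here both folds make up half of the genome, but the second accounts for most of the mRNA population and so has a positive enrichment, while the first has a negative one.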
Ratio. The expression ratio data shows the most actively changing genes over a period of time (e.g. cell cycle) or based on a change in states (e.g. healthy versus diseased). Source data for expression ratios are the fluctuations in expression of a certain fold over a period of time (e.g. the cell cycle). These are measured in terms of standard deviations for a particular fold, which is calculated from the average of the expression ratio standard deviations for each gene that matches the fold structure.
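The fold-level fluctuation score described above amounts to averaging per-gene standard deviations. A minimal Python sketch follows; the use of the population standard deviation, the one-fold-per-gene mapping and the function name are assumptions made for illustration only.

```python
import statistics

def fold_fluctuation(gene_fold, ratio_series):
    """Average, per fold, of the standard deviation of each gene's ratios.

    gene_fold:    {gene: fold}, assuming one fold per gene.
    ratio_series: {gene: [expression ratio at each time point or condition]}.
    """
    per_fold = {}
    for gene, fold in gene_fold.items():
        # Population standard deviation; the choice of estimator is an assumption.
        sd = statistics.pstdev(ratio_series[gene])
        per_fold.setdefault(fold, []).append(sd)
    return {fold: statistics.mean(sds) for fold, sds in per_fold.items()}

# Toy data: g1 is constant, g2 and g3 fluctuate.
gene_fold = {"g1": "fold_A", "g2": "fold_A", "g3": "fold_B"}
ratio_series = {
    "g1": [1.0, 1.0, 1.0, 1.0],
    "g2": [0.0, 2.0, 0.0, 2.0],
    "g3": [1.0, 3.0],
}
fluctuation = fold_fluctuation(gene_fold, ratio_series)
```

Folds can then be ranked by this value, with the highest scores marking the folds whose genes change most over the experiment.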
Interactions
Information on protein–protein interactions is derived from surveys of the contacts in the PDB and the experiments in yeast.
PDB. To determine which domains interact with one another in the PDB entries indexed by SCOP (9580 at the time of the analysis), the coordinates of each domain were parsed to check whether there are five or more contacts within 5 Å to another domain, as described by Park et al. (51). The distance of 5 Å was chosen, as this is a conservative threshold for interaction between two atoms, where the atoms are either Cαs or atoms in side chains. The five-contact threshold was chosen to make sure the contact between the domains was reasonably extensive. (In fact, the number of domains identified as contacting each other hardly changed for thresholds between one and 10 contacts and 3 and 6 Å distances.)
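The contact criterion above (five or more contacts within 5 Å) can be sketched as a brute-force distance check in Python. The coordinate representation, the function names and the use of an inclusive cutoff are assumptions for illustration; the actual procedure is the one described by Park et al. (51).

```python
import math

def count_contacts(atoms_a, atoms_b, cutoff=5.0):
    """Number of atom pairs, one from each domain, within `cutoff` angstroms."""
    return sum(
        1
        for a in atoms_a
        for b in atoms_b
        if math.dist(a, b) <= cutoff  # inclusive cutoff is an assumption
    )

def domains_interact(atoms_a, atoms_b, cutoff=5.0, min_contacts=5):
    """True when the two domains share at least `min_contacts` contacts."""
    return count_contacts(atoms_a, atoms_b, cutoff) >= min_contacts

# Toy coordinates (x, y, z) in angstroms; not taken from any PDB entry.
domain_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
domain_b = [(0.0, 4.0, 0.0), (1.0, 4.0, 0.0)]
domain_far = [(0.0, 50.0, 0.0)]

near = domains_interact(domain_a, domain_b)
far = domains_interact(domain_a, domain_far)
```

Exposing `cutoff` and `min_contacts` as parameters makes it straightforward to repeat the check over the ranges of thresholds mentioned in the text.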
Yeast. The interactions between structural domains in the yeast genome were obtained by assigning protein structures to the yeast proteins using PSI-BLAST and PDB-ISL as described by Teichmann et al. (52,53). Assigned structural domains contained within the same ORF that were adjacent within 30 amino acids were assumed to interact. (This is generally true of the domains in the PDB, with a few exceptions, such as domains in transcription factors like adjacent zinc fingers or variable and constant immunoglobulin domains.) To derive intermolecular interactions in the yeast genome we combined three sets of protein–protein interactions: (i) the MIPS web pages on complexes and pairwise interactions (February 2000) (9), (ii) the global yeast two-hybrid experiments by Uetz et al. (54) and (iii) large-scale yeast two-hybrid experiments by Ito et al. (55). Out of all these pairwise interactions known for yeast ORFs, there is a limited set in which both partners are completely covered by one structural domain (to within 100 residues). This set of protein pairs was used to derive a further set of domain contacts in the yeast genome as described by Park et al. (51).
Motions
Information on motions is from the Macromolecular Motions Database (56,57). We consider a set of approximately 4400 motions automatically identified by examining the PDB and a smaller, manually curated set of motions. For each fold we determine the number of entries in the motions database that are associated with it. Then, over this set of motions we either average or take the maximum value of a number of relevant statistics describing the motion, i.e. the maximum Cα displacement in the motion, the overall rotation of the motion and the energy difference between the start and endpoints of structures involved in the motion.
Transposon sensitivity
Ross-Macdonald et al. (58) developed a procedure for randomly inserting transposons throughout the yeast genome. They investigated the phenotypes resulting from each insertion in 20 different growth conditions in comparison to wild-type growth. The experiment for each insertion in each condition was repeated several times. If the observed phenotype of the mutant deviates from the average wild-type phenotype, this could be either because of a real effect of the mutation on the cell or it could just be a typical variation of the phenotype of wild-type cells. We developed a P value score that measures the degree of confidence that the observed phenotype results from randomly changing wild-type cells. The negative logarithm of this P value rises with the significance of the phenotype measurements and can be understood as the sensitivity of the cell to mutations in a particular gene. We calculated a value for the transposon sensitivity for protein folds by geometrically averaging the P values of the associated genes.
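The averaging step can be illustrated as follows. This sketch assumes that the fold-level score is reported as the negative base-10 logarithm of the geometric mean of the gene-level P values, which equals the arithmetic mean of the individual negative logarithms. The base of the logarithm, the combination of the two steps and the function name are assumptions, not details given in the text.

```python
import math

def fold_sensitivity(p_values):
    """Negative log10 of the geometric mean of gene-level P values.

    All P values must be strictly positive. Averaging the logarithms is
    equivalent to taking the logarithm of the geometric mean.
    """
    mean_log = sum(math.log10(p) for p in p_values) / len(p_values)
    return -mean_log

# Toy P values for two genes that share a fold.
sensitivity = fold_sensitivity([0.01, 0.0001])
geometric_mean = 10 ** (-sensitivity)
```

Working in logarithms avoids numerical underflow when many small P values are combined.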
Miscellaneous
The miscellaneous section includes any information that does not fit into a major category. It includes: number of pseudogenes in worm associated with a fold (59), total number of functions and number of enzymatic functions associated with a fold (60), the average length of the sequence, and the year the domain structure was originally determined.
Errors
The above data, of course, have systematic and statistical errors. For some attributes we expect considerably smaller errors than others. For instance, we expect the numbers related to the sequence composition of different folds (e.g. the Ala composition) to be particularly accurate, since the only factors affecting these are errors in the underlying sequence of the protein and in the SCOP fold classification itself. In contrast, there is a considerable known rate of false positives associated with the global protein interaction experiments using the two-hybrid method (54,61), and this suggests statistics based on yeast interactions may be somewhat less accurate. Furthermore, the precise values for the rankings in PartsList are also contingent on the evolving contents of various databanks. Thus, over time as more structures are determined, one should expect statistics such as the most common folds in a particular genome to change somewhat. A very detailed discussion of the expected errors in the various quantities in PartsList is available on the web from the help section.
RANKING ALL THE FOLDS BASED ON EXTRINSIC INFORMATION |
---|
Correlator
Correlator uses linear and rank correlation coefficients to measure the association between two selected attributes. The difference between these two types of correlation coefficients is that the former relates to the actual values while the latter relates to the ranks among the samples. The interpretation of the linear correlation coefficient can be completely meaningless if the joint probability distribution of the variables is too different from a binormal distribution. This is the reason for introducing the rank correlation coefficient. Correlator provides both coefficients for the selected quantities. In most cases, they are close. For example, the linear correlation coefficient and rank correlation coefficient for fold occurrence in genomes Archaeoglobus fulgidus and Methanococcus jannaschii (Aful and Mjan) are 0.88 and 0.77, respectively, while the corresponding coefficients for fold occurrence in A.fulgidus and Saccharomyces cerevisiae (Scer) are 0.52 and 0.48, respectively. This is not surprising, as the first two genomes are both Archaeal, while in the second comparison one genome belongs to Archaea (Aful) and another to Eucarya (Scer). As one would expect, the fold occurrences for the more closely related genomes have a higher correlation.
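The distinction between the two coefficients can be seen in a short, dependency-free Python sketch. It computes the standard Pearson (linear) coefficient and the Spearman rank coefficient, the latter as the Pearson coefficient of the ranks. Whether Correlator uses exactly these estimators and this tie handling is an assumption, and the toy occurrence values are invented for illustration.

```python
import math

def pearson(x, y):
    """Linear (Pearson) correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Rank correlation: the Pearson coefficient of the two rank lists."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy fold occurrences in two genomes; one fold is a strong outlier.
occ_a = [1, 2, 3, 4, 100]
occ_b = [2, 3, 5, 9, 10]
linear = pearson(occ_a, occ_b)
rank = spearman(occ_a, occ_b)
```

In this toy example the two lists are ordered identically, so the rank coefficient is 1, while the single very large value pulls the linear coefficient well below 1. This is the kind of sensitivity to the underlying distribution that motivates reporting both.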
In addition to the coefficients, Correlator displays a scatter plot to aid in visualizing the correlation between the selected fold attributes. Figure 2C shows the scatter plot for the second example above: the correlation between occurrences in the A.fulgidus and S.cerevisiae genomes. One can easily observe that some folds appear frequently in Scer but seldom or never in A.fulgidus. By clicking on a point on the plot, one obtains detailed information about the corresponding fold. This kind of plot can reveal interesting folds with certain relationships between attributes even though in some cases the overall correlation coefficients between the two attributes are almost zero (i.e. no correlation).
POWER-LAW BEHAVIOR OF MANY DISPARATE ATTRIBUTES |
---|
$$F(V) = aV^{-b}$$
where a and b are constants. Note that F(V) is just the number of folds with an attribute value V divided by the total number of folds and that on a log–log plot this function becomes a straight line with slope −b. Often the attribute value V itself reflects the occurrence of a fold in a particular context, e.g. V could be the number of times a given fold occurs in a particular genome. Quantities that follow a power-law-like behavior are often said to have a form like that of Zipf's law, which often occurs in the analysis of word frequency in documents (62).
Thus far, this general conclusion is described in language sufficiently abstract to accommodate the many different types of attributes in PartsList. A few concrete examples will make the conclusion clearer. For instance, we find that in genomes most folds occur only once while there are only a very few folds that occur many times. An illustration is shown in the upper panel of Figure 5 for E.coli. The x-axis is the number of times a particular fold occurs in the E.coli genome and the y-axis shows the number of distinct folds that have the same occurrence. (This is normalized by dividing by the total number of folds so that the maximum value on the y-axis is 100%.) From the log–log format of the plot, one can immediately see that the falloff obeys a power-law, with a few folds occurring many times and most only once or twice. The middle panel shows other attributes that display similar power-law-like behavior, including expression level in yeast, number of functions associated with a fold, and number of protein–protein interactions found in the PDB. Of course, not all attributes follow a power-law. The lower panel shows two of these less typical attributes: Asp composition in a fold and average number of residues involved in a motion.
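One simple way to check for this behavior is to tabulate the fraction of folds at each attribute value and fit a straight line on log–log axes. The Python sketch below does this with ordinary least squares, writing the fitted form as F = a·V^(−b). This is an illustrative estimator chosen for brevity, not necessarily how the fits in the figure were obtained; the function names and the synthetic data are assumptions, and attribute values must be positive.

```python
import math
from collections import Counter

def attribute_distribution(values):
    """F(V): fraction of folds that have each attribute value V."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items()}

def fit_power_law(distribution):
    """Least-squares line through (log V, log F).

    Returns (a, b) for the form F = a * V**(-b), so b is minus the slope.
    """
    xs = [math.log10(v) for v in distribution]
    ys = [math.log10(f) for f in distribution.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        / sum((x - mx) ** 2 for x in xs)
    )
    intercept = my - slope * mx
    return 10 ** intercept, -slope

# Synthetic occurrences: 64 folds occur once, 16 twice, 4 four times, 1 eight times.
occurrences = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
dist = attribute_distribution(occurrences)
a, b = fit_power_law(dist)
```

The synthetic counts were constructed to fall off exactly as the inverse square of the occurrence, so the fit recovers an exponent of 2. Real attribute data are noisier, especially in the sparsely populated tail, and a plain least-squares fit should be read as a rough guide only.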
TRADITIONAL SINGLE-STRUCTURE REPORTS |
---|
Occurrence report. This allows users to see the number of times that a fold corresponding to the queried protein structure occurs in various genomes. This gives a phylogenetic profile of the occurrence of a particular fold in 20 genomes, similar in spirit to the fold patterns discussed earlier (19).
Function report. This summarizes the functional classification of the queried PDB structure. It merges a number of functional classifications, including FlyBase (10), ENZYME (63), GenProtEC (64) and MIPS (9). Our approach to functional classification is described in a number of previous publications (e.g. 39,60). In short, we used pairwise comparison to cross-reference the PDB domains against SWISS-PROT. Depending on whether they had an Enzyme Commission (EC) number, we were able to divide all entries into enzymes and non-enzymes, a division that represents the highest level in our classification. (For the enzyme category, we only transferred EC numbers to those SCOP domains with a one-to-one match to a SWISS-PROT enzyme.) In the absence of an EC-type classification for non-enzymes, we assigned functions to non-enzymatic SCOP domains according to Ashburner's original classification of Drosophila protein functions. This classification is derived from a controlled vocabulary of fly terms, is available on the web and is loosely connected with the FlyBase database (10). It has recently been superseded by the GO functional classification (65). MIPS and GenProtEC classifications to SCOP domains were assigned based on sequence comparisons to classified yeast and E.coli ORFs, respectively. The SCOP domain most closely matching each ORF classified in MIPS or GenProtEC was assigned the corresponding MIPS or GenProtEC function number. Only matches of 80% sequence identity were considered.
Alignment report. This gives detailed information on structural alignments available between pairs of protein domains associated with a fold. A pair viewer is provided, which gives many key statistics about the alignment (e.g. RMS, sequence identity, number of fit atoms, etc.), in addition to a listing of the actual aligned residues. Both HTML and parseable text views are available.
Interaction report. This shows all the pairs of protein–protein interactions associated with a fold based on either the PDB survey or yeast genome data.
Rank report. This highlights the top-five and bottom-five ranked attributes associated with a fold. It also shows all attributes ordered by the rank they are given in that fold. Thus, for a particular fold, it highlights the outlier attributes, i.e. the attributes with respect to which that fold most stands out. The rank report could be used, for example, by a protein engineer interested in determining the distinctive properties of a structure they are working on.
PDB report. This summarizes all the information concerning a domain or a representative PDB structure. It includes: (i) a summary of the occurrence report; (ii) a summary of the alignments available for structures in the same superfamily and fold; (iii) a description of motions and motion-movies associated with the structure in the Macromolecular Motions database (56,57); (iv) a summary of the merged functional classification; (v) a core structure, if available (66); (vi) ranking tables of the queried structure in various datasets; and (vii) a summary of the interactions report. Figure 4 shows a sample PDB report for structure 1AMA.
Fold report. This lists all the SCOP domains associated with the queried fold and provides information (similar to that in the PDB report) that is common to all, i.e. genome occurrence, alignment report and rankings.
DISCUSSION |
---|
We anticipate that PartsList will have a relatively stable number of entries (i.e. folds), while for each entry the attributes that describe it will increase over time. In the future, as experiments yield new information, PartsList will include more and more attributes. In particular, we anticipate that much new expression information will be incorporated. We also plan to develop a form to allow automatic submission of new ranking attributes and to encourage people to submit any ranking information.
ACKNOWLEDGEMENTS |
---|
FOOTNOTES |
---|
REFERENCES |
---|
2 Brenner,S.E., Hubbard,T., Murzin,A. and Chothia,C. (1995) Gene duplications in H. influenzae. Nature, 378, 140.
3 Wolf,Y.I., Grishin,N.V. and Koonin,E.V. (2000) Estimating the number of protein folds and families from complete genome data. J. Mol. Biol., 299, 897–905.
4 The C. elegans Sequencing Consortium (1998) Genome sequence of the nematode C. elegans: a platform for investigating biology. Science, 282, 2012–2018.
5 Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.
6 Laskowski,R.A., Hutchinson,E.G., Michie,A.D., Wallace,A.C., Jones,M.L. and Thornton,J.M. (1997) PDBsum: a web-based database of summaries and analyses of all PDB structures. Trends Biochem. Sci., 22, 488–490.
7 Wang,Y., Addess,K.J., Geer,L., Madej,T., Marchler-Bauer,A., Zimmerman,D. and Bryant,S.H. (2000) MMDB: 3D structure data in Entrez. Nucleic Acids Res., 28, 243–245.
8 Ball,C.A., Dolinski,K., Dwight,S.S., Harris,M.A., Issel-Tarver,L., Kasarskis,A., Scafe,C.R., Sherlock,G., Binkley,G., Jin,H., Kaloper,M., Orr,S.D., Schroeder,M., Weng,S., Zhu,Y., Botstein,D. and Cherry,J.M. (2000) Integrating functional genomic information into the Saccharomyces genome database. Nucleic Acids Res., 28, 77–80.
9 Frishman,D., Heumann,K., Lesk,A. and Mewes,H.W. (1998) Comprehensive, comprehensible, distributed and intelligent databases: current status. Bioinformatics, 14, 551–561.
10 The FlyBase Consortium. (1999) The FlyBase database of the Drosophila Genome Projects and community literature. Nucleic Acids Res., 27, 85–88.
11 Tatusov,R.L., Galperin,M.Y., Natale,D.A. and Koonin,E.V. (2000) The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res., 28, 33–36.
12 Aach,J., Rindone,W. and Church,G.M. (2000) Systematic management and analysis of yeast gene expression data. Genome Res., 10, 431–445.
13 Bader,G.D. and Hogue,C.W. (2000) BIND – a data specification for storing and describing biomolecular interactions, molecular complexes and pathways. Bioinformatics, 16, 465–477.
14 Xenarios,I., Rice,D.W., Salwinski,L., Baron,M.K., Marcotte,E.M. and Eisenberg,D. (2000) DIP: the database of interacting proteins. Nucleic Acids Res., 28, 289–291.
15 Benson,D.A., Karsch-Mizrachi,I., Lipman,D.J., Ostell,J., Rapp,B.A. and Wheeler,D.L. (2000) GenBank. Nucleic Acids Res., 28, 15–18.
16 Konopka,A.K. and Martindale,C. (1995) Noncoding DNA, Zipf's law, and language. Science, 268, 789.
17 Flam,F. (1994) Hints of a language in junk DNA. Science, 266, 1320.
18 Bornberg-Bauer,E. (1997) How are model protein structures distributed in sequence space? Biophys. J., 73, 2393–2403.
19 Gerstein,M. (1998) Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census. Proteins, 33, 518–534.
20 Gerstein,M. (1997) A structural census of genomes: comparing eukaryotic, bacterial and archaeal genomes in terms of protein structure. J. Mol. Biol., 274, 562–576.
21 Jeong,H., Tombor,B., Albert,R., Oltvai,Z.N. and Barabasi,A.L. (2000) The large-scale organization of metabolic networks. Nature, 407, 651–654.
22 Amaral,L.A.N., Scala,A., Barthelemy,M. and Stanley,H.E. (2000) Classes of small-world networks. Proc. Natl Acad. Sci. USA, 97, 11149–11152.
23 Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.
24 Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) CATH – a hierarchic classification of protein domain structures. Structure, 5, 1093–1108.
25 Holm,L. and Sander,C. (1996) Mapping the protein universe. Science, 273, 595–602.
26 Gibrat,J.F., Madej,T. and Bryant,S.H. (1996) Surprising similarities in structure comparison. Curr. Opin. Struct. Biol., 6, 337–385.
27 Madej,T., Gibrat,J.-F. and Bryant,S.H. (1995) Threading a database of protein cores. Proteins, 23, 356–369.
28 Bateman,A., Birney,E., Durbin,R., Eddy,S.R., Finn,R.D. and Sonnhammer,E.L.L. (1999) The Pfam protein families database. Nucleic Acids Res., 27, 260–262.
29 Henikoff,J.G., Greene,E.A., Pietrokovski,S. and Henikoff,S. (2000) Increased coverage of protein families with the blocks database servers. Nucleic Acids Res., 28, 228–230.
30 Schultz,J., Milpetz,F., Bork,P. and Ponting,C.P. (1998) SMART, a simple modular architecture research tool: identification of signaling domains. Proc. Natl Acad. Sci. USA, 95, 5857–5864.
31 Brenner,S.E., Koehl,P. and Levitt,M. (2000) The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res., 28, 254–256.
32 Lipman,D.J. and Pearson,W.R. (1985) Rapid and sensitive protein similarity searches. Science, 227, 1435–1441.
33 Altschul,S.F. and Koonin,E.V. (1998) Iterated profile searches with PSI-BLAST – a tool for discovery in protein databases. Trends Biochem. Sci., 23, 444–447.
34 Brenner,S., Chothia,C. and Hubbard,T. (1998) Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl Acad. Sci. USA, 95, 6073–6078.
35 Gerstein,M. and Levitt,M. (1997) A structural census of the current population of protein sequences. Proc. Natl Acad. Sci. USA, 94, 11911–11916.
36 Teichmann,S., Chothia,C. and Gerstein,M. (1999) Advances in structural genomics. Curr. Opin. Struct. Biol., 9, 390–399.
37 Gerstein,M., Lin,J. and Hegyi,H. (2000) Protein folds in the worm genome. Pac. Symp. Biocomput., 5, 30–42.
38 Lin,J. and Gerstein,M. (2000) Whole-genome trees based on the occurrence of folds and orthologs: implications for comparing genomes on different levels. Genome Res., 10, 808–818.
39 Wilson,C.A., Kreychman,J. and Gerstein,M. (2000) Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J. Mol. Biol., 297, 233–249.
40 Levitt,M. and Gerstein,M. (1998) A unified statistical framework for sequence comparison and structure comparison. Proc. Natl Acad. Sci. USA, 95, 5913–5920.
41 Gerstein,M. and Levitt,M. (1998) Comprehensive assessment of automatic structural alignment against a manual standard, the Scop classification of proteins. Protein Sci., 7, 445–456.
42 Gerstein,M. (1998) How representative are the known structures of the proteins in a complete genome? A comprehensive structural census. Fold. Des., 3, 497–512.
43 Velculescu,V.E., Zhang,L., Zhou,W., Vogelstein,J., Basrai,M.A., Bassett,D.E.,Jr, Hieter,P., Vogelstein,B. and Kinzler,K.W. (1997) Characterization of the yeast transcriptome. Cell, 88, 243–251.
44 Brown,P.O. and Botstein,D. (1999) Exploring the new world of the genome with DNA microarrays. Nat. Genet., 21, 33–37.
45 Lipshutz,R.J., Fodor,S.P., Gingeras,T.R. and Lockhart,D.J. (1999) High density synthetic oligonucleotide arrays. Nat. Genet., 21, 20–24.
46 Jansen,R. and Gerstein,M. (2000) Analysis of the yeast transcriptome with structural and functional categories: characterizing highly expressed proteins. Nucleic Acids Res., 28, 1481–1488.
47 Gerstein,M. and Jansen,R. (2000) The current excitement in bioinformatics-analysis of whole-genome expression data: how does it relate to protein structure and function? Curr. Opin. Struct. Biol., 10, 574–584.
48 Jelinsky,S.A. and Samson,L.D. (1999) Global response of Saccharomyces cerevisiae to an alkylating agent. Proc. Natl Acad. Sci. USA, 96, 1486–1491.
49 Holstege,F.C., Jennings,E.G., Wyrick,J.J., Lee,T.I., Hengartner,C.J., Green,M.R., Golub,T.R., Lander,E.S. and Young,R.A. (1998) Dissecting the regulatory circuitry of a eukaryotic genome. Cell, 95, 717–728.
50 Roth,F.P., Hughes,J.D., Estep,P.W. and Church,G.M. (1998) Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat. Biotechnol., 16, 939–945.
51 Park,J., Lappe,M. and Teichmann,S.A. (2001) Mapping protein family interactions: intra- and intermolecular interactions repertoires are distinct. J. Mol. Biol., 307, 929–939.
52 Teichmann,S., Chothia,C., Church,G. and Park,J. (2000) Fast assignment of protein structures to sequences using the intermediate sequence library PDB-ISL. Bioinformatics, 16, 117–124.
53 Teichmann,S.A., Park,J. and Chothia,C. (1998) Structural assignments to the Mycoplasma genitalium proteins show extensive gene duplications and domain rearrangements. Proc. Natl Acad. Sci. USA, 95, 14658–14663.
54 Uetz,P., Giot,L., Cagney,G., Mansfield,T.A., Judson,R.S., Knight,J.R., Lockshon,D., Narayan,V., Srinivasan,M., Pochart,P., Qureshi-Emili,A., Li,Y., Godwin,B., Conover,D., Kalbfleisch,T., Vijayadamodar,G., Yang,M., Johnston,M., Fields,S. and Rothberg,J.M. (2000) A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae. Nature, 403, 623–627.
55 Ito,T., Tashiro,K., Muta,S., Ozawa,R., Chiba,T., Nishizawa,M., Yamamoto,K., Kuhara,S. and Sakaki,Y. (2000) Toward a protein–protein interaction map of the budding yeast: a comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proc. Natl Acad. Sci. USA, 97, 1143–1147.
56 Gerstein,M. and Krebs,W. (1998) A database of macromolecular motions. Nucleic Acids Res., 26, 4280–4290.
57 Krebs,W. and Gerstein,M. (2000) The morph server: a standardized system for analyzing and visualizing macromolecular motions in a database framework. Nucleic Acids Res., 28, 1665–1675.
58 Ross-Macdonald,P., Coelho,P.S., Roemer,T., Agarwal,S., Kumar,A., Jansen,R., Cheung,K., Sheehan,A., Symoniatis,D., Umansky,L., Heidtman,M., Nelson,F.K., Iwasaki,H., Hager,K., Gerstein,M., Miller,P., Roeder,G.S. and Snyder,M. (1999) Large-scale analysis of the yeast genome by transposon tagging and gene disruption. Nature, 402, 413–418.
59 Harrison,P., Echols,N. and Gerstein,M. (2001) Digging for dead genes: an analysis of the characteristics of the pseudogene population in the C. elegans genome. Nucleic Acids Res., 29, 818–830.
60 Hegyi,H. and Gerstein,M. (1999) The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. J. Mol. Biol., 228, 147–164.
61 Schwikowski,B., Uetz,P. and Fields,S. (2000) A network of protein–protein interactions in yeast. Nat. Biotechnol., 18, 1257–1261.
62 Knuth,D. (1973) The Art of Computer Programming 3. Addison-Wesley, Reading, MA.
63 Bairoch,A. (1993) The ENZYME data bank. Nucleic Acids Res., 21, 3155–3156.
64 Riley,M. and Labedan,B. (1996) E. coli gene products: physiological functions and common ancestries. In Neidhardt,F., Curtiss,R.,III, Lin,E.C.C., Ingraham,J., Low,K.B., Magasanik,B., Reznikoff,W., Riley,M., Schaechter,M. and Umbarger,H.E. (eds), Escherichia coli and Salmonella: Cellular and Molecular Biology. ASM Press, Washington, DC, pp. 2118–2202.
65 Ashburner,M., Ball,C.A., Blake,J.A., Botstein,D., Butler,H., Cherry,J.M., Davis,A.P., Dolinski,K., Dwight,S.S., Eppig,J.T., Harris,M.A., Hill,D.P., Issel-Tarver,L., Kasarskis,A., Lewis,S., Matese,J.C., Richardson,J.E., Ringwald,M., Rubin,G.M. and Sherlock,G. (2000) Gene ontology: tool for the unification of biology. Nat. Genet., 25, 25–29.
66 Schmidt,R.B., Gerstein,M. and Altman,R.B. (1997) LPFC: an internet library of protein family core structures. Protein Sci., 6, 246–248.
67 Drawid,A., Jansen,R. and Gerstein,M. (2000) Genome-wide analysis relating expression level with protein subcellular localization. Trends Genet., 16, 426–429.
68 Park,J., Karplus,K., Barrett,C., Hughey,R., Haussler,D., Hubbard,T. and Chothia,C. (1998) Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol., 284, 1201–1210.
69 Spellman,P.T., Sherlock,G., Zhang,M.Q., Iyer,V.R., Anders,K., Eisen,M.B., Brown,P.O., Botstein,D. and Futcher,B. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, 3273–3297.
70 DeRisi,J.L., Iyer,V.R. and Brown,P.O. (1997) Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278, 680–686.
71 Chu,S., DeRisi,J., Eisen,M., Mulholland,J., Botstein,D., Brown,P.O. and Herskowitz,I. (1998) The transcriptional program of sporulation in budding yeast. Science, 282, 699–705.
72 Richmond,C.S., Glasner,J.D., Mau,R., Jin,H. and Blattner,F.R. (1999) Genome-wide expression profiling in Escherichia coli K-12. Nucleic Acids Res., 27, 3821–3835.
73 Wixon,J., Blaxter,M., Hope,I., Barstead,R. and Kim,S. (2000) Caenorhabditis elegans. Yeast, 17, 37–42.