Proteomics: Net Profit from Integrating Interactomes

Mark Gerstein, Ning Lan & Ronald Jansen

Molecular Biophysics & Biochemistry Dept., Yale University, 266 Whitney Ave., New Haven, CT 06520.

Email: Mark.Gerstein@yale.edu

With genome sequences as intellectual inspiration and practical scaffolds, scientists are increasingly performing experiments on all genes in a genome. [HN1] The integration of the resulting genome-wide information into useful definitions of gene function constitutes a principal challenge for post-genomic biology. Exactly what form such definitions will take is still an open question. Comprehensive networks of protein-protein interactions, interactomes, provide a valuable way of circumscribing and partially defining gene function. [HN2]

On page ??? of this issue, Tong et al. (1) describe a systematic approach for identifying interaction networks for peptide-recognition domains. They break new ground in the way they combine “orthogonal” datasets. Specifically, they intersect two different interactomes. The first is derived from screening phage-display peptide libraries to find consensus sequences that bind particular peptide-recognition domains. The resulting network connects proteins containing recognition domains to those containing the consensus. It partially defines binding sites in some of the proteins and represents a novel use of phage-display technology. The second network comes from experimentally testing each peptide-recognition module for association with possible protein-binding partners using the two-hybrid approach. [HN3] Tong et al. apply their approach to determine interacting partners for SH3 domains in yeast. [HN4] These domains make good targets because of their prevalence and involvement in a number of important biological processes, from cytoskeleton reorganization to signal transduction.

The power of Tong et al.'s approach is manifest in interpreting large genomic datasets, particularly in reducing their noise. One of the fallacies in dealing with genomic datasets is ascribing too much meaning to individual datapoints. Many datasets (e.g. expression) contain so much noise that it can be difficult to draw reliable conclusions for specific genes. However, they still contain much useful information in a statistical sense, in terms of broad trends. To get at this, one must aggregate data. First, and most simply, one can combine replicates of an experiment, but this does not remove systematic errors. Second, in somewhat more elaborate fashion, one can collect many individual measurements on different proteins into aggregate "proteomic classes" (e.g. functional categories) and compare these (2). [HN5] Finally, Tong et al.'s study points to perhaps the most powerful approach: interrelating and integrating orthogonal (i.e. fundamentally different) information. In the abstract one can easily demonstrate that combining independent datasets results in a lower error rate overall. For instance, combining three independent binary-type datasets, each with an error rate of 10% (for both false positives and false negatives), reduces the overall error rate to 2.8%, again for both positives and negatives (5). [HN6] Moreover, interrelating two different types of whole-genome data also enables one to discover non-obvious and potentially significant relationships -- e.g. between expression and chromosomal positioning or subcellular localization (3).
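The 2.8% figure can be checked with a short calculation. The sketch below assumes one simple combination rule, a majority vote over independent datasets that share the same error rate; the function name is illustrative, and the published estimate itself is described as coming from a Bayesian network model (5), so this is only a consistency check and not the authors' procedure.

```python
from math import comb


def majority_vote_error(n_datasets: int, error_rate: float) -> float:
    """Probability that a strict majority of n independent binary calls is wrong.

    Assumes every dataset has the same error rate and that errors are
    independent. Intended for odd n, where a tie cannot occur.
    """
    k_min = n_datasets // 2 + 1  # smallest number of wrong calls forming a majority
    return sum(
        comb(n_datasets, k) * error_rate**k * (1 - error_rate) ** (n_datasets - k)
        for k in range(k_min, n_datasets + 1)
    )


if __name__ == "__main__":
    # Three datasets, each wrong 10% of the time:
    # 3 * 0.1^2 * 0.9 + 0.1^3 = 0.027 + 0.001 = 0.028
    print(f"{majority_vote_error(3, 0.10):.3f}")
```

The same function shows how slowly the benefit accrues when the individual datasets are noisier, since the binomial terms shrink much less quickly as the error rate approaches 50%.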

There have been a number of earlier attempts at interrelating information from different genomic datasets, many of them starting from expression experiments. For instance, expression data were initially analyzed by a variety of supervised and unsupervised methods (e.g., hierarchical trees, k-means, self-organizing maps, and support-vector machines) and compared to functional categories (4). [HN5] They were also interrelated with transcription-factor binding sites, protein families, protein-protein interactions, and protein abundance (2,5,6) [HN7]. In a shorthand sense, much of this can be thought of as interrelating the transcriptome with other "omes" -- e.g. proteome, translatome, secretome, and interactome (2) [HN8].

There are considerably fewer examples of the synthesis of more than two types of whole-genome data [HN9]. One initial attempt combined expression correlations, phylogenetic profiles and patterns of domain fusion to predict protein function (7). A Bayesian framework integrated expression, essentiality, and sequence motif data for the prediction of protein subcellular localization (8). Tong et al.'s strategy of overlapping interactomes presents a new type of synthesis. It is particularly effective in that their two datasets are orthogonal in many respects. Phage-display is based on in vitro binding of short peptides, whereas two-hybrid uses in vivo binding and full-length proteins. Moreover, the phage-display network is computationally predicted but uses relatively unambiguous consensus sequences, whereas the two-hybrid one is experimentally derived but suffers from appreciable false positives (9).

From a data-mining standpoint, the heterogeneous character and variable quality of whole-genome information make integration tricky. Consider, in a general sense, combining "orthogonal" interactome datasets, as attempted by Tong et al. How might one proceed formally? As indicated in the top figure, there are two extremes. At one extreme, the datasets have low false-negative but high false-positive error rates. That is, each experiment almost never misses real interactions but also finds many spurious ones. In this situation the benefit of integration comes from intersection: only interactions common to all datasets are accepted, thus lowering the combined error rate. Tong et al.'s approach fits this to some degree. At the other extreme are datasets with few false positives but low coverage of the space of interactions. The benefit of integration then comes from the union: any interaction found in at least one dataset is accepted. An earlier interactome analysis followed this to some degree (10). [HN10]
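The two extremes can be illustrated with plain set operations on interaction lists. The following is a minimal sketch on invented toy data: the protein labels P1 to P4, the four small networks, and the helper names are all hypothetical and are not taken from Tong et al. or from any real dataset.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

Edge = FrozenSet[str]


def as_network(pairs: Iterable[Tuple[str, str]]) -> Set[Edge]:
    """Store each undirected interaction as a frozenset so that A-B equals B-A."""
    return {frozenset(pair) for pair in pairs}


def intersect_networks(networks: List[Set[Edge]]) -> Set[Edge]:
    """Accept only interactions reported by every dataset."""
    return set.intersection(*networks)


def union_networks(networks: List[Set[Edge]]) -> Set[Edge]:
    """Accept any interaction reported by at least one dataset."""
    return set.union(*networks)


def score(predicted: Set[Edge], gold: Set[Edge]) -> Tuple[float, float]:
    """Return (coverage of the gold standard, fraction of predictions that are spurious)."""
    true_positives = predicted & gold
    coverage = len(true_positives) / len(gold)
    spurious = 1 - len(true_positives) / len(predicted) if predicted else 0.0
    return coverage, spurious


# Hypothetical gold standard of known associations.
gold = as_network([("P1", "P2"), ("P2", "P3"), ("P3", "P4")])

# Extreme 1: complete but noisy datasets (few false negatives, many false positives).
noisy_a = as_network([("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P1", "P4"), ("P2", "P4")])
noisy_b = as_network([("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P1", "P3")])

# Extreme 2: clean but sparse datasets (few false positives, low coverage).
sparse_a = as_network([("P1", "P2")])
sparse_b = as_network([("P3", "P4")])

if __name__ == "__main__":
    print("noisy, intersection:", score(intersect_networks([noisy_a, noisy_b]), gold))
    print("noisy, union:       ", score(union_networks([noisy_a, noisy_b]), gold))
    print("sparse, union:      ", score(union_networks([sparse_a, sparse_b]), gold))
    print("sparse, intersection:", score(intersect_networks([sparse_a, sparse_b]), gold))
```

In this toy case intersection removes every spurious link from the noisy pair without losing coverage, whereas for the sparse pair only the union recovers a useful fraction of the known associations.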

In most practical situations, the optimal way to integrate datasets is somewhere between these extremes. The task is to combine datasets with varying error rates and coverage. Accordingly, the rules for identifying positives become more complicated. Instead of simple unions or intersections, different combinations of positive and negative signals from the datasets should be considered, taking into account their relative false-positive and false-negative rates.
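One way to weigh positive and negative signals by their error rates is to combine likelihood ratios under a conditional-independence (naive Bayes) assumption. The sketch below is illustrative only: the `Dataset` class, the function names, the example rates, and the prior are all assumptions introduced here, not values or methods from the studies cited.

```python
from dataclasses import dataclass
from math import exp, log
from typing import Dict, List


@dataclass(frozen=True)
class Dataset:
    """Error profile of one interaction dataset (rates must lie strictly between 0 and 1)."""
    name: str
    false_positive_rate: float  # P(reported | no true interaction)
    false_negative_rate: float  # P(not reported | true interaction)


def log_likelihood_ratio(ds: Dataset, reported: bool) -> float:
    """Log of P(observation | interaction) / P(observation | no interaction)."""
    if reported:
        return log((1 - ds.false_negative_rate) / ds.false_positive_rate)
    return log(ds.false_negative_rate / (1 - ds.false_positive_rate))


def posterior_probability(
    datasets: List[Dataset], observations: Dict[str, bool], prior: float
) -> float:
    """Posterior probability of a true interaction, assuming the datasets
    are conditionally independent given the truth."""
    log_odds = log(prior / (1 - prior))
    for ds in datasets:
        log_odds += log_likelihood_ratio(ds, observations[ds.name])
    return 1 / (1 + exp(-log_odds))


# Hypothetical datasets: one noisy but sensitive, one clean but sparse.
example_datasets = [
    Dataset("noisy_screen", false_positive_rate=0.30, false_negative_rate=0.05),
    Dataset("clean_screen", false_positive_rate=0.01, false_negative_rate=0.60),
]

if __name__ == "__main__":
    for obs in (
        {"noisy_screen": True, "clean_screen": False},
        {"noisy_screen": False, "clean_screen": True},
        {"noisy_screen": True, "clean_screen": True},
    ):
        print(obs, round(posterior_probability(example_datasets, obs, prior=0.01), 4))
```

With equal error rates and an even prior this rule reduces to a majority vote; with unequal rates, a single positive from a low-false-positive dataset can outweigh a positive from a noisy one, which is the behaviour that simple unions and intersections cannot express.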

The bottom figure provides a practical illustration of the power of interrelating genomic data for yeast. It shows the degree to which one can find protein-protein associations in known protein complexes (11) based on integrating, in stepwise fashion, increasing amounts of orthogonal genomic information. We start by considering associations that can be found from expression correlations over the cell cycle; then we incorporate those derived from a second but different microarray experiment, giving responses to knockouts (12). Finally, we add associations predicted from genomic measurements of essentiality and localization (8,11,13). As we integrate more information, the total number of correctly identified interactions rises (especially for the union of the predicted associations) [HN11]. Simultaneously, the error rate decreases. Moreover, if we focus just on the intersection of the predicted associations, the error rate falls even more.

A major future challenge will be devising uniform frameworks for integrating information, both from high-throughput and traditional biochemical approaches. One aspect of this will be developing better databases for storing and querying heterogeneous information. In particular, databases will need to be more precise in their treatment of errors and also interface better with journals. Another aspect will be developing data-mining approaches to operate on these databases, integrating many different genomic features into results pertinent to biological function. Genomic features can be of very different character (from hundreds of "Booleans" for interactions to tens of thousands of real-number vectors for expression timecourses), and a central issue in integration is determining how to weight each feature relative to the others. In this regard, some machine-learning techniques, such as Bayesian networks and decision trees, are quite powerful, while others (e.g. support-vector machines) are more problematic.

Finally, we will also need to come up with a more systematic definition of gene function, the ultimate aim of proteomic investigation. To many scientists, what constitutes "function" is a natural-language phrase or name, often in non-systematic terminology -- e.g. "ATPase" or "suppressor of white apricot" [HN12]. This approach is sufficient for single-molecule work but does not scale to the genomic level. More systematic attempts have been made to place proteins within a hierarchy of standard functional categories or to connect them into overlapping networks of varying types of association (11,14) [HN13]. These networks can obviously include protein-protein interactions, the subject of Tong et al.'s work. More broadly, they can include pathways, regulatory systems, and signaling cascades. How far are we able to go with this approach? Perhaps in the future the systematic combination of networks may provide for a truly rigorous definition of protein function [HN14].

Overlapping Nets.

(Top) Two different extremes in integrating interactomes. On the left, the combined network is the union of those with low false-positive but high false-negative rates, whereas the combined network on the right is the intersection of ones with low false-negative but high false-positive rates. Circles represent proteins; links, interactions; and dotted lines, known associations. (Thicker links indicate lower false-positive rates.) More effective rules for combining networks than union and intersection take into account the different error rates associated with each link type. (Bottom) How integrating progressively more orthogonal information identifies more and more associations (5). From the known complexes in yeast there are 8,250 protein-protein associations (11). The y-axis shows the percentage of these identified by disparate genomic data (i.e. coverage). The x-axis shows the progressive addition of genomic data. The first two bars represent the protein associations with the most significant expression correlation in two different microarray sets (12). The next two represent adding the associations predicted because both proteins were similarly essential for cell survival or had similar localization (8,13). The shading on the bars roughly indicates false-positive rates throughout the integration. While it is reasonable that associated components of complexes will have correlated expression and similar localization and "essentiality", this is only weakly predictive, generating many spurious positives. Consequently, the "weak-links" case on the top right applies, and one can see from the shading how intersection lowers the error rate.

References

1. A. Tong et al., Science Vol(?), page(?) (2002). (With interactions available from www.binddb.org.)

2. D. Greenbaum et al., Genome Res 11, 1463 (2001). R. Jansen, M. Gerstein, Nucleic Acids Res 28, 1481 (2000).

3. B. A. Cohen et al., Nat Genet 26, 183 (2000). A. Drawid et al., Trends Genet 16, 426 (2000).

4. P. Tamayo et al., Proc Natl Acad Sci U S A 96, 2907 (1999). M. B. Eisen et al., Proc Natl Acad Sci U S A 95, 14863 (1998). S. Tavazoie et al., Nat Genet 22, 281 (1999). M. Gerstein, R. Jansen, Curr Opin Struct Biol 10, 574 (2000). M. Brown et al., Proc Natl Acad Sci U S A 97, 262 (2000).

5. J. Qian et al., J Mol Biol 314, 1053 (2001). R. Jansen et al., Genome Res 12, 37 (2002). (These contain details of the approach in the figure and of the calculations, with more at genecensus.org/integrate/interactions.)

6. H. Ge et al., Nat Genet 29, 482 (2001). F. Roth et al., Nat Biotechnol 16, 939 (1998). S. Gygi et al., Mol Cell Biol 19, 1720 (1999). B. Futcher et al., Mol Cell Biol 19, 7357 (1999). Brazma et al., Genome Res 8, 1202 (1998).

7. E. Marcotte et al., Science 285, 751 (1999). E. Marcotte et al., Nature 402, 83 (1999).

8. A. Drawid, M. Gerstein, J Mol Biol 301, 1059 (2000).

9. T. Ito et al., Proc Natl Acad Sci U S A 98, 4569 (2001). P. Uetz et al., Nature 403, 623 (2000).

10. Schwikowski et al., Nat Biotechnol 18, 1257 (2000).

11. H. W. Mewes et al., Nucleic Acids Res 28, 37 (2000).

12. R. J. Cho et al., Mol Cell 2, 65 (1998). T. R. Hughes et al., Cell 102, 109 (2000).

13. E. A. Winzeler et al., Science 285, 901 (1999). P. Ross-Macdonald et al., Nature 402, 413 (1999).

14. D. Eisenberg et al., Nature 405, 823 (2000). M. Ashburner et al., Nat Genet 25, 25 (2000).

General Hypernotes

Focusing on Functional genomics

Michael B. Eisen at Berkeley, microarray analysis.

Church Lab at Harvard University, quantitative whole genome and proteome measures in computational modeling of regulatory and enzymatic networks.

Michael Bittner at NHGRI, hybridization-based analytical tools.

Tim Hughes at the University of Toronto, microarray expression analysis

Philip Green at Department of Genome Sciences, University of Washington, computational methods in functional genomics.

Aebersold Protein Laboratory at University of Washington, quantitative protein expression profiles.

Richard Young at MIT, Regulation of Gene Expression.

Michael Q. Zhang at Cold Spring Harbor Laboratory, a series of prediction tools.

Lincoln Stein Laboratory at Cold Spring Harbor Laboratory, various software for computational biology and genome bioinformatics.

Focusing on Structural genomics

Ruben Abagyan at the Scripps Research Institute, molecular modeling, bioinformatics and drug design & flexible docking.

Burkhard Rost at Columbia University, protein structure.

David Baker at University of Washington, Protein folding.

Helen Berman, president of American Crystallographic Association, protein structure and function.

Phil Bourne Senior Principal Scientist at the San Diego Supercomputer Center, structural bioinformatics.

Stephen H. Bryant at NCBI, Comparative macromolecular 3D-structure analysis.

Fred Cohen at Cellular & Molecular Pharmacology Department at the UCSF, protein folding.

Daniel Fischer at Ben Gurion University, Computational Structural Molecular Biology.

Richard A. Friesner at Columbia University, novel ab initio electronic structure methods.

Christopher Hogue at University of Toronto, Structural Bioinformatics.

Liisa Holm at Structural Genomics Group at EMBL-EBI, protein structure.

Barry Honig at Columbia University, computational biophysics and bioinformatics.

Sung-Hou Kim at Department of Chemistry of UC Berkeley, Structural Genomics.

Irwin D. Kuntz at UCSF, structure based ligand design and docking algorithms.

Guy Montelione at Rutgers, Protein NMR, structural genomics, proteomics, structural bioinformatics.

Janet Thornton at UCL, structural proteomics.

Eugene Shakhnovich at Harvard University, protein folding and de novo drug design.

Harold A. Scheraga at Cornell, protein folding.

Andrej Sali Lab at Rockefeller University, Protein Structure Theory.

George Rose at Johns Hopkins University, protein folding and structure.

J. Andrew McCammon at UCSD, theoretical chemistry and biochemistry.

Benoit Roux at Cornell, structure and function of ion channels.

Harel Weinstein at Mount Sinai School of medicine, Molecular recognition and signal transduction.

Focusing on the Development of Computational algorithms

Bonnie Berger head of the Computation and Biology group at MIT, mathematical techniques in protein folding and genomics.

Mark Borodovsky at Georgia Institute of Technology, machine learning algorithms and biomolecular structure-function relationships

B. S. Weir, Director of the Bioinformatics Research Center at North Carolina State University, statistical methodology for the interpretation of genetic data.

Edward H. Shortliffe at Columbia University, integrated decision-support systems in medical informatics.

Jun Liu at Department of Statistics, Harvard University, statistical methodology.

Gary Churchill at the Jackson Laboratory, Statistical Genetics.

Jonathan A. Eisen at TIGR, microbial genomics.

Imran Shah at Department of Computer Science & Engineering, University of Colorado at Denver, bioinformatic methods for integrating complex biological data.

Jurg Ott at Department of Genetics and Development, mathematical-statistical methods for human gene mapping.

Michael A. Newton, Department of Statistics and Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, empirical Bayesian methods in expression analysis.

Steven Salzberg, director of Bioinformatics, The Institute for Genomic Research, microbial genomics and machine learning.

Casimir Kulikowski at Department of Computer Science, Rutgers, knowledge-based problem solving methods, clustering, and predictive data mining for bioinformatics.

Mona Singh at MIT, machine learning and algorithms in computational biology.

Tamar Schlick at MIT, Mathematical Biology.

Kevin Karplus at UCSC compbio group, computational biology.

The Brutlag Bioinformatics Group at Biochemistry Department, Stanford University, prediction of structure and function from primary sequence.

Steven Benner at University of Florida, bioinformatics tools.

Chris Burge at MIT, RNA splicing.

Gary Stormo at Department of Genetics at Washington University School of Medicine, protein DNA interactions, RNA structure prediction

Larry Hunter at Center for Computational Pharmacology, University of Colorado at Denver, drug design through the use of high throughput molecular biology data.

Research centers

National Resource for Cell Analysis and Modeling- Home of the Virtual Cell.

Molecular Pattern Recognition Group at the Whitehead/MIT Center for Genome Research.

The Sanger Centre: Human Genome Project

Caltech Initiative in Computational Molecular Biology.

The Program in Computational Biology at Johns Hopkins University.

Computational Molecular Biology Group at Hebrew University.

Penn State - Bioinformatics Group

Program in Proteomics and Bioinformatics at the University of Toronto.

Theoretical & Computational Biology Group at MRC.

UPenn. Center for Bioinformatics

Virtual Genome Center at University of Minnesota.

Eric Lander, Genome Center Director at Whitehead Institute

George Weinstock, co-director of Human Genome Sequencing Center at Baylor College of Medicine, functional genomics and informatics.

Genomics Resources (particularly for yeast)

Complete S.cerevisiae Genome at NCBI.

YPD: Yeast Protein Database at Proteome, Inc.

SGD: Saccharomyces Genome Database at Stanford.

Sacch3D Structural Database at Stanford.

S. cerevisiae Promoter Database at Cold Spring Harbor.

Yeast Intron Sequences at University of California Santa Cruz.

Yeast Resource Center at the University of Washington.

Yeast Gene Duplications at Irish National Centre for BioInformatics.

The Saccharomyces Genome Deletion Project aims to generate a complete set of yeast deletion strains using a PCR-based gene deletion strategy, in order to assign function to the ORFs through phenotypic analysis of the mutants.

TRIPLES Database developed by Yale Center for Medical Informatics, provides public access to disruption phenotype, localization and expression data generated in the transposon-tagging study of the yeast genome.

Yeast Cluster Analysis Server provides graphic views of statistical data in yeast genome analysis.

Comparison of the worm and yeast genomes in terms of folds

Comprehensive reports on the yeast orfome focus on the probability of each ORF being a real gene.

Pfam, a large database of protein domain family alignments, annotation and profile HMMs, is led by Alex Bateman at the Wellcome Trust Sanger Institute.

Cancer Genome Anatomy Project (CGAP), NCI program to study molecular changes in cancer.

Columbia University Microarray Project (CUMAP), advancing microarray technologies.

SPINE, tracking database for NESGC, and data mining approach for identifying feasible targets in high-throughput structural proteomics.

Courses and educational resources relevant to Bioinformatics & Genomics

General

A glossary of genetic terms is provided by the National Human Genome Research Institute (NHGRI).

Genetics Glossary from the UK -- http://helios.bto.ed.ac.uk/bto/glossary/

Online Courses at the International Society for Computational Biology.

Biology Hypertextbook at MIT.

2001 Primer on Molecular Genetics from the U.S. Department of Energy.

Bioinformatics FAQ at Harvard Research Computing Center.

M.Gerstein, Department of Molecular Biophysics & Biochemistry and Computer Science, and Yale Bioinformatics Research Group, Yale University, provides lecture notes for a course on Genomics & Bioinformatics.

An elementary, step-by-step Genetics Tutorial at Rutgers University.

A gentle introduction course to Bioinformatics at University of Colorado at Denver.

ABIM Guides, Tutorials in Biology

Bioinformatics and Genomics course at UCLA.

Bioinformatics Applications Course at University of Georgia.

Computational genomics

Statistical Methods in Bioinformatics : An Introduction.

Computational Genomics, taught by William Noble Grundy at Columbia University.

Computational Techniques for Genomics course at University of Minnesota.

Data Mining Lecture Notes by Jeffrey Ullman at Stanford University.

Representations and Algorithms for Computational Molecular Biology, taught by Russ Altman at Stanford University.

DIMACS Workshop on Integration of Diverse Biological Data.

Computational Molecular Biology course at Stanford University.

Genomics and Computational Biology, taught by George Church at Harvard University.

Algorithms for Molecular Biology, taught by Ron Shamir at Tel Aviv University.

Computational Molecular Biology at Brandeis University.

Topics in Genomics and Computational Biology, taught by Steven Brenner and Michael Eisen at Berkeley.

Computational Biology Seminar at University of Washington.

Bioinformatics tools

Sequence search and alignment

BLAST: a set of protein or DNA similarity search programs designed to explore all of the available sequence databases in GenBank.

FastA: compares a protein or DNA sequence to another sequence or a database.

HMMer: uses profile hidden Markov models (HMMs) for sensitive database searching based on statistical descriptions of a sequence family's consensus.

SAM: Sequence Alignment and Modeling system, based on HMM and Dirichlet mixtures

CLUSTAL-W: a general purpose multiple alignment program for DNA or proteins.

Data Mining Tools

MiNK at Columbia Genome Center is part of the Columbia University Microarray Project.

Online Machine Learning Resources, maintained by the ML Group at the Austrian Research Institute for Artificial Intelligence (OFAI)

Bayesware Limited, knowledge discovery and data mining software based on Bayesian methods.

Software Packages for Graphical Models / Bayesian Networks, a list of software packages with a guide to their applications.

SVM-Light, an implementation of Support Vector Machine, developed by Thorsten Joachims at Cornell University.

svm: a simple software system for performing support vector machine binary classification, developed by William Noble Grundy.

Other

Entrez Utilities, a series of programs that retrieve information from NCBI databases.

PubMed Central Home, a digital archive of life sciences journal literature.

A Distributed Annotation System, center of development of an Open Source system for exchanging annotations on genomic sequence data.

bioxml.org, a center of development for open source biological DTDs.

DNA Structural Analysis Sequenced Genomes, a DNA Structural Atlas developed at Center for Biological Sequence Analysis, Technical University of Denmark.

euGenes, a common summary of gene and genomic information from eukaryotic organism databases.

The Phred/Phrap/Consed System, for sequence search and shotgun sequence assembly.

PHYLIP, free package of programs for inferring phylogenies.

ProtoMap - automatic hierarchical classification of proteins

Repeat Masker Server, screens DNA sequences for interspersed repeats and low complexity DNA sequences.

Numbered Hypernotes

1. As of December 2001, the GenBank repository of nucleic acid sequences contained over 800 whole genome sequences. Structural Genomics Initiatives aim at large-scale determination of protein structures. Functional genomics aims to elucidate the function of all the proteins in a genome, typically through mRNA expression, protein interaction assays, gene disruption or proteome microarrays.

2. The Biomolecular Interaction Network Database (BIND) is a database designed to store full descriptions of interactions, molecular complexes and pathways. As of December 2001, it stored records of 5940 interactions, 54 complexes and 7 pathways. The C. elegans interactome project led by Marc Vidal at Harvard University aims to generate a comprehensive protein-protein interaction map for C. elegans. AxCell's ProChart™ database is currently one of the most comprehensive databases of human protein interactions. The Proteomic Pathway Project at BIOCARTA provides dynamic graphic models for gene interaction. The Kohn Molecular Interaction Maps, designed by Kurt W. Kohn, chart the mammalian cell cycle control and DNA repair systems and are amenable to further development.

3. The Y2H system was first established by Fields and Song (1989). Brent's lab at Harvard University has developed a nice variant of the system, and Elledge's lab at Baylor College of Medicine has contributed a great deal to this technology.

4. The SH3 domain is one of the best-characterized members of the growing family of protein-interaction modules. The basic fold contains five anti-parallel β-strands packed to form two perpendicular β-sheets. It is catalogued as entry IPR001452 in InterPro.

5. Standard methods for expression data include hierarchical clustering, k-means clustering, self-organizing maps and principal component analysis. Inge Jonassen's web site provides references on these methods. Expression analysis software has been developed using various combinations of these methods, such as Cluster and TreeView, EPCLUS, CLEAVER, etc. One new combination method is local clustering, which finds time-shifted and inverted relationships. Expression databases include GEO (NCBI gene expression omnibus), ExpressDB (Harvard), GeneX (NCGR), Stanford Microarray Database and ArrayExpress (EBI). A comprehensive list of links to expression databases and software is provided at the European Bioinformatics Institute (EBI).

6. This estimate is built from a simple Bayesian network model of three binary-type datasources, with random noise. An example of binary-type data is protein-protein interactions ("YES, there is an interaction" or "NO, there isn't"). We give a detailed discussion of the calculation, with a spreadsheet, at genecensus.org/integrate/interactions.

7. There have been efforts to correlate expression data with transcriptional regulation, functional categories, protein folds and families, and protein abundance; to map between the transcriptome and interactome; to integrate sequence and expression data to predict subcellular localization; and to combine whole-genome data to infer functional linkages of proteins. GenMAPP at UCSF was designed to visualize gene expression data on maps representing biological pathways. Biomolecular Relations in Information Transmission and Expression (BRITE) is a database of binary relations involving genes and proteins, including interactions that underlie the KEGG pathway diagrams, protein-protein interactions by Y2H systems, sequence similarity relations by SSEARCH and expression similarity relations by microarray gene expression profiles.

8. A myriad of terms suffixed with "ome" have been coined to define the varied cellular populations and subpopulations, ranging from the most well-known, Genome, which first appeared in PubMed in 1932, to newly coined terms that have yet to gain wide recognition, such as Foldome or Transportome.

9. Eisenberg et al. defined functional protein networks combining experimentally determined interactions from the DIP with those computationally predicted from protein phylogenetic profiles and expression data. All interactions in the union of the various datasets were counted, with higher reliability given to those predicted by two or more methods. The Allfuse Database developed by the Computational Genomics Group at EBI aims to find functional associations of proteins in complete genomes through gene fusion.

10. Schwikowski et al. obtained a single large network from the union of recorded interactions in MIPS and the Database of Interacting Proteins (DIP), and data from Y2H experiments carried out by Ito et al. and Uetz et al., and thereby computed a map of interactions between functional groups.

11. Further details of this graph are available at genecensus.org/integrate/interactions (alternate site). Four datasets were integrated: (i) Yeast Cell Cycle Data (timecourse expression data); (ii) Rosetta knockout data at ExpressDB (non-timecourse expression data); (iii) essentiality data, which come from MIPS catalogs and (in conditional form) from Michael Snyder's phenotype-profiling transposon experiments; (iv) localization data, which come from a localization website (set #4) that merges data from MIPS, Swissprot, and the Snyder lab. YPD also has a large amount of localization data.

12. The names given to genes by their discoverers may not be related to their function and can sometimes be misleading.

13. Hierarchical systems for representing protein function in single organisms include the MIPS functional classification catalog (yeast), GenProtEC (E. coli), FlyBase (Drosophila) and EGAD (human ESTs). Gene Ontology is a common source in which functional classifications for multiple organisms are merged. The Enzyme system classifies enzyme function. MetaCyc: Metabolic Encyclopedia, WIT and KEGG describe gene function in terms of metabolic pathways and reactions. Protein function can also be inferred from interaction maps available at DIP, Pronet Database and the Biomolecular Interaction Network Database.

14. There are many groups that are currently working on inferring and defining protein function. Some representative sites: Steven E. Brenner at Berkeley; Janet Thornton at UCL, a pioneer in structure-function relationships; an integrated function site under construction at Yale; Barry Honig at Columbia; Andrey Rzhetsky and Bill Noble (with Paul Pavlidis) at the Columbia Genome Center; David Eisenberg at UCLA; and Eugene V. Koonin at the NCBI.