Science -- Gerstein et al. 295 (5553): 284

PROTEOMICS:
Enhanced: Integrating Interactomes

Mark Gerstein, Ning Lan, Ronald Jansen [HN19]^*

With the human genome sequence as an intellectual inspiration and practical scaffold, scientists are ready to perform experiments on all genes [HN1]. Integrating the resulting genomewide information into useful definitions of protein function is a huge challenge. Exactly what form such functional definitions will take is still debatable, but comprehensive networks of protein-protein interactions, or interactomes, should prove valuable in helping to shape them [HN2].

On page 321 of this issue, Tong et al. [HN3] (1, 2) describe a systematic approach for identifying protein-protein interaction networks in which different peptide recognition domains participate. They break new ground in the way they combine "orthogonal" (that is, fundamentally different) sets of genomic information. Specifically, they study the intersection of two different interactomes. The first is derived from screening phage-display peptide libraries to find consensus sequences in yeast proteins that bind to particular peptide recognition domains. The resulting network connects proteins with recognition domains to those containing the consensus. This network partially defines binding sites in some of the proteins and represents a clever use of phage display technology [HN4]. The second network is derived from experimentally testing each peptide recognition module, using the yeast two-hybrid technique [HN5], for association with possible protein-binding partners. Tong et al. apply their approach to determine interacting partners for SH3 domains [HN6] in yeast proteins. These domains make good targets because of their prevalence and involvement in a number of important biological processes, from cytoskeleton reorganization to signal transduction.

The power of Tong et al.'s strategy, particularly for reducing noise, becomes manifest when interpreting large genomic data sets. One fallacy in dealing with genomic data sets is ascribing too much meaning to individual data points. Many data sets (for example, gene expression profiles) contain so much noise that it can be difficult to draw reliable conclusions for specific genes. These data sets still offer much useful information statistically, in terms of broad trends, but they are useful only insofar as the data can be aggregated. This can be simply achieved by combining replicates of an experiment, but such a process does not remove systematic errors. It is also possible to collect many individual measurements on different proteins into aggregate "proteomic classes," for example, functional categories, and to compare these [HN7] (3-6).

The new work points to perhaps the most powerful approach: interrelating and integrating orthogonal information. In the abstract, it is easy to demonstrate that combining independent data sets results in a lower error rate overall. For instance, combining three independent binary-type data sets with error rates of 10% reduces the overall error rate to 2.8% (for both false positives and negatives) [HN8] (7). Moreover, interrelating two different types of whole-genome data also enables one to discover potentially important but not obvious relationships--for example, between gene expression and the position of genes on chromosomes, or between gene expression and the subcellular localization of proteins (8, 9).
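
The 2.8% figure can be checked with a short calculation. The sketch below assumes the three data sets are combined by simple majority vote, which reproduces the number quoted above; the authors' own estimate (see reference 7 and HN8) is built from a simple Bayesian network model, so the voting rule and the function name `majority_vote_error` are illustrative assumptions rather than their method.

```python
from math import comb

def majority_vote_error(p: float, n: int) -> float:
    """Probability that a majority of n independent binary sources,
    each wrong with probability p, gives the wrong answer.
    Assumes n is odd so that a majority always exists."""
    if n % 2 == 0:
        raise ValueError("n must be odd")
    k_min = n // 2 + 1  # smallest number of wrong sources that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

# Three independent data sets, each with a 10% error rate.
error = majority_vote_error(0.10, 3)
print(f"{error:.3f}")  # prints 0.028
```

The result is the sum of two cases: exactly two sources wrong (3 x 0.1 x 0.1 x 0.9 = 0.027) and all three wrong (0.001).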

There have been a number of previous attempts to interrelate information from different genomic data sets. For instance, gene expression profiles were initially analyzed by a variety of supervised and unsupervised methods--hierarchical trees, k-means, self-organizing maps, and support-vector machines--and compared with protein function categories (10-14). Gene expression data were also compared with data sets describing transcription factor binding sites, protein families, protein-protein interactions, and protein abundance [HN9] (3-6, 15-20). In a shorthand sense, much of this can be thought of as interrelating the transcriptome (population of mRNA transcripts) with other "omes" such as the proteome, translatome, secretome, and interactome [HN10] (3).

There are considerably fewer examples of the synthesis of more than two types of genomic information [HN11]. One initial attempt combined gene expression correlations, phylogenetic profiles, and patterns of domain fusion to predict protein function (21, 22). Bayesian statistics were used to integrate gene expression, "essentiality" (the degree to which a gene is essential for survival), and sequence motif data into a uniform framework for the prediction of protein subcellular localization (20). Tong et al.'s strategy of overlapping interactomes presents a new type of synthesis. It is particularly effective in that their two data sets are orthogonal in many respects. Phage display is based on in vitro binding of short peptides, whereas the two-hybrid approach assays in vivo binding between full-length proteins. Moreover, the phage display network is computationally predicted but uses relatively unambiguous consensus sequences, whereas the two-hybrid network is experimentally derived but suffers from appreciable false positives (23, 24).

From a data-mining standpoint, the heterogeneous character and variable quality of whole-genome information make integration tricky. Consider combining "orthogonal" interactome data sets, as attempted by Tong et al., in a general sense. How might one proceed formally? There are two extremes (see the figure, below). At one extreme, the data sets have low false-negative but high false-positive error rates. That is, each experiment almost never misses real interactions but also finds many spurious ones. In this situation, the benefit of integration comes from intersection: Only interactions common to all are accepted, thus lowering the combined error rate. Tong et al.'s approach fits this to some degree. At the other extreme are data sets with few false positives but low coverage of the space of interactions. The benefit of integration then comes from the union: Any interaction found in at least one data set is accepted. An earlier interactome analysis followed this to some degree [HN12] (25).
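
The two extremes map directly onto set operations. The following is a minimal sketch on toy data: the protein names and both networks are hypothetical placeholders, and representing each undirected interaction as a `frozenset` is a convenience chosen here, not something taken from Tong et al.

```python
def edge(a: str, b: str) -> frozenset:
    """Undirected interaction: edge("A", "B") equals edge("B", "A")."""
    return frozenset((a, b))

# Hypothetical toy networks; the protein names are placeholders.
phage_display = {edge("A", "B"), edge("A", "C"), edge("B", "D")}
two_hybrid = {edge("B", "A"), edge("C", "D"), edge("D", "B")}

# Intersection: keep only links seen in both networks (fewer false positives).
high_confidence = phage_display & two_hybrid

# Union: keep links seen in either network (higher coverage).
high_coverage = phage_display | two_hybrid

print(sorted(sorted(e) for e in high_confidence))  # [['A', 'B'], ['B', 'D']]
print(len(high_coverage))  # 4
```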

Overlapping nets. Two different extremes in integrating interactomes. The combined network on the left is the union of those interactomes with low false-positive but high false-negative rates, whereas the combined network on the right is the intersection of interactomes with low false-negative but high false-positive rates. Circles represent proteins; links, interactions; and dotted lines, known associations. Thicker links indicate lower false-positive rates. More effective rules for combining networks than union and intersection take into account the different error rates associated with each link type.

In most practical situations, the optimal way to integrate data sets is somewhere between these extremes. The task is to combine data sets with varying error rates and coverage. Accordingly, the rules for identifying positives become more complicated. Instead of simple unions or intersections, different combinations of positive and negative signals from the data sets should be considered, taking into account their relative false-positive and -negative rates.
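
One way to make such a rule concrete is to score each candidate link by summing log-likelihood ratios across data sets, as in a naive Bayes classifier. This is only a sketch under an independence assumption, not the procedure used by Tong et al. or by the authors; the error rates and the function names are hypothetical.

```python
from math import log

def log_likelihood_ratio(observed: bool, fp_rate: float, fn_rate: float) -> float:
    """log of P(signal | real interaction) / P(signal | no interaction)."""
    if observed:
        return log((1.0 - fn_rate) / fp_rate)
    return log(fn_rate / (1.0 - fp_rate))

def combined_score(signals, error_rates) -> float:
    """Sum the per-data-set log-likelihood ratios (assumes independent data sets)."""
    return sum(
        log_likelihood_ratio(seen, fp, fn)
        for seen, (fp, fn) in zip(signals, error_rates)
    )

# Hypothetical (false-positive rate, false-negative rate) for two data sets:
# the first is noisy but rarely misses, the second is clean but sparse.
rates = [(0.30, 0.05), (0.02, 0.60)]

both = combined_score([True, True], rates)
noisy_only = combined_score([True, False], rates)
clean_only = combined_score([False, True], rates)
neither = combined_score([False, False], rates)

print(f"{both:.2f} {noisy_only:.2f} {clean_only:.2f} {neither:.2f}")
```

In this toy example, accepting only links whose score exceeds both single-data-set scores reproduces the intersection, and a threshold just below them reproduces the union. With more data sets or different error rates, intermediate thresholds give rules that lie between the two. Note that a missing signal in a data set that rarely misses real interactions counts as substantial negative evidence.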

A practical illustration of the power of interrelating genomic data for yeast [HN13] (see the figure, below) shows the degree to which one can find protein-protein associations in known protein complexes (5, 6, 26) by stepwise integration of increasing amounts of orthogonal genomic information [HN14]. We start by considering associations that can be found from gene expression correlations over the cell cycle (27); then we incorporate those derived from a second but different microarray experiment, which provides a series of gene expression changes after specific genes have been knocked out (28). Finally, we add associations predicted from genomic measurements of essentiality and localization (20, 26, 29, 30). As we integrate more information, the total number of correctly identified interactions rises (especially for the union of the predicted associations). Simultaneously, the error rate decreases. Moreover, if we focus just on the intersection of the predicted associations, the error rate falls even more.

A net profit from integration. Integrating progressively more orthogonal information identifies more and more associations (5-7). From the known complexes in yeast, there are 8250 protein-protein associations (26). The y axis shows the percentage of these identified by disparate genomic data (that is, coverage). The x axis shows the progressive addition of genomic data. The first two bars represent the protein associations with the most significant expression correlation in two different microarray sets (27, 28). The next two represent adding the associations predicted because both proteins were similarly essential for cell survival ("essentiality") or had similar subcellular localization (20, 29, 30). The color shading on the bars roughly indicates false-positive rates throughout the integration. Although it is reasonable that associated components of complexes will have correlated expression and similar localization and "essentiality," this is only weakly predictive, generating many spurious positives. Consequently, the "weak links" case in the right hand panel of the previous figure mostly applies, and the shading indicates how intersection lowers the error rate.
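
The bookkeeping behind such a figure can be sketched as follows. The data here are invented toy pairs, not the yeast data sets, and "false-positive fraction" is defined for this sketch as the share of predicted associations absent from the known complexes, which may differ from the measure used for the shading in the figure.

```python
def coverage(predicted: set, known: set) -> float:
    """Fraction of known associations that were predicted."""
    return len(predicted & known) / len(known)

def false_positive_fraction(predicted: set, known: set) -> float:
    """Fraction of predictions that are not among the known associations."""
    if not predicted:
        return 0.0
    return len(predicted - known) / len(predicted)

# Hypothetical toy data; each string stands for one protein pair.
known = {"a-b", "a-c", "b-c", "d-e"}
cell_cycle = {"a-b", "a-c", "x-y", "d-z"}
knockout = {"a-b", "b-c", "q-r"}

union = cell_cycle | knockout
intersection = cell_cycle & knockout

for name, predicted in [("cell cycle", cell_cycle), ("union", union),
                        ("intersection", intersection)]:
    print(name, coverage(predicted, known), false_positive_fraction(predicted, known))
```

On this toy input the union raises coverage from 0.5 to 0.75, while the intersection drops the false-positive fraction from 0.5 to 0.0 at the cost of coverage.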

A future challenge will be to devise uniform frameworks for integrating information from both high-throughput and traditional biochemical approaches. One aspect of this will be to develop better databases for storing and querying heterogeneous information. In particular, databases will need to be more precise in their treatment of errors and also interface better with the information in journals. Another aspect will be to develop data-mining strategies that can operate with these databases, integrating many different genomic features into results pertinent to biology. Genomic features can be of very different character (from hundreds of "Booleans" for interactions, to tens of thousands of real-number vectors for expression profiles), and a central issue in integration is determining how to weight each feature relative to the others. In this regard, some machine-learning techniques, such as Bayesian networks [HN15] and decision trees, are quite powerful, whereas others, for example, support-vector machines, are more problematic.

Finally, we will need to come up with a more systematic definition of gene function, the ultimate aim of proteomic investigation. To many scientists, what constitutes "function" is a phrase or name often in nonsystematic terminology, such as "ATPase" or "suppressor of white apricot [HN16]." Such descriptions are sufficient for single-molecule work but cannot be scaled up to the genomic level. More systematic attempts have been made to place proteins within a hierarchy of standard functional categories or to connect them in overlapping networks of varying types of association [HN17] (26, 31, 32). These networks can obviously include protein-protein interactions, the subject of Tong et al.'s work. More broadly, they can include pathways, regulatory systems, and signaling cascades. How far are we able to go with this network approach? Perhaps, in the future, the systematic combination of networks may provide for a truly rigorous definition of protein function [HN18].

References

1. A. H. Y. Tong et al., Science 295, 321 (2002); published online 13 December 2001 (10.1126/science.1064987).
2. Interaction data from Biomolecular Interaction Network Database (http://www.binddb.org/).
3. D. Greenbaum et al., Genome Res. 11, 1463 (2001).
4. R. Jansen, M. Gerstein, Nucleic Acids Res. 28, 1481 (2000).
5. J. Qian et al., J. Mol. Biol. 314, 1053 (2001).
6. R. Jansen et al., Genome Res. 12, 37 (2002).
7. Details at http://genecensus.org/integrate/interactions.
8. B. A. Cohen et al., Nature Genet. 26, 183 (2000).
9. A. Drawid et al., Trends Genet. 16, 426 (2000).
10. P. Tamayo et al., Proc. Natl. Acad. Sci. U.S.A. 96, 2907 (1999).
11. M. B. Eisen et al., Proc. Natl. Acad. Sci. U.S.A. 95, 14863 (1998).
12. S. Tavazoie et al., Nature Genet. 22, 281 (1999).
13. M. Gerstein, R. Jansen, Curr. Opin. Struct. Biol. 10, 574 (2000).
14. M. Brown et al., Proc. Natl. Acad. Sci. U.S.A. 97, 262 (2000).
15. H. Ge et al., Nature Genet. 29, 482 (2001).
16. F. Roth et al., Nature Biotechnol. 16, 939 (1998).
17. S. Gygi et al., Mol. Cell. Biol. 19, 1720 (1999).
18. B. Futcher et al., Mol. Cell. Biol. 19, 7357 (1999).
19. A. Brazma et al., Genome Res. 8, 1202 (1998).
20. A. Drawid, M. Gerstein, J. Mol. Biol. 301, 1059 (2000).
21. E. Marcotte et al., Science 285, 751 (1999).
22. E. Marcotte et al., Nature 402, 83 (1999).
23. T. Ito et al., Proc. Natl. Acad. Sci. U.S.A. 98, 4569 (2001).
24. P. Uetz et al., Nature 403, 623 (2000).
25. B. Schwikowski et al., Nature Biotechnol. 18, 1257 (2000).
26. H. W. Mewes et al., Nucleic Acids Res. 28, 37 (2000).
27. R. J. Cho et al., Mol. Cell 2, 65 (1998).
28. T. R. Hughes et al., Cell 102, 109 (2000).
29. E. A. Winzeler et al., Science 285, 901 (1999).
30. P. Ross-Macdonald et al., Nature 402, 413 (1999).
31. D. Eisenberg et al., Nature 405, 823 (2000).
32. M. Ashburner et al., Nature Genet. 25, 25 (2000).

The authors are in the Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA. E-mail: mark.gerstein@yale.edu

HyperNotes
Related Resources on the World Wide Web

General Hypernotes

Dictionaries and Glossaries: The On-line Medical Dictionary is provided by CancerWEB. D. Glick's Glossary of Biochemistry and Molecular Biology is made available by Portland Press. A glossary of genetic terms is provided by the National Human Genome Research Institute. A genetics glossary is provided by the Biology Teaching Organisation, University of Edinburgh. A genomics lexicon is offered on the Genomics: A Global Resource Web site, which is maintained by the Pharmaceutical Research and Manufacturers of America. The Cambridge Healthtech Institute provides a collection of genomics glossaries and taxonomies.
Web Collections, References, and Resource Lists: Links to the Genetic World, provided by the Human Genome Program of the U.S. Department of Energy (DOE), includes collections of Internet resources related to proteomics and bioinformatics. A genome glossary is also provided. Deambulum, an Internet resource for molecular biology, biocomputing, medicine, and biology, is provided by Infobiogen. Biological Servers on the Web is provided by Atelier BioInformatique, Marseille, France. Collections of links to guides and tutorials in biology and analysis tools tutorials are provided. The CMS Molecular Biology Resource is a compendium of electronic and Internet-accessible tools and resources for molecular biology, biotechnology, molecular evolution, biochemistry, and biomolecular modeling. GenomeWeb is provided by the UK Human Genome Mapping Project Resource Centre. Science's Functional Genomics Web site provides links to news, educational, and scientific resources in genomics and postgenomics. BioWurld is a semiautomated index of resources in the fields of bioinformatics and molecular biology maintained by the European Bioinformatics Institute (EBI). EBI provides a proteome analysis database. The U.S. National Center for Biotechnology Information (NCBI) is a resource for molecular biology information and databases. A genetics review is provided. The ExPASy (Expert Protein Analysis System) Molecular Biology Server of the Swiss Institute of Bioinformatics is dedicated to the analysis of protein sequences and structures. Amos' WWW links page provides pointers to Internet information sources for life scientists with an interest in biological macromolecules. The International Society for Computational Biology provides links to online courses, as well as links to Internet sites that collect links related to computational biology.
Online Texts and Lecture Notes: The MIT Biology Hypertextbook is an introduction to molecular biology. A BioComputing Hypertext Coursebook is provided by the VSNS BioComputing Division, University of Bielefeld, Germany. Molecular Genetics is a tutorial provided by U. Melcher, Department of Biochemistry and Molecular Biology, Oklahoma State University. M. Hewlett, Department of Molecular and Cellular Biology, University of Arizona, offers lecture notes for a molecular biology course. The Weizmann Institute of Science, Rehovot, Israel, offers an introduction to bioinformatics and lecture notes for a course on bioinformatics and computational genomics. M. Gribskov, San Diego Supercomputer Center, and D. Smith, Division of Biology, University of California, San Diego, provide lecture notes for a course on bioinformatics. D. Smith offers lecture notes for a course on molecular biology. R. Davis, Department of Biology, City University of New York Graduate Center, offers lecture notes and other resources for a course on bioinformatics and genomics. A bioinformatics glossary is provided. M. Gerstein, Departments of Molecular Biophysics and Biochemistry and Computer Science, and Bioinformatics Research Group, Yale University, provides lecture notes for a course on genomics and bioinformatics. G. Church, Department of Genetics and Lipper Center for Computational Genetics, Harvard Medical School, provides lecture notes for a course on genomics and computational biology. J. Ullman, Department of Computer Science, Stanford University, provides lecture notes (in Adobe Acrobat format) for a course on data mining. R. Altman, Biomedical Informatics Training Program, Stanford University School of Medicine, provides lecture slides and Internet resources for a course on representations and algorithms for computational molecular biology.
General Reports and Articles: The DOE's Human Genome Program presents a 2001 primer titled "Genomics and its impact on medicine and society" with a section on genome basics and a dictionary. S. Wuchty, Bioinformatics and Computational Biochemistry Group, European Media Laboratory, Heidelberg, Germany, makes available (in Adobe Acrobat format) an article preprint titled "Interaction and domain networks of yeast." The 16 February 2001 issue of Science had a Viewpoint article by S. Fields titled "Proteomics in genomeland." The 27 October 2000 issue had a Techview article by T. Attwood titled "The Babel of bioinformatics." The 24 October 1997 issue (a special genome issue on building gene families) had a Viewpoint article by P. Hieter and M. Boguski titled "Functional genomics: It's all how you read it." The 10 April 2001 issue of the Proceedings of the National Academy of Sciences had a commentary by T. Hazbun and S. Fields titled "Networking proteins in yeast."

Numbered Hypernotes

Note: The authors of the Perspective have provided most of the following hypernotes.

As of December 2001, the GenBank repository of nucleic acid sequences contained over 800 whole genome sequences. Structural Genomics Initiatives aim at large-scale determination of protein structures. Functional genomics aims to elucidate the function of all the proteins in a genome, typically through mRNA expression, protein interaction assays, gene disruption, or proteome microarrays.
Interactomes. The Biomolecular Interaction Network Database (BIND) is a database designed to store full descriptions of interactions, molecular complexes and pathways; as of December 2001, it stored records of 5940 interactions, 54 complexes, and 7 pathways. The C. elegans interactome project, directed by M. Vidal, Dana Farber Cancer Institute and Harvard Medical School, aims to generate a comprehensive protein-protein interaction map for C. elegans. The ProChart database from AxCell Biosciences is currently one of the most comprehensive databases of human protein interactions; a 1 June 2001 press release about the product is available. The Proteomic Pathway Project at BIOCARTA provides dynamic graphic models for gene interaction. Kohn Molecular Interaction Maps, designed by K. Kohn, Laboratory of Molecular Pharmacology, National Cancer Institute, are molecular interaction maps of mammalian cell cycle control and DNA repair systems and are amenable to further development.
A. H. Y. Tong and C. Boone are in the Banting and Best Department of Medical Research and Graduate Department of Molecular and Medical Genetics, University of Toronto. B. Nelson and M. Evangelista are in the Department of Biology, Queen's University, Kingston, Ontario, Canada. G. Nardelli, B. Brannetti, L. Castagnoli, S. Ferracuti, S. Paoluzi, M. Quondam, A. Zucconi, and G. Cesareni are in the Department of Biology, University of Rome Tor Vergata. S. Fields is in the Department of Genome Sciences, University of Washington, and is an investigator at the Howard Hughes Medical Institute. B. Drees (formerly at the Fields lab) is at the Institute for Systems Biology, Seattle. G. Bader and C. Hogue are at the Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto, and in the Department of Biochemistry, University of Toronto.
Phage display technology. Dyax Corporation provides an introduction to phage display. R. Kontermann, Institute of Molecular Biology and Tumor Research, University of Marburg, Germany, provides an introduction to phage display. The German Research Center for Biotechnology makes available a research presentation by J. Collins and P. Röttgen titled "Evolutive phage-display -- A method as basis for a company" that appeared in the 1997 annual report. The March/April 1999 Molecular Biology Newsletter from the Molecular Biology Program, University of Missouri, Columbia, had an article by G. Smith and S. Deutscher titled "Searching vast tracts of sequence space with phage display." G. Smith, Division of Biological Sciences, University of Missouri, Columbia, offers a phage display Web page.
The yeast two-hybrid system. The yeast two-hybrid (YTH) system was first established by O. Song and S. Fields in a 20 July 1989 article in Nature titled "A novel genetic system to detect protein-protein interactions." The Fields Lab home page at the Department of Genome Sciences, University of Washington, provides an introduction to the two-hybrid system. A presentation on the two-hybrid system for the detection of protein-protein interactions is provided by the companion Web site for S. Gilbert's Developmental Biology. R. Brent's lab at the Molecular Sciences Institute, Berkeley, has developed a variant of the system, and S. Elledge's lab at the Baylor College of Medicine has contributed a lot to this technology. The Finley Lab home page at the Center For Molecular Medicine and Genetics, Wayne State University School of Medicine, makes available a chapter by R. Finley and R. Brent titled "Two-hybrid analysis of genetic regulatory networks" and a presentation titled "Examining the function of proteins and protein networks with the yeast two-hybrid system" as well as collections of two-hybrid protocols and links. C. Götting's Biomedpage provides a collection of Internet links related to yeast two-hybrid systems.
The SH3 domain is one of the best-characterized members of the growing family of protein-interaction modules. The basic fold contains five anti-parallel beta-strands packed to form two perpendicular beta-sheets. It is cataloged as entry IPR001452 in InterPro. A presentation on the SH3 domain is provided by the Protein Kinase Resource's 3-D Structure Web site.
Combination and analysis of data sets. Standard methods for expression data include hierarchical clustering, k-means clustering, self-organizing maps, and principal component analysis. I. Jonassen's Web page provides references on these methods. Expression analysis software has been developed using various combinations of these methods, such as Cluster and TreeView, EPCLUST, and CLEAVER. One new combination method is local clustering, which finds time-shifted and inverted relationships. Expression databases include GEO (Gene Expression Omnibus, from NCBI), ExpressDB (from the Harvard-Lipper Center for Computational Genetics), GeneX (from the National Center for Genome Resources), the Stanford Microarray Database, and ArrayExpress (from EBI). A comprehensive list of links to expression databases and software is provided by the EBI Expression Profiler Web page.
This estimate is built from a simple Bayesian network model of three binary-type data sources, with random noise. An example of binary-type data is protein-protein interactions ("YES, there is an interaction" or "NO, there isn't"). The Gerstein Lab Web page provides a detailed discussion of the calculation with a spreadsheet provided.
Correlation of expression data. There have been efforts to correlate expression data with transcriptional regulation, functional categories, protein folds and families, and protein abundance; to map between transcriptome and interactome; to integrate sequence and expression data to predict subcellular localization; and to combine whole-genome data to infer functional linkages of proteins. GenMAPP, developed at the J. David Gladstone Institutes, University of California, San Francisco, was designed to visualize gene expression data on maps representing biological pathways. Biomolecular Relations in Information Transmission and Expression (BRITE) is a database of binary relations involving genes and proteins, including interactions that underlie the pathway diagrams of KEGG (Kyoto Encyclopedia of Genes and Genomes), protein-protein interactions by Y2H systems, sequence similarity relations by SSEARCH, and expression similarity relations by microarray gene expression profiles.
A myriad of terms suffixed with "ome" have been coined to define the varied cellular populations and subpopulations, ranging from the most well-known genome, which first appeared in the biomedical literature in 1932, to the newly coined terms that have yet to gain wide recognition, such as foldome and transportome. The Cambridge Healthtech Institute provides an -omes and -omics glossary.
Synthesis of more than two types of genomic information. Eisenberg et al. ("Protein function in the post-genomic era" in the 15 June 2000 issue of Nature) defined functional protein networks combining experimentally determined interactions from DIP (Database of Interacting Proteins) with those computationally predicted from protein phylogenetic profiles and expression data. The union of various data sets was counted while higher reliability was given to interactions predicted by two or more methods. The Allfuse Database developed by EBI's Computational Genomics Group aims to find functional associations of proteins in complete genomes through gene fusion.
A previous interactome analysis. Schwikowski et al. ("A network of protein-protein interactions in yeast" by B. Schwikowski, P. Uetz, and S. Fields in the December 2000 issue of Nature Biotechnology) obtained a single large network from the union of recorded interactions in the MIPS yeast genome database and DIP and data from Y2H experiments carried out by Ito et al. and Uetz et al. and thereby computed a map of interactions between functional groups.
The yeast S. cerevisiae and its genome. F. Sherman, Department of Biochemistry and Biophysics, University of Rochester Medical School, offers a presentation titled "An introduction to the genetics and molecular biology of the yeast Saccharomyces cerevisiae." J. Huberman, Department of Cancer Genetics, Roswell Park Cancer Institute, offers lecture notes on yeast genetics. The Yeast Resource Center at the University of Washington provides information to facilitate the identification and characterization of protein complexes in the yeast Saccharomyces cerevisiae. The Munich Information Center for Protein Sequences (MIPS) provides the Comprehensive Yeast Genome Database (CYGD), as well as an introduction to S. cerevisiae, a collection of Internet links, and a glossary. The Saccharomyces Genome Database (SGD) is maintained by the Department of Genetics, Stanford University School of Medicine; a collection of links to yeast WWW sites is maintained. SGD provides Sacch3D, which presents structural information for every protein in the yeast genome. The complete S. cerevisiae genome is made available by NCBI. YPD (Yeast Proteome Database) is provided by Proteome, Inc. The S. cerevisiae Cluster Analysis Server, provided by the Yale Bioinformatics Research Group, offers graphic views of statistical data in yeast genome analysis; a collection of yeast links is provided. The Yale Genome Analysis Center offers the TRIPLES database, which provides public access to the data generated from the center's transposon-tagging study of the yeast genome. Nature's Genome Gateway provides a yeast genome directory.
Further details of the graph in this figure are available at genecensus.org/integrate/interactions (alternate site). Four data sets were integrated: (1) yeast cell cycle data (time course expression data); (2) Rosetta knockout data at ExpressDB (non-time course expression data); (3) essentiality data from MIPS CYGD and (in conditional form) from the phenotype-profiling transposon experiments of M. Snyder; and (4) localization data from a localization Web site (set #4), which merges data from MIPS CYGD, ExPASy's SWISS-PROT, and the Snyder lab. YPD also has lots of localization data.
Bayesian networks. Bayesware Limited provides knowledge discovery and data mining software based on Bayesian methods. K. Murphy, Department of Computer Science, University of California, Berkeley, presents a tutorial on graphical models and Bayesian networks, as well as a collection of links to software packages for graphical models/Bayesian networks. N. Friedman, School of Computer Science and Engineering, Hebrew University, Jerusalem, makes available a tutorial on Bayesian networks, as well as an article by N. Friedman, I. Nachman, and D. Pe'er titled "Using Bayesian networks to analyze expression data" (the article is also available from the NEC Research Institute's ResearchIndex literature database). The Laboratory for Advanced Database Systems, Department of Computer Science, University of Kentucky, provides links to presentations on Bayesian nets.
The name given to a gene by its discoverer may not be related to its function, and sometimes can be misleading, as discussed in the Science Observer column by M. Vacek titled "A gene by any other name" in the November-December 2001 issue of American Scientist.
Functional categories for proteins. Hierarchical systems to represent protein function for single organisms include the CYGD functional classification catalog (yeast), GenProtEC (E. coli), FlyBase (EBI mirror site) (Drosophila), and EGAD (human expressed sequence tags). The Gene Ontology Project provides a set of structured vocabularies for specific biological domains that can be used to describe gene products in any organism. ExPASy's ENZYME nomenclature database classifies enzyme function. MetaCyc: Metabolic Encyclopedia, WIT (What Is There?), and KEGG describe gene function in terms of metabolic pathways and reactions. Protein function can also be inferred from interaction maps available at DIP, ProNet Online, and BIND.
Groups working to define protein function. Of the many groups that are currently working on inferring and defining protein function, here are some representative ones: S. Brenner, Computational Genomics Research Group, University of California, Berkeley; J. Thornton, Structure and Modelling Group, Department of Biochemistry and Molecular Biology, University College London; B. Honig, Department of Biochemistry and Molecular Biophysics, Columbia University; A. Rzhetsky and W. Noble (with P. Pavlidis), Columbia Genome Center, Columbia University; D. Eisenberg, UCLA-DOE Laboratory of Structural Biology and Molecular Medicine; and E. Koonin at NCBI. The Gerstein Lab's integrated function site is under construction.
M. Gerstein, N. Lan, and R. Jansen are in the Department of Molecular Biophysics and Biochemistry, Yale University.
