PROTEOMICS: Mark Gerstein, Ning Lan, Ronald Jansen
With the human genome sequence as an
intellectual inspiration and practical scaffold, scientists are
ready to perform experiments on all genes [HN1].
Integrating the resulting genomewide information into useful
definitions of protein function is a huge challenge. Exactly what
form such functional definitions will take is still debatable, but
comprehensive networks of protein-protein interactions, or
interactomes, should prove valuable in helping to shape them [HN2].
On page 321
of this issue, Tong et al. [HN3]
describe a systematic approach for identifying protein-protein
interaction networks in which different peptide recognition domains
participate. They break new ground in the way they combine
"orthogonal" (that is, fundamentally different) sets of genomic
information. Specifically, they study the intersection of two
different interactomes. The first is derived from screening
phage-display peptide libraries to find consensus sequences in yeast
proteins that bind to particular peptide recognition domains. The
resulting network connects proteins with recognition domains to
those containing the consensus. This network partially defines
binding sites in some of the proteins and represents a clever use of
phage display technology [HN4].
The second network is derived from experimentally testing each
peptide recognition module, using the yeast two-hybrid technique [HN5],
for association with possible protein-binding partners. Tong et
al. apply their approach to determine interacting partners for
SH3 domains [HN6]
in yeast proteins. These domains make good targets because of their
prevalence and involvement in a number of important biological
processes, from cytoskeleton reorganization to signal transduction.
The power of Tong et al.'s strategy, particularly for
reducing noise, becomes manifest when interpreting large genomic
data sets. One fallacy in dealing with genomic data sets is
ascribing too much meaning to individual data points. Many data sets
(for example, gene expression profiles) contain so much noise that
it can be difficult to draw reliable conclusions for specific genes.
These data sets still offer much useful information statistically,
in terms of broad trends, but they are useful only insofar as the
data can be aggregated. This can be simply achieved by combining
replicates of an experiment, but such a process does not remove
systematic errors. It is also possible to collect many individual
measurements on different proteins into aggregate "proteomic
classes," for example, functional categories, and to compare these
The new work points to perhaps the most powerful approach:
interrelating and integrating orthogonal information. In the
abstract, it is easy to demonstrate that combining independent data
sets results in a lower error rate overall. For instance, combining
three independent binary-type data sets with error rates of 10%
reduces the overall error rate to 2.8% (for both false positives and
false negatives).
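The 2.8% figure can be reproduced under one simple reading of the combination rule: a majority vote among three independent binary sources, each wrong 10% of the time. The sketch below assumes that rule, which is not spelled out in the text, and the function name is purely illustrative.

```python
from math import comb

def majority_vote_error(n_sources: int, error_rate: float) -> float:
    """Probability that a majority of independent binary sources are wrong.

    Assumes every source has the same error rate and that n_sources is odd,
    so that no ties occur.
    """
    majority = n_sources // 2 + 1
    return sum(
        comb(n_sources, k) * error_rate**k * (1 - error_rate) ** (n_sources - k)
        for k in range(majority, n_sources + 1)
    )

combined = majority_vote_error(3, 0.10)
print(f"{combined:.3f}")  # 0.028
```

The two contributing terms are 3 x 0.1^2 x 0.9 = 0.027 (exactly two sources wrong) and 0.1^3 = 0.001 (all three wrong). The gain depends entirely on independence: correlated (systematic) errors would not cancel in this way.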
Moreover, interrelating two different types of whole-genome data
also enables one to discover potentially important but not obvious
relationships--for example, between gene expression and the position
of genes on chromosomes, or between gene expression and the
subcellular localization of proteins (8,
There have been a number of previous attempts to interrelate
information from different genomic data sets. For instance, gene
expression profiles were initially analyzed by a variety of
supervised and unsupervised methods--hierarchical trees, k-means,
self-organizing maps, and support-vector machines--and compared with
protein function categories (10-14).
Gene expression data were also compared with data sets
describing transcription factor binding sites, protein families,
protein-protein interactions, and protein abundance [HN9]
In a shorthand sense, much of this can be thought of as
interrelating the transcriptome (population of mRNA transcripts)
with other "omes" such as the proteome, translatome, secretome, and
There are considerably fewer examples of the synthesis of more
than two types of genomic information [HN11].
One initial attempt combined gene expression correlations,
phylogenetic profiles, and patterns of domain fusion to predict
protein function (21,
Bayesian statistics were used to integrate gene expression,
"essentiality" (the degree to which a gene is essential for
survival), and sequence motif data into a uniform framework for the
prediction of protein subcellular localization (20).
Tong et al.'s strategy of overlapping interactomes presents
a new type of synthesis. It is particularly effective in that their
two data sets are orthogonal in many respects. Phage display is
based on in vitro binding of short peptides, whereas the two-hybrid
approach assays in vivo binding between full-length proteins.
Moreover, the phage display network is computationally predicted but
uses relatively unambiguous consensus sequences, whereas the
two-hybrid network is experimentally derived but suffers from
appreciable false positives (23,
From a data-mining standpoint, the heterogeneous character and
variable quality of whole-genome information make integration
tricky. Consider combining "orthogonal" interactome data sets, such
as attempted by Tong et al., in a general sense. How might
one proceed formally? There are two extremes (see the figure,
below). At one extreme, the data sets have low false-negative but
high false-positive error rates. That is, each experiment almost
never misses real interactions but also finds many spurious ones. In
this situation, the benefit of integration comes from intersection:
Only interactions common to all are accepted, thus lowering the
combined error rate. Tong et al.'s approach fits this to
some degree. At the other extreme are data sets with few false
positives but low coverage of the space of interactions. The benefit
of integration then comes from the union: Any interaction found in
at least one data set is accepted. An earlier interactome analysis
followed this to some degree [HN12]
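The two extremes map directly onto set operations. As a minimal sketch, with made-up protein names and interactions treated as unordered pairs (an assumption for illustration; the data are not from any of the studies discussed):

```python
def as_pairs(edges):
    """Normalize interactions to unordered protein pairs."""
    return {frozenset(edge) for edge in edges}

# Hypothetical toy networks; "A".."D" are placeholder protein names.
phage_display = as_pairs([("A", "B"), ("A", "C"), ("B", "D")])
two_hybrid = as_pairs([("B", "A"), ("C", "D"), ("D", "B")])

# Many false positives, few false negatives: keep only shared links.
intersection = phage_display & two_hybrid

# Few false positives, low coverage: keep any link seen at least once.
union = phage_display | two_hybrid

print(len(intersection), len(union))  # 2 4
```

Using `frozenset` makes the pair ("A", "B") equal to ("B", "A"), which matters because the two experiments need not report partners in the same order.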
Overlapping nets. Two
different extremes in integrating interactomes. The combined network
on the left is the union of those interactomes with low
false-positive but high false-negative rates, whereas the combined
network on the right is the intersection of interactomes with low
false-negative but high false-positive rates. Circles represent
proteins; links, interactions; and dotted lines, known associations.
Thicker links indicate lower false-positive rates. More effective
rules for combining networks than union and intersection take into
account the different error rates associated with each link type.
In most practical situations, the optimal way to integrate data
sets is somewhere between these extremes. The task is to combine
data sets with varying error rates and coverage. Accordingly, the
rules for identifying positives become more complicated. Instead of
simple unions or intersections, different combinations of positive
and negative signals from the data sets should be considered, taking
into account their relative false-positive and -negative rates.
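One way to make such a rule concrete is to let each data set contribute a log-likelihood ratio that depends on its own error rates, and to sum the contributions. The sketch below is a naive-Bayes-style illustration that assumes the data sets are independent; the data set names and error rates are invented, and this is not presented as the method of any study cited here.

```python
from math import log

def log_odds_update(observed: bool, fn_rate: float, fp_rate: float) -> float:
    """Log-likelihood ratio contributed by one data set's call.

    Both rates must lie strictly between 0 and 1.
    """
    if observed:
        return log((1 - fn_rate) / fp_rate)
    return log(fn_rate / (1 - fp_rate))

def combined_score(calls, error_rates):
    """Sum per-data-set log-likelihood ratios, assuming independence."""
    return sum(
        log_odds_update(calls[name], *error_rates[name]) for name in calls
    )

# Hypothetical (false-negative rate, false-positive rate) per data set.
error_rates = {"noisy_screen": (0.05, 0.40), "strict_screen": (0.60, 0.02)}

both = combined_score({"noisy_screen": True, "strict_screen": True}, error_rates)
strict_only = combined_score({"noisy_screen": False, "strict_screen": True}, error_rates)
noisy_only = combined_score({"noisy_screen": True, "strict_screen": False}, error_rates)
neither = combined_score({"noisy_screen": False, "strict_screen": False}, error_rates)
```

With these invented rates, a positive call from the strict screen counts for much more than one from the noisy screen, and a negative call from the noisy screen (which rarely misses) counts heavily against an interaction. Union and intersection are the limiting cases in which every call is weighted equally.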
A practical illustration of the power of interrelating genomic
data for yeast [HN13]
(see the figure, below) shows the degree to which one can find
protein-protein associations in known protein complexes (5,
by stepwise integration of increasing amounts of orthogonal genomic
We start by considering associations that can be found from gene
expression correlations over the cell cycle (27);
then we incorporate those derived from a second but different
microarray experiment, which provides a series of gene expression
changes after specific genes have been knocked out (28).
Finally, we add associations predicted from genomic measurements of
essentiality and localization (20,
As we integrate more information, the total number of correctly
identified interactions rises (especially for the union of the
predicted associations). Simultaneously, the error rate decreases.
Moreover, if we focus just on the intersection of the predicted
associations, the error rate falls even more.
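The two quantities tracked in this illustration, coverage of the known associations and the error rate among predictions, are easy to state precisely. The following sketch uses invented toy sets rather than the yeast data, and it treats any predicted pair absent from the gold standard as an error, which is a simplifying assumption.

```python
def coverage(predicted, gold):
    """Fraction of gold-standard associations that were predicted."""
    return len(predicted & gold) / len(gold)

def error_rate(predicted, gold):
    """Fraction of predictions not found in the gold standard."""
    return len(predicted - gold) / len(predicted) if predicted else 0.0

def pairs(edges):
    return {frozenset(edge) for edge in edges}

# Hypothetical data: known complexes and two sets of predicted associations.
gold = pairs([("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")])
cell_cycle = pairs([("A", "B"), ("B", "C"), ("A", "E")])
knockout = pairs([("B", "C"), ("C", "D"), ("B", "E")])

union = cell_cycle | knockout
intersection = cell_cycle & knockout

for name, predicted in [("first set", cell_cycle), ("union", union),
                        ("intersection", intersection)]:
    print(name, coverage(predicted, gold), round(error_rate(predicted, gold), 2))
```

In this toy case the union raises coverage from 0.5 to 0.75, while the intersection drives the error rate to zero at the cost of coverage. The real data sets behave according to their own error structure, so the numbers here illustrate the bookkeeping only.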
A net profit from integration.
Integrating progressively more orthogonal information
identifies more and more associations (5-7).
From the known complexes in yeast, there are 8250 protein-protein
The y axis shows the percentage of these identified by
disparate genomic data (that is, coverage). The x axis
shows the progressive addition of genomic data. The first two bars
represent the protein associations with the most significant
expression correlation in two different microarray sets (27,
The next two represent adding the associations predicted because
both proteins were similarly essential for cell survival
("essentiality") or had similar subcellular localization (20,
The color shading on the bars roughly indicates false-positive rates
throughout the integration. Although it is reasonable that
associated components of complexes will have correlated expression
and similar localization and "essentiality," this is only weakly
predictive, generating many spurious positives. Consequently, the
"weak links" case in the right-hand panel of the previous figure
mostly applies, and the shading indicates how intersection lowers
the error rate.
A future challenge will be to devise uniform frameworks for
integrating information from both high-throughput and traditional
biochemical approaches. One aspect of this will be to develop better
databases for storing and querying heterogeneous information. In
particular, databases will need to be more precise in their
treatment of errors and also interface better with the information
in journals. Another aspect will be to develop data-mining
strategies that can operate with these databases, integrating many
different genomic features into results pertinent to biology.
Genomic features can be of very different character (from hundreds
of "Booleans" for interactions, to tens of thousands of real-number
vectors for expression profiles), and a central issue in integration
is determining how to weight each feature relative to the others. In
this regard, some machine-learning techniques, such as Bayesian
networks and decision trees, are quite powerful, whereas others, for
example, support-vector machines, are more problematic.
Finally, we will need to come up with a more systematic
definition of gene function, the ultimate aim of proteomic
investigation. To many scientists, what constitutes "function" is a
phrase or name often in nonsystematic terminology, such as "ATPase"
or "suppressor of white apricot [HN16]."
Such descriptions are sufficient for single-molecule work but cannot
be scaled up to the genomic level. More systematic attempts have
been made to place proteins within a hierarchy of standard
functional categories or to connect them in overlapping networks of
varying types of association [HN17]
These networks can obviously include protein-protein interactions,
the subject of Tong et al.'s work. More broadly, they can
include pathways, regulatory systems, and signaling cascades. How
far are we able to go with this network approach? Perhaps, in the
future, the systematic combination of networks may provide for a
truly rigorous definition of protein function [HN18].
1. A. H. Y. Tong et al., Science (2002); published online 13 December 2001
2. Interaction data from Biomolecular Interaction Network
3. D. Greenbaum et al., Genome Res. 11, 1463 (2001).
4. R. Jansen, M. Gerstein, Nucleic Acids Res. 28, 1481 (2000).
5. J. Qian et al., J. Mol. Biol. 314, 1053 (2001).
6. R. Jansen et al., Genome Res. 12, 37 (2002).
7. Details at http://genecensus.org/integrate/interactions.
8. B. A. Cohen et al., Nature Genet. 26, 183 (2000).
9. A. Drawid et al., Trends Genet. 16, 426 (2000).
10. P. Tamayo et al., Proc. Natl. Acad. Sci. U.S.A. 96, 2907 (1999).
11. M. B. Eisen et al., Proc. Natl. Acad. Sci. U.S.A. 95, 14863 (1998).
12. S. Tavazoie et al., Nature Genet. 22, 281 (1999).
13. M. Gerstein, R. Jansen, Curr. Opin. Struct. Biol. 10, 574 (2000).
14. M. Brown et al., Proc. Natl. Acad. Sci. U.S.A. 97, 262 (2000).
15. H. Ge et al., Nature Genet. 29, 482 (2001).
16. F. Roth et al., Nature Biotechnol. 16, 939 (1998).
17. S. Gygi et al., Mol. Cell. Biol. 19, 1720 (1999).
18. B. Futcher et al., Mol. Cell. Biol. 19, 7357 (1999).
19. A. Brazma et al., Genome Res. 8, 1202 (1998).
20. A. Drawid, M. Gerstein, J. Mol. Biol. 301, 1059 (2000).
21. E. Marcotte et al., Science
22. E. Marcotte et al., Nature 402, 83 (1999).
23. T. Ito et al., Proc. Natl. Acad. Sci. U.S.A. 98, 4569 (2001).
24. P. Uetz et al., Nature 403, 623 (2000).
25. B. Schwikowski et al., Nature Biotechnol. 18, 1257 (2000).
26. H. W. Mewes et al., Nucleic Acids Res. 28, 37 (2000).
27. R. J. Cho et al., Mol. Cell 2, 65 (1998).
28. T. R. Hughes et al., Cell 102, 109 (2000).
29. E. A. Winzeler et al., Science
30. P. Ross-Macdonald et al., Nature 402, 413 (1999).
31. D. Eisenberg et al., Nature 405, 823 (2000).
32. M. Ashburner et al., Nature Genet. 25, 25 (2000).
The authors are in the Department of Molecular Biophysics and
Biochemistry, Yale University, New Haven, CT 06520, USA.
Related Resources on the World Wide Web
- Dictionaries and
- The On-line
Medical Dictionary is provided by CancerWEB.
- D. Glick's Glossary of
Biochemistry and Molecular Biology is made available by Portland
- A glossary
of genetic terms is provided by the National Human Genome Research
- A genetics
glossary is provided by the Biology
Teaching Organisation, University of Edinburgh.
- A genomics
lexicon is offered on the Genomics: A Global Resource
Web site, which is maintained by the Pharmaceutical
Research and Manufacturers of America.
- The Cambridge Healthtech
Institute provides a collection
of genomics glossaries and taxonomies.
- Web Collections, References, and
- Links to the
Genetic World, provided by the Human Genome Program of the
U.S. Department of Energy (DOE), includes collections of Internet
resources related to proteomics
glossary is also provided.
an Internet resource for molecular biology, biocomputing,
medicine, and biology, is provided by Infobiogen.
Servers on the Web is provided by Atelier
BioInformatique, Marseille, France. Collections of links to guides
and tutorials in biology and analysis
tools tutorials are provided.
- The CMS Molecular Biology
Resource is a compendium of electronic and Internet-accessible
tools and resources for molecular biology, biotechnology,
molecular evolution, biochemistry, and biomolecular modeling.
is provided by the UK Human
Genome Mapping Project Resource Centre.
- Science's Functional
Genomics Web site provides links to news, educational, and
scientific resources in genomics and postgenomics.
is a semiautomated index of resources in the fields of
bioinformatics and molecular biology maintained by the European Bioinformatics
Institute (EBI). EBI provides a proteome analysis
- The U.S. National
Center for Biotechnology Information (NCBI) is a resource for
molecular biology information and databases. A genetics
review is provided.
- The ExPASy (Expert
Protein Analysis System) Molecular Biology Server of the Swiss Institute of
Bioinformatics is dedicated to the analysis of protein
sequences and structures. Amos' WWW links page
provides pointers to Internet information sources for life
scientists with an interest in biological macromolecules.
- The International Society for
Computational Biology provides links to online courses, as
well as links to
Internet sites that collect links related to computational
- Online Texts and Lecture
- The MIT
Biology Hypertextbook is an introduction to molecular biology.
- A BioComputing
Hypertext Coursebook is provided by the VSNS
BioComputing Division, University of Bielefeld, Germany.
Genetics is a tutorial provided by U. Melcher,
Department of Biochemistry and Molecular Biology, Oklahoma State
Hewlett, Department of Molecular and Cellular Biology,
University of Arizona, offers lecture
notes for a molecular
- The Weizmann Institute of
Science, Rehovot, Israel, offers an introduction
to bioinformatics and lecture
notes for a course
on bioinformatics and computational genomics.
- M. Gribskov, San
Diego Supercomputer Center, and D.
Smith, Division of Biology, University of California, San
Diego, provide lecture notes for a course
on bioinformatics. D. Smith
notes for a course on molecular
Davis, Department of Biology, City University of New York
Graduate Center, offers lecture
notes and other resources for a course
on bioinformatics and genomics. A bioinformatics
glossary is provided.
Gerstein, Departments of Molecular Biophysics and Biochemistry
and Computer Science, and Bioinformatics Research
Group, Yale University, provides lecture notes for a course on
genomics and bioinformatics.
Church, Department of Genetics and Lipper Center for
Computational Genetics, Harvard Medical School, provides lecture
notes for a course
on genomics and computational biology.
- J. Ullman,
Department of Computer Science, Stanford University, provides lecture
notes (in Adobe Acrobat format) for a course
on data mining.
Altman, Biomedical Informatics Training Program, Stanford
University School of Medicine, provides lecture slides and
Internet resources for a course
on representations and algorithms for computational molecular
- General Reports and Articles
- The DOE's Human Genome
Program presents a 2001
primer titled "Genomics and its impact on medicine and
society" with a section on genome
basics and a dictionary.
Wuchty, Bioinformatics and
Computational Biochemistry Group, European
Media Laboratory, Heidelberg, Germany, makes available (in
Adobe Acrobat format) an article
preprint titled "Interaction and domain networks of yeast."
- The 16 February 2001 issue of Science had a Viewpoint
article by S. Fields titled "Proteomics in genomeland." The 27
October 2000 issue had a Techview
article by T. Attwood titled "The Babel of bioinformatics."
The 24 October 1997 issue (a special
genome issue on building gene families) had a Viewpoint
article by P. Hieter and M. Boguski titled "Functional
genomics: It's all how you read it."
- The 10 April 2001 issue of the Proceedings of the National Academy
of Sciences had a commentary
by T. Hazbun and S. Fields titled "Networking proteins in yeast."
Note: The authors
of the Perspective have provided most of the following
- As of December 2001, the GenBank
repository of nucleic acid sequences contained over 800
whole genome sequences. Structural
Genomics Initiatives aim at large-scale determination of
protein structures. Functional
genomics aims to elucidate the function of all the proteins in a genome,
typically through mRNA
interaction assays, gene disruption, or
- Interactomes. The Biomolecular Interaction Network
Database (BIND) is a database designed to store full
descriptions of interactions, molecular complexes and pathways; as
of December 2001, it stored records of 5940 interactions,
and 7 pathways.
elegans interactome project, directed by M.
Vidal, Dana Farber Cancer Institute and Harvard Medical
School, aims to generate a comprehensive protein-protein
interaction map for C.
elegans. The ProChart
database from AxCell Biosciences
is currently one of the most comprehensive databases of human
protein interactions; a 1 June 2001 press release
about the product is available. The Proteomic Pathway
Project at BIOCARTA
provides dynamic graphic models for gene interaction. Kohn
Molecular Interaction Maps, designed by K. Kohn,
Laboratory of Molecular Pharmacology, National Cancer Institute,
are interaction maps of mammalian cell cycle control and DNA
repair systems that are amenable to further development.
- A. H. Y. Tong and C. Boone
are in the Banting and
Best Department of Medical Research and Graduate Department of
Molecular and Medical Genetics, University of Toronto. B.
Nelson and M. Evangelista are in the Department of Biology,
Queen's University, Kingston, Ontario, Canada. G. Nardelli,
B. Brannetti, L.
Castagnoli, S. Ferracuti, S. Paoluzi, M.
Quondam, A. Zucconi, and G.
Cesareni are in the Department of
Biology, University of
Rome Tor Vergata. S.
Fields is in the Department of Genome
Sciences, University of Washington, and is an investigator
at the Howard Hughes Medical Institute. B. Drees (formerly
at the Fields lab) is at the Institute for
Systems Biology, Seattle. G.
Bader and C.
Hogue are at the Samuel
Lunenfeld Research Institute, Mount Sinai Hospital, Toronto,
and in the Department of
Biochemistry, University of Toronto.
- Phage display technology. Dyax Corporation provides
an introduction to phage display.
Kontermann, Institute of Molecular Biology and Tumor Research,
University of Marburg, Germany, provides an introduction
to phage display. The German Research Center for
Biotechnology makes available a research
presentation by J. Collins and P. Röttgen titled "Evolutive
phage-display -- A method as basis for a company" that appeared in
annual report. The March/April 1999 Molecular
Biology Newsletter from the Molecular
Biology Program, University of Missouri, Columbia, had an article
by G. Smith and S. Deutscher titled "Searching vast tracts of
sequence space with phage display." G. Smith,
Division of Biological Sciences, University of Missouri, Columbia,
offers a phage
display Web page.
- The yeast two-hybrid system. The yeast
two-hybrid (YTH) system was first established by O. Song and
S. Fields in a 20
July 1989 article in Nature titled "A novel genetic
system to detect protein-protein interactions." The Fields Lab
home page at the Department of Genome
Sciences, University of Washington, provides an introduction
to the two-hybrid system. A presentation on the two-hybrid
system for the detection of protein-protein interactions is
provided by the companion Web
site for S. Gilbert's Developmental Biology. R. Brent's
lab at the Molecular Sciences
Institute, Berkeley, has developed a variant
of the system, and S. Elledge's
lab at the Baylor College
of Medicine has contributed substantially to this technology.
Lab home page at the Center For
Molecular Medicine and Genetics, Wayne State University School
of Medicine, makes available a chapter
by R. Finley and R. Brent titled "Two-hybrid analysis of genetic
regulatory networks" and a presentation
titled "Examining the function of proteins and protein networks
with the yeast two-hybrid system" as well as collections of two-hybrid
protocols and links.
C. Götting's Biomedpage
provides a collection
of Internet links related to yeast two-hybrid systems.
- The SH3
domain is one of the best-characterized members of the
growing family of protein-interaction
modules. The basic fold contains five anti-parallel
beta-strands packed to form two perpendicular beta-sheets. It
is cataloged as entry IPR001452
in InterPro. A
presentation on the SH3
domain is provided by the Protein
Kinase Resource's 3-D Structure Web site.
- Combination and analysis of data
sets. Standard methods for expression data include hierarchical
maps, and principal
component analysis. I.
Jonassen's Web page provides references on these
methods. Expression analysis software has been developed using
various combinations of these methods, such as Cluster and
TreeView, EPCLUST, and CLEAVER. One new
combination method is local
clustering, which finds time-shifted and inverted
relationships. Expression databases include GEO (Gene Expression
Omnibus, from NCBI), ExpressDB (from
the Harvard-Lipper Center
for Computational Genetics), GeneX (from the National Center for Genome
Resources), the Stanford
Microarray Database, and ArrayExpress (from
EBI). A comprehensive list of
links to expression
databases and software is provided by the EBI Expression Profiler Web page.
- This estimate is built from a simple
Bayesian network model of three binary-type data sources, with
random noise. An example of binary-type data is protein-protein
interactions ("YES, there is an interaction" or "NO, there
isn't"). The Gerstein Lab
Web page provides a detailed discussion
of the calculation with a spreadsheet provided.
- Correlation of expression data. There
have been efforts to correlate expression data with transcriptional
categories, protein folds
and families, and protein
abundance; to map between transcriptome and
interactome; to integrate sequence and expression data to
localization; and to combine whole-genome data to infer
functional linkages of proteins. GenMAPP,
developed at the J. David
Gladstone Institutes, University of California, San Francisco,
was designed to visualize gene expression data on maps
representing biological pathways. Biomolecular Relations in
Information Transmission and Expression (BRITE) is a database
of binary relations involving genes and proteins, including
interactions that underlie the pathway
diagrams of KEGG
(Kyoto Encyclopedia of Genes and Genomes), protein-protein
interactions by Y2H systems, sequence similarity relations by SSEARCH,
and expression similarity relations by microarray
gene expression profiles.
- A myriad of terms
suffixed with "ome" have been coined to define the varied
cellular populations and subpopulations, ranging from the most
which first appeared in the biomedical literature in 1932, to the
newly coined terms that have yet to gain wide recognition, such as
The Cambridge Healthtech
Institute provides an -omes and
- Synthesis of more than two types of
genomic information. Eisenberg
et al. ("Protein function in the post-genomic era" in
the 15 June 2000 issue of Nature) defined functional
protein networks combining experimentally determined
interactions from DIP
(Database of Interacting Proteins) with those computationally
predicted from protein
phylogenetic profiles and expression
data. The union of various data sets was counted while higher
reliability was given to interactions predicted by two or more
methods. The Allfuse
Database developed by EBI's Computational Genomics
Group aims to find functional
associations of proteins in complete genomes through gene
- A previous interactome analysis. Schwikowski
et al. ("A network of protein-protein interactions in
yeast" by B.
Schwikowski, P. Uetz, and S. Fields in the December 2000
issue of Nature Biotechnology) obtained a single
large network from the union of recorded interactions in the
yeast genome database and DIP and data from Y2H
experiments carried out by Ito
et al. and Uetz
et al. and thereby computed a map
of interactions between functional groups.
- The yeast S. cerevisiae and its
Sherman, Department of Biochemistry and Biophysics, University
of Rochester Medical School, offers a presentation
titled "An introduction to the genetics and molecular biology of
the yeast Saccharomyces cerevisiae." J.
Huberman, Department of Cancer Genetics, Roswell Park Cancer
Institute, offers lecture
notes on yeast genetics. The Yeast Resource
Center at the University of Washington provides information to
facilitate the identification and characterization of protein
complexes in the yeast Saccharomyces cerevisiae. The Munich Information Center for Protein
Sequences (MIPS) provides the Comprehensive
Yeast Genome Database (CYGD), as well as an introduction
to S. cerevisiae, a collection
of Internet links, and a glossary.
Genome Database (SGD) is maintained by the Department of
Genetics, Stanford University School of Medicine; a collection
of links to yeast
WWW sites is maintained. SGD provides Sacch3D, which
presents structural information for every protein in the yeast
genome. The complete S.
cerevisiae genome is made available by NCBI. YPD (Yeast
Proteome Database) is provided by Proteome, Inc. The
cerevisiae Cluster Analysis Server, provided by the Yale Bioinformatics Research
Group, offers graphic views of statistical data in yeast
genome analysis; a collection of yeast
links is provided. The Yale Genome Analysis
Center offers the TRIPLES
database, which provides public access to the data generated
from the center's transposon-tagging study of the yeast genome.
Gateway provides a yeast
- Further details of the graph in this
figure are available at genecensus.org/integrate/interactions
site). Four data sets were integrated: (1) yeast cell cycle
data (time course expression data); (2) Rosetta knockout data at ExpressDB
(non-time course expression data); (3) essentiality
data from MIPS
CYGD and (in conditional form) from the phenotype-profiling transposon
experiments of M.
Snyder; and (4) localization data from a localization Web site
(set #4), which merges data from MIPS
CYGD, ExPASy's SWISS-PROT, and the Snyder lab. YPD also
contains extensive localization data.
- Bayesian networks. Bayesware Limited provides
knowledge discovery and data mining software based on Bayesian
Murphy, Department of Computer Science, University of
California, Berkeley, presents a tutorial
on graphical models and Bayesian networks, as well as a collection
of links to software packages for graphical models/Bayesian
Friedman, School of Computer Science and Engineering, Hebrew
University, Jerusalem, makes available a tutorial
on Bayesian networks, as well as an article
by N. Friedman, I. Nachman, and D. Pe'er titled "Using Bayesian
networks to analyze expression data" (the article
is also available from the NEC Research Institute's ResearchIndex literature
database). The Laboratory for
Advanced Database Systems, Department of Computer Science,
University of Kentucky, provides links to presentations
on Bayesian nets.
- The name given to a gene by its
discoverer may not be related to its function, and sometimes can
be misleading, as discussed in the Science
Observer column by M. Vacek titled "A gene by any other name"
in the November-December 2001 issue of American
- Functional categories for proteins.
Hierarchical systems to represent protein function for single
organisms include the CYGD
functional classification catalog (yeast), GenProtEC (E.
(EBI mirror site)
(Drosophila), and EGAD (human
expressed sequence tags). The Gene Ontology Project
provides a set of structured vocabularies for specific biological
domains that can be used to describe gene products in any
organism. ExPASy's ENZYME nomenclature
database classifies enzyme function. MetaCyc: Metabolic
(What Is There?), and KEGG describe gene
function in terms of metabolic pathways and reactions. Protein
function can also be inferred from interaction maps available at
DIP, ProNet Online, and BIND.
- Groups working to define protein
function. Of the many groups that are currently working on
inferring and defining protein function, here are some
representative ones: S.
Genomics Research Group, University of California, Berkeley;
and Modelling Group, Department of Biochemistry and Molecular
Biology, University College London; B. Honig, Department
of Biochemistry and Molecular Biophysics, Columbia University; A. Rzhetsky
and W. Noble (with
P. Pavlidis), Columbia Genome
Center, Columbia University; D.
Eisenberg, UCLA-DOE Laboratory of Structural Biology and
Molecular Medicine; and E.
Koonin at NCBI. The
Gerstein Lab's integrated
function site is under construction.
M. Gerstein, N. Lan, and R. Jansen are in the Department of Molecular
Biophysics and Biochemistry, Yale University.
Related articles in Science:
- A Combined Experimental and Computational Strategy to
Define Protein Interaction Networks for Peptide Recognition
- Amy Hin Yan Tong, Becky Drees, Giuliano Nardelli, Gary D.
Bader, Barbara Brannetti, Luisa Castagnoli, Marie Evangelista,
Silvia Ferracuti, Bryce Nelson, Serena Paoluzi, Michele Quondam,
Adriana Zucconi, Christopher W. V. Hogue, Stanley Fields,
Charles Boone, and Gianni Cesareni
Science 2002 295: 321-324.
(in Reports)
Number 5553, Issue of 11 Jan 2002, pp. 284-287.
Copyright © 2002 by The American Association for the
Advancement of Science. All rights reserved.