Mark Gerstein, Ning Lan & Ronald Jansen
Molecular Biophysics & Biochemistry Dept., Yale University, 266 Whitney Ave., New Haven, CT 06520.
Email: Mark.Gerstein@yale.edu
With genome sequences as intellectual inspiration and practical scaffolds, scientists are increasingly performing experiments on all genes in a genome. [HN1] The integration of the resulting genome-wide information into useful definitions of gene function constitutes a principal challenge for post-genomic biology. Exactly what form such definitions will take is still an open question. Comprehensive networks of protein-protein interactions, interactomes, provide a valuable way of circumscribing and partially defining gene function. [HN2]
On page ??? of this issue, Tong et al. (1) describe a systematic approach for identifying interaction networks for peptide-recognition domains. They break new ground in the way they combine "orthogonal" datasets. Specifically, they intersect two different interactomes. The first is derived from screening phage-display peptide libraries to find consensus sequences that bind particular peptide-recognition domains. The resulting network connects proteins containing recognition domains to those containing the consensus. It partially defines binding sites in some of the proteins and represents a novel use of phage-display technology. The second network comes from experimentally testing each peptide-recognition module for association with possible protein-binding partners using the two-hybrid approach. [HN3] Tong et al. apply their approach to determine interacting partners for SH3 domains in yeast. [HN4] These domains make good targets because of their prevalence and involvement in a number of important biological processes, from cytoskeleton reorganization to signal transduction.
The power of Tong et al.'s approach is manifest in interpreting large genomic datasets, particularly in reducing their noise. One of the fallacies in dealing with genomic datasets is ascribing too much meaning to individual datapoints. Many datasets (e.g. expression) contain so much noise that it can be difficult to draw reliable conclusions for specific genes. However, they still contain much useful information statistically, in terms of broad trends. To get at this, one must aggregate data. First, and most simply, one can combine replicates of an experiment, although this does not remove systematic errors. Second, in somewhat more elaborate fashion, one can collect many individual measurements on different proteins into aggregate "proteomic classes" (e.g. functional categories) and compare these (2). [HN5] Finally, Tong et al.'s study points to perhaps the most powerful approach: interrelating and integrating orthogonal (i.e. fundamentally different) information. In the abstract one can easily demonstrate that combining independent datasets results in a lower overall error rate. For instance, combining three independent binary-type datasets, each with an error rate of 10% (for both false positives and false negatives), reduces the overall error rate to 2.8% (5). [HN6] Moreover, interrelating two different types of whole-genome data also enables one to discover non-obvious and potentially significant relationships -- e.g. between expression and chromosomal positioning or subcellular localization (3).
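As a worked check of the 2.8% figure, the sketch below assumes that the three datasets are combined by majority vote and that their errors are independent. The text gives the numbers but not the combination rule (details are in ref. 5), so the voting rule and the function name majority_vote_error are illustrative assumptions rather than the authors' stated procedure.

```python
from itertools import product


def majority_vote_error(error_rates):
    """Probability that a majority vote over independent binary datasets is wrong.

    error_rates: per-dataset probability of giving the wrong answer
    (assumed equal for false positives and false negatives).
    """
    n = len(error_rates)
    total = 0.0
    # Enumerate every pattern of right (0) / wrong (1) answers.
    for outcome in product([0, 1], repeat=n):
        if sum(outcome) > n / 2:  # the majority of datasets are wrong
            p = 1.0
            for wrong, e in zip(outcome, error_rates):
                p *= e if wrong else (1.0 - e)
            total += p
    return total


combined = majority_vote_error([0.10, 0.10, 0.10])
print(f"{combined:.3f}")  # 3 * 0.1^2 * 0.9 + 0.1^3 = 0.028
```

Under these assumptions the vote is wrong only when at least two of the three datasets err, which gives 0.027 + 0.001 = 0.028.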
There have been a number of attempts at interrelating information from different genomic datasets before, much of it starting from expression experiments. For instance, expression data were initially analyzed by a variety of supervised and unsupervised methods (e.g., hierarchical trees, k-means, self-organizing maps, and support-vector machines) and compared to functional categories (4). [HN5] They were also interrelated with transcription-factor binding sites, protein families, protein-protein interactions, and protein abundance (2,5,6) [HN7]. In a shorthand sense, much of this can be thought of as interrelating the transcriptome with other "omes" -- e.g. proteome, translatome, secretome, and interactome (2) [HN8].
There are considerably fewer examples of the synthesis of more than two types of whole-genome data [HN9]. One initial attempt combined expression correlations, phylogenetic profiles and patterns of domain fusion to predict protein function (7). A Bayesian framework integrated expression, essentiality, and sequence motif data for the prediction of protein subcellular localization (8). Tong et al.'s strategy of overlapping interactomes presents a new type of synthesis. It is particularly effective in that their two datasets are orthogonal in many respects. Phage-display is based on in vitro binding of short peptides, whereas two-hybrid uses in vivo binding and full-length proteins. Moreover, the phage-display network is computationally predicted but uses relatively unambiguous consensus sequences, whereas the two-hybrid one is experimentally derived but suffers from appreciable false positives (9).
From a data-mining standpoint, the heterogeneous character and variable quality of whole-genome information make integration tricky. Consider combining "orthogonal" interactome datasets, as attempted by Tong et al., in a general sense. How might one proceed formally? As indicated in the top figure, there are two extremes. At one extreme, the datasets have low false-negative but high false-positive error rates. That is, each experiment almost never misses real interactions but also finds many spurious ones. In this situation the benefit of integration comes from intersection: only interactions common to all datasets are accepted, thus lowering the combined error rate. Tong et al.'s approach fits this to some degree. At the other extreme are datasets with few false positives but low coverage of the space of interactions. The benefit of integration then comes from the union: any interaction found in at least one dataset is accepted. An earlier interactome analysis followed this to some degree (10). [HN10]
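The two extremes can be sketched with ordinary set operations. The protein names and the two small networks below are hypothetical, chosen only to show how intersection and union behave; interactions are stored as unordered pairs so that A-B and B-A count as the same link.

```python
def normalize(pairs):
    """Store each interaction as an unordered pair of protein names."""
    return {frozenset(p) for p in pairs}


# Hypothetical toy networks (not data from Tong et al.).
phage_display = normalize([("A", "B"), ("A", "C"), ("B", "D")])
two_hybrid = normalize([("B", "A"), ("C", "D"), ("B", "D")])

# Noisy, high-coverage datasets: keep only links seen in both.
intersection = phage_display & two_hybrid

# Reliable, low-coverage datasets: keep links seen in either.
union = phage_display | two_hybrid

print(sorted(sorted(link) for link in intersection))
print(len(union))
```

Here the intersection keeps the two links that both networks report, while the union keeps all four distinct links.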
In most practical situations, the optimal way to integrate datasets is somewhere between these extremes. The task is to combine datasets with varying error rates and coverage. Accordingly, the rules for identifying positives become more complicated. Instead of simple unions or intersections, different combinations of positive and negative signals from the datasets should be considered, taking into account their relative false-positive and false-negative rates.
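One possible way to weigh positive and negative signals by their error rates is a naive Bayes log-odds score, sketched below. This is an illustrative choice, not a rule given in the text: it assumes the datasets err independently, and the dataset names, the rates, and the function log_odds_score are all hypothetical. Here the false-positive rate means the chance of a signal when there is no interaction, and the false-negative rate means the chance of no signal when there is one.

```python
import math


def log_odds_score(signals, rates, prior=0.5):
    """Log-odds that an interaction is real, given per-dataset signals.

    signals: dict mapping dataset name -> True (link seen) / False (not seen)
    rates:   dict mapping dataset name -> (false_positive_rate, false_negative_rate)
    prior:   assumed prior probability that a candidate pair interacts
    """
    score = math.log(prior / (1.0 - prior))
    for name, seen in signals.items():
        fp, fn = rates[name]
        if seen:
            # P(signal | real) / P(signal | not real)
            score += math.log((1.0 - fn) / fp)
        else:
            # P(no signal | real) / P(no signal | not real)
            score += math.log(fn / (1.0 - fp))
    return score


# Hypothetical error rates: one noisy high-coverage set, one strict low-coverage set.
rates = {"noisy": (0.30, 0.05), "strict": (0.01, 0.50)}

both = log_odds_score({"noisy": True, "strict": True}, rates)
strict_only = log_odds_score({"noisy": False, "strict": True}, rates)
noisy_only = log_odds_score({"noisy": True, "strict": False}, rates)
neither = log_odds_score({"noisy": False, "strict": False}, rates)
print(both, strict_only, noisy_only, neither)
```

With these made-up rates, a link seen only in the strict dataset scores higher than one seen only in the noisy dataset, and a link seen in both scores highest. Thresholding the score at different values moves the combined rule between the union and intersection extremes.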
The bottom figure provides a practical illustration of the power of interrelating genomic data for yeast. It shows the degree to which one can find protein-protein associations in known protein complexes (11) based on integrating, in stepwise fashion, increasing amounts of orthogonal genomic information. We start by considering associations that can be found from expression correlations over the cell cycle; then we incorporate those derived from a second but different microarray experiment, giving responses to knockouts (12). Finally, we add associations predicted from genomic measurements of essentiality and localization (8,11,13). As we integrate more information, the total number of correctly identified interactions rises (especially for the union of the predicted associations) [HN11]. Simultaneously, the error rate decreases. Moreover, if we focus just on the intersection of the predicted associations, the error rate falls even more.
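The quantities tracked in this illustration can be sketched as follows: coverage is the fraction of known associations that are recovered, and the error is taken here as the fraction of predictions that are not among the known associations. The toy pairs are hypothetical and the function name coverage_and_error is an assumption; the actual calculation is described in ref. 5.

```python
def pair(a, b):
    """Unordered protein pair."""
    return frozenset((a, b))


def coverage_and_error(predicted, known):
    """Return (fraction of known links found, fraction of predictions not known)."""
    hits = predicted & known
    coverage = len(hits) / len(known)
    error = 1.0 - len(hits) / len(predicted) if predicted else 0.0
    return coverage, error


# Hypothetical gold standard and two hypothetical prediction sets.
known = {pair("A", "B"), pair("A", "C"), pair("B", "C"), pair("D", "E")}
cell_cycle = {pair("A", "B"), pair("A", "C"), pair("A", "F"), pair("D", "G")}
knockout = {pair("A", "B"), pair("A", "C"), pair("B", "C"), pair("E", "G")}

union_result = coverage_and_error(cell_cycle | knockout, known)
intersection_result = coverage_and_error(cell_cycle & knockout, known)
print(union_result, intersection_result)
```

In this toy case the union recovers more of the known associations, while the intersection discards the spurious links that only one dataset reports.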
A major future challenge will be devising uniform frameworks for integrating information, both from high-throughput and traditional biochemical approaches. One aspect of this will be developing better databases for storing and querying heterogeneous information. In particular, databases will need to be more precise in their treatment of errors and also interface better with journals. Another aspect will be developing data-mining approaches to operate on these databases, integrating many different genomic features into results pertinent to biological function. Genomic features can be of very different character (from hundreds of "Booleans" for interactions to tens of thousands of real-number vectors for expression timecourses), and a central issue in integration is determining how to weight each feature relative to the others. In this regard, some machine-learning techniques, such as Bayesian networks and decision trees, are quite powerful while others are more problematic (e.g. support-vector machines).
Finally, we will also need to come up with a more systematic definition of gene function, the ultimate aim of proteomic investigation. To many scientists, what constitutes "function" is a natural-language phrase or name, often in non-systematic terminology -- e.g. "ATPase" or "suppressor of white apricot" [HN12]. This approach is sufficient for single-molecule work but does not scale to the genomic level. More systematic attempts have been made to place proteins within a hierarchy of standard functional categories or to connect them into overlapping networks of varying types of association (11,14) [HN13]. These networks can obviously include protein-protein interactions, the subject of Tong et al.'s work. More broadly, they can include pathways, regulatory systems, and signaling cascades. How far are we able to go with this approach? Perhaps in the future the systematic combination of networks may provide for a truly rigorous definition of protein function [HN14].
(Top) Two different extremes in integrating interactomes. On left, the combined network is the union of those with low false-positive but high false-negative rates, whereas the combined network on right is the intersection of ones with low false-negative but high false-positive rates. Circles represent proteins; links, interactions; and dotted lines, known associations. (Thicker links indicate lower false-positive rates.) More effective rules for combining networks than union and intersection take into account the different error rates associated with each link type. (Bottom) How integrating progressively more orthogonal information identifies increasingly more associations (5). From the known complexes in yeast there are 8,250 protein-protein associations (11). The y-axis shows the percentage of these identified by disparate genomic data (i.e. coverage). The x-axis shows the progressive addition of genomic data. The first two bars represent the protein associations with most significant expression correlation in two different microarray sets (12). The next two represent adding the associations predicted because both proteins were similarly essential for cell survival or had similar localization (8,13). The shading on the bars roughly indicates false-positive rates throughout the integration. While it is reasonable that associated components of complexes will have correlated expression and similar localization and "essentiality", this is only weakly predictive, generating many spurious positives. Consequently, the "weak-links" case on the top-right applies, and one can see from the shading how intersection lowers the error rate.
1. A. Tong et al., Science Vol(?), page(?) (2002). (With interactions available from www.binddb.org.)
2. D. Greenbaum et al., Genome Res 11, 1463 (2001). R. Jansen, M. Gerstein, Nucleic Acids Res 28, 1481 (2000).
3. B. A. Cohen et al., Nat Genet 26, 183 (2000). A. Drawid et al., Trends Genet 16, 426 (2000).
4. P. Tamayo et al., Proc Natl Acad Sci U S A 96, 2907 (1999). M. B. Eisen et al., Proc Natl Acad Sci U S A 95, 14863 (1998). S. Tavazoie et al., Nat Genet 22, 281 (1999). M. Gerstein, R. Jansen, Curr Opin Struct Biol 10, 574 (2000). M. Brown et al., Proc Natl Acad Sci U S A 97, 262 (2000).
5. J. Qian et al., J Mol Biol 314, 1053 (2001). R. Jansen et al., Genome Res 12, 37 (2002). (These contain details of the approach in the figure and the calculations, with more at genecensus.org/integrate/interactions.)
6. H. Ge et al., Nat Genet 29, 482 (2001). F. Roth et al., Nat Biotechnol 16, 939 (1998). S. Gygi et al., Mol Cell Biol 19, 1720 (1999). B. Futcher et al., Mol Cell Biol 19, 7357 (1999). Brazma et al., Genome Res 8, 1202 (1998).
7. E. Marcotte et al., Science 285, 751 (1999). E. Marcotte et al., Nature 402, 83 (1999).
8. A. Drawid, M. Gerstein, J Mol Biol 301, 1059 (2000).
9. T. Ito et al., Proc Natl Acad Sci U S A 98, 4569 (2001). P. Uetz et al., Nature 403, 623 (2000).
10. Schwikowski et al., Nat Biotechnol 18, 1257 (2000).
11. H. W. Mewes et al., Nucleic Acids Res 28, 37-40 (2000).
12. R. J. Cho et al., Mol Cell 2, 65-73 (1998). T. R. Hughes et al., Cell 102, 109-26 (2000).
13. E. A. Winzeler et al., Science 285, 901-6 (1999). P. Ross-Macdonald et al., Nature 402, 413-8 (1999).
14. D. Eisenberg et al., Nature 405, 823 (2000). M. Ashburner et al., Nat Genet 25, 25 (2000).
Michael B. Eisen at Berkeley, microarray analysis.
Church Lab at Harvard University, quantitative whole genome and proteome measures in computational modeling of regulatory and enzymatic networks.
Michael Bittner at NHGRI, hybridization-based analytical tools.
Tim Hughes at the University of Toronto, microarray expression analysis.
Philip Green at Department of Genome Sciences, University of Washington, computational methods in functional genomics.
Aebersold Protein Laboratory at University of Washington, quantitative protein expression profiles.
Richard Young at MIT, Regulation of Gene Expression.
Michael Q. Zhang at Cold Spring Harbor Laboratory, a series of prediction tools.
Lincoln Stein Laboratory at Cold Spring Harbor Laboratory, various software for computational biology and genome bioinformatics.
Ruben Abagyan at the Scripps Research Institute, molecular modeling, bioinformatics, drug design and flexible docking.
Burkhard Rost at Columbia University, protein structure.
David Baker at University of Washington, Protein folding.
Helen Berman, president of American Crystallographic Association, protein structure and function.
Phil Bourne, Senior Principal Scientist at the San Diego Supercomputer Center, structural bioinformatics.
Stephen H. Bryant at NCBI, Comparative macromolecular 3D-structure analysis.
Fred Cohen at Cellular & Molecular Pharmacology Department at UCSF, protein folding.
Daniel Fischer at Ben Gurion University, Computational Structural Molecular Biology.
Richard A. Friesner at Columbia University, novel ab initio electronic structure methods.
Christopher Hogue at University of Toronto, Structural Bioinformatics.
Liisa Holm at Structural Genomics Group at EMBL-EBI, protein structure.
Barry Honig at Columbia University, computational biophysics and bioinformatics.
Sung-Hou Kim at Department of Chemistry of UC Berkeley, Structural Genomics.
Irwin D. Kuntz at UCSF, structure based ligand design and docking algorithms.
Guy Montelione at Rutgers, Protein NMR, structural genomics, proteomics, structural bioinformatics.
Janet Thornton at UCL, structural proteomics.
Eugene Shakhnovich at Harvard University, protein folding and de novo drug design.
Harold A. Scheraga at Cornell, protein folding.
Andrej Sali Lab at Rockefeller University, Protein Structure Theory.
George Rose at Johns Hopkins University, protein folding and structure.
J. Andrew McCammon at UCSD, theoretical chemistry and biochemistry.
Benoit Roux at Cornell, structure and function of ion channels.
Harel Weinstein at Mount Sinai School of Medicine, molecular recognition and signal transduction.
Bonnie Berger head of the Computation and Biology group at MIT, mathematical techniques in protein folding and genomics.
Mark Borodovsky at Georgia Institute of Technology, machine learning algorithms and biomolecular structure-function relationships.
B. S. Weir, Director of the Bioinformatics Research Center at North Carolina State University, statistical methodology for the interpretation of genetic data.
Edward H. Shortliffe at Columbia University, integrated decision-support systems in medical informatics.
Jun Liu at Department of Statistics, Harvard University, statistical methodology.
Gary Churchill at the Jackson Laboratory, Statistical Genetics.
Jonathan A. Eisen at TIGR, microbial genomics.
Imran Shah at Department of Computer Science & Engineering, University of Colorado at Denver, bioinformatic methods for integrating complex biological data.
Jurg Ott at Department of Genetics and Development, mathematical-statistical methods for human gene mapping.
Michael A. Newton, Department of Statistics and Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, empirical Bayesian methods in expression analysis.
Steven Salzberg, director of Bioinformatics, The Institute for Genomic Research, microbial genomics and machine learning.
Casimir Kulikowski at Department of Computer Science, Rutgers, knowledge-based problem solving methods, clustering, and predictive data mining for bioinformatics.
Mona Singh at MIT, machine learning and algorithms in computational biology.
Tamar Schlick at MIT, Mathematical Biology.
Kevin Karplus at UCSC compbio group, computational biology.
The Brutlag Bioinformatics Group at Biochemistry Department, Stanford University, prediction of structure and function from primary sequence.
Steven Benner at University of Florida, bioinformatics tools.
Chris Burge at MIT, RNA splicing.
Gary Stormo at Department of Genetics at Washington University School of Medicine, protein-DNA interactions, RNA structure prediction.
Larry Hunter at Center for Computational Pharmacology, University of Colorado at Denver, drug design through the use of high throughput molecular biology data.
National Resource for Cell Analysis and Modeling, home of the Virtual Cell.
Molecular Pattern Recognition Group at the Whitehead/MIT Center for Genome Research.
The Sanger Centre: Human Genome Project.
Caltech Initiative in Computational Molecular Biology.
The Program in Computational Biology at Johns Hopkins University.
Computational Molecular Biology Group at Hebrew University.
Penn State - Bioinformatics Group
Program in Proteomics and Bioinformatics at the University of Toronto.
Theoretical & Computational Biology Group at MRC.
UPenn. Center for Bioinformatics
Virtual Genome Center at University of Minnesota.
Eric Lander, Genome Center Director at Whitehead Institute
George Weinstock, co-director of Human Genome Sequencing Center at Baylor College of Medicine, functional genomics and informatics.
Complete S.cerevisiae Genome at NCBI.
YPD: Yeast Protein Database at Proteome, Inc.
SGD: Saccharomyces Genome Database at Stanford.
Sacch3D Structural Database at Stanford.
S. cerevisiae Promoter Database at Cold Spring Harbor.
Yeast Intron Sequences at University of California Santa Cruz.
Yeast Resource Center at the University of Washington.
Yeast Gene Duplications at Irish National Centre for BioInformatics.
The Saccharomyces Genome Deletion Project aims to generate a complete set of yeast deletion strains using a PCR-based gene deletion strategy, in order to assign function to the ORFs through phenotypic analysis of the mutants.
TRIPLES Database, developed by Yale Center for Medical Informatics, provides public access to disruption phenotype, localization and expression data generated in the transposon-tagging study of the yeast genome.
Yeast Cluster Analysis Server provides graphic views of statistical data in yeast genome analysis.
Comparison of the worm and yeast genomes in terms of folds
Comprehensive reports on the yeast orfome, focusing on the probability of each ORF being a real gene.
Pfam, a large database of protein domain family alignments, annotation and profile HMMs, is led by Alex Bateman at the Wellcome Trust Sanger Institute.
Cancer Genome Anatomy Project (CGAP), NCI program to study molecular changes in cancer.
Columbia University Microarray Project (CUMAP), advancing microarray technologies.
SPINE, tracking database for NESGC, and data mining approach for identifying feasible targets in high-throughput structural proteomics.
A glossary of genetic terms is provided by the National Human Genome Research Institute (NHGRI).
Genetics Glossary from the UK -- http://helios.bto.ed.ac.uk/bto/glossary/
Online Courses at the International Society for Computational Biology.
2001 Primer on Molecular Genetics from the U.S. Department of Energy.
Bioinformatics FAQ at Harvard Research Computing Center.
M.Gerstein, Department of Molecular Biophysics & Biochemistry and Computer Science, and Yale Bioinformatics Research Group, Yale University, provides lecture notes for a course on Genomics & Bioinformatics.
An elementary, step-by-step Genetics Tutorial at Rutgers University.
A gentle introduction course to Bioinformatics at University of Colorado at Denver.
ABIM Guides, Tutorials in Biology
Bioinformatics and Genomics course at UCLA.
Bioinformatics Applications Course at University of Georgia.
Statistical Methods in Bioinformatics : An Introduction.
Computational Genomics, taught by William Noble Grundy at Columbia University.
Computational Techniques for Genomics course at University of Minnesota.
Data Mining Lecture Notes by Jeffrey Ullman at Stanford University.
Representations and Algorithms for Computational Molecular Biology, taught by Russ Altman at Stanford University.
DIMACS Workshop on Integration of Diverse Biological Data.
Computational Molecular Biology course at Stanford University.
Genomics and Computational Biology, taught by George Church at Harvard University.
Algorithms for Molecular Biology, taught by Ron Shamir at Tel Aviv University.
Computational Molecular Biology at Brandeis University.
Topics in Genomics and Computational Biology, taught by Steven Brenner and Michael Eisen at Berkeley.
Computational Biology Seminar at University of Washington.
BLAST: a set of protein or DNA similarity search programs designed to explore all of the available sequence databases in Genbank.
FastA: compares a protein or DNA sequence to another sequence or a database.
HMMer: uses profile hidden Markov models (HMMs) for sensitive database searching based on statistical descriptions of a sequence family's consensus.
SAM: Sequence Alignment and Modeling system, based on HMM and Dirichlet mixtures
CLUSTAL-W: a general purpose multiple alignment program for DNA or proteins.
MiNK at Columbia Genome Center is part of the Columbia University Microarray Project.
Online Machine Learning Resources, maintained by the ML Group at the Austrian Research Institute for Artificial Intelligence (OFAI)
Bayesware Limited, knowledge discovery and data mining software based on Bayesian methods.
Software Packages for Graphical Models / Bayesian Networks, a list of software packages with a guide to their applications.
SVM-Light, an implementation of Support Vector Machine, developed by Thorsten Joachims at Cornell University.
svm: a simple software system for performing support vector machine binary classification, developed by William Noble Grundy.
Entrez Utilities, a series of programs that retrieve information from NCBI databases.
PubMed Central Home, a digital archive of life sciences journal literature.
A Distributed Annotation System, center of development of an Open Source system for exchanging annotations on genomic sequence data.
bioxml.org, a center of development for open source biological dtds.
DNA Structural Analysis Sequenced Genomes, a DNA Structural Atlas developed at Center for Biological Sequence Analysis, Technical University of Denmark.
euGenes, a common summary of gene and genomic information from eukaryotic organism databases.
The Phred/Phrap/Consed System, for sequence search and shotgun sequence assembly.
PHYLIP, free package of programs for inferring phylogenies.
ProtoMap - automatic hierarchical classification of proteins
Repeat Masker Server, screens DNA sequences for interspersed repeats and low complexity DNA sequences.