Proteomics: Net Profit from Integrating Interactomes

 

Mark Gerstein, Ning Lan & Ronald Jansen

Molecular Biophysics & Biochemistry Dept., Yale University, 266 Whitney Ave., New Haven, CT 06520.

Email: Mark.Gerstein@yale.edu

 

With genome sequences as intellectual inspiration and practical scaffolds, scientists are increasingly performing experiments on all genes in a genome. [HN1] The integration of the resulting genome-wide information into useful definitions of gene function constitutes a principal challenge for post-genomic biology.  Exactly what form such definitions will take is still an open question. Comprehensive networks of protein-protein interactions, or interactomes, provide a valuable way of circumscribing and partially defining gene function. [HN2]

 

On page ??? of this issue, Tong et al. (1) describe a systematic approach for identifying interaction networks for peptide-recognition domains.  They break new ground in the way they combine "orthogonal" datasets.  Specifically, they intersect two different interactomes.  The first is derived from screening phage-display peptide libraries to find consensus sequences that bind particular peptide-recognition domains.  The resulting network connects proteins containing recognition domains to those containing the consensus. It partially defines binding sites in some of the proteins and represents a novel use of phage-display technology. The second network comes from experimentally testing each peptide-recognition module for association with possible protein-binding partners using the two-hybrid approach. [HN3] Tong et al. apply their approach to determine interacting partners for SH3 domains in yeast. [HN4] These domains make good targets because of their prevalence and involvement in a number of important biological processes, from cytoskeleton reorganization to signal transduction.

 

The power of Tong et al.'s approach is manifest in interpreting large genomic datasets, particularly in reducing their noise.  One of the fallacies in dealing with genomic datasets is ascribing too much meaning to individual datapoints.  Many datasets (e.g. expression) contain so much noise that it can be difficult to draw reliable conclusions for specific genes. However, they still contain much useful statistical information in the form of broad trends.  To get at this, one must aggregate data.  First, and most simply, one can combine replicates of an experiment, but this does not remove systematic errors. Second, in somewhat more elaborate fashion, one can collect many individual measurements on different proteins into aggregate "proteomic classes" (e.g. functional categories) and compare these (2). [HN5] Finally, Tong et al.'s study points to perhaps the most powerful approach: interrelating and integrating orthogonal (i.e. fundamentally different) information.  In the abstract, one can easily demonstrate that combining independent datasets results in a lower overall error rate.  For instance, combining three independent binary-type datasets, each with an error rate of 10% (for both false positives and false negatives), reduces the overall error rate to 2.8% (5). [HN6] Moreover, interrelating two different types of whole-genome data also enables one to discover non-obvious and potentially significant relationships -- e.g. between expression and chromosomal positioning or subcellular localization (3).
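
The 2.8% figure can be reproduced with a short calculation. The sketch below assumes, for illustration only, that the three datasets are combined by simple majority vote and that their errors are independent. The function name majority_vote_error is hypothetical; the original estimate comes from the Bayesian network model described in hypernote 6, not from this code.

```python
from itertools import product


def majority_vote_error(error_rates):
    """Probability that a majority vote over independent binary datasets is wrong.

    error_rates: one error probability per dataset (assumed independent).
    """
    n = len(error_rates)
    total = 0.0
    # Enumerate every pattern of which datasets are wrong (1) or right (0).
    for pattern in product([0, 1], repeat=n):
        p = 1.0
        for wrong, e in zip(pattern, error_rates):
            p *= e if wrong else (1.0 - e)
        # The vote fails only when more than half of the datasets are wrong.
        if sum(pattern) > n / 2:
            total += p
    return total


combined = majority_vote_error([0.10, 0.10, 0.10])
print(f"{combined:.3f}")  # 0.028
```

The result is the sum of two cases: exactly two datasets wrong (3 x 0.1 x 0.1 x 0.9 = 0.027) and all three wrong (0.001).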

 

There have been a number of attempts at interrelating information from different genomic datasets before, much of it starting from expression experiments.  For instance, expression data were initially analyzed by a variety of supervised and unsupervised methods (e.g., hierarchical trees, k-means, self-organizing maps, and support-vector machines) and compared to functional categories (4). [HN5] They were also interrelated with transcription-factor binding sites, protein families, protein-protein interactions, and protein abundance (2,5,6) [HN7]. In a shorthand sense, much of this can be thought of as interrelating the transcriptome with other "omes" -- e.g. proteome, translatome, secretome, and interactome (2) [HN8].

 

There are considerably fewer examples of the synthesis of more than two types of whole-genome data [HN9].  One initial attempt combined expression correlations, phylogenetic profiles and patterns of domain fusion to predict protein function (7). A Bayesian framework integrated expression, essentiality, and sequence motif data for the prediction of protein subcellular localization (8). Tong et al.'s strategy of overlapping interactomes presents a new type of synthesis.  It is particularly effective in that their two datasets are orthogonal in many respects.  Phage-display is based on in vitro binding of short peptides, whereas two-hybrid uses in vivo binding and full-length proteins. Moreover, the phage-display network is computationally predicted but uses relatively unambiguous consensus sequences, whereas the two-hybrid one is experimentally derived but suffers from appreciable false positives (9).

 

From a data-mining standpoint, the heterogeneous character and variable quality of whole-genome information make integration tricky. Consider, in a general sense, combining "orthogonal" interactome datasets as attempted by Tong et al. How might one proceed formally? As indicated in the top figure, there are two extremes. At one extreme, the datasets have low false-negative but high false-positive error rates. That is, each experiment almost never misses real interactions but also finds many spurious ones. In this situation the benefit of integration comes from intersection: only interactions common to all datasets are accepted, thus lowering the combined error rate. Tong et al.'s approach fits this to some degree. At the other extreme are datasets with few false positives but low coverage of the space of interactions.  The benefit of integration then comes from the union: any interaction found in at least one dataset is accepted.  An earlier interactome analysis followed this to some degree (10). [HN10]
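
The two extremes can be made concrete with set operations. The sketch below uses invented toy data: the protein names, the links, and the helper names pair and evaluate are all hypothetical and are not taken from Tong et al. or any real dataset.

```python
def pair(a, b):
    """Represent an undirected interaction as an order-independent key."""
    return frozenset((a, b))


def evaluate(predicted, known):
    """Return (coverage, false-positive fraction) for a set of predicted links."""
    true_hits = predicted & known
    coverage = len(true_hits) / len(known)
    if predicted:
        fp_fraction = (len(predicted) - len(true_hits)) / len(predicted)
    else:
        fp_fraction = 0.0
    return coverage, fp_fraction


# Hypothetical gold standard of real interactions.
known = {pair("A", "B"), pair("B", "C"), pair("C", "D"), pair("D", "E")}

# Extreme 1: each dataset finds every real link plus its own spurious ones.
noisy_1 = known | {pair("A", "D"), pair("B", "E")}
noisy_2 = known | {pair("A", "E"), pair("C", "E")}
intersection = noisy_1 & noisy_2

# Extreme 2: each dataset is clean but covers only part of the network.
sparse_1 = {pair("A", "B"), pair("B", "C")}
sparse_2 = {pair("C", "D"), pair("D", "E")}
union = sparse_1 | sparse_2

print(evaluate(noisy_1, known), evaluate(intersection, known))
print(evaluate(sparse_1, known), evaluate(union, known))
```

In this idealized case the intersection removes all spurious links without losing coverage, and the union restores full coverage without adding errors. Real datasets would fall short of both.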

 

In most practical situations, the optimal way to integrate datasets lies somewhere between these extremes.  The task is to combine datasets with varying error rates and coverage. Accordingly, the rules for identifying positives become more complicated.  Instead of simple unions or intersections, different combinations of positive and negative signals from the datasets should be considered, taking into account their relative false-positive and false-negative rates.
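
One way such a rule might look is a naive Bayesian combination in which each dataset contributes evidence weighted by its own error rates. This is a minimal sketch, not the method of any cited study. It assumes the datasets are conditionally independent, and the dataset names, error rates, and prior are invented for illustration.

```python
import math


def log_likelihood_ratio(observed, fp_rate, fn_rate):
    """Log-evidence that one dataset contributes in favor of a real interaction.

    observed: True if the dataset reports the link, False if it does not.
    fp_rate:  assumed P(link reported | no real interaction).
    fn_rate:  assumed P(link not reported | real interaction).
    """
    if observed:
        return math.log((1.0 - fn_rate) / fp_rate)
    return math.log(fn_rate / (1.0 - fp_rate))


def posterior(observations, datasets, prior):
    """Posterior probability of an interaction, assuming independent datasets."""
    log_odds = math.log(prior / (1.0 - prior))
    for name, observed in observations.items():
        fp_rate, fn_rate = datasets[name]
        log_odds += log_likelihood_ratio(observed, fp_rate, fn_rate)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical (fp_rate, fn_rate) per dataset; not measured values.
datasets = {
    "dataset_a": (0.10, 0.10),
    "dataset_b": (0.30, 0.05),
    "dataset_c": (0.02, 0.60),
}

observations = {"dataset_a": True, "dataset_b": True, "dataset_c": False}
print(round(posterior(observations, datasets, prior=0.5), 3))
```

With equal error rates and a prior of 0.5, accepting links whose posterior exceeds 0.5 reduces to a majority vote. Unequal rates let a positive from a reliable dataset outweigh a negative from a low-coverage one.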

 

The bottom figure provides a practical illustration of the power of interrelating genomic data for yeast.  It shows the degree to which one can find protein-protein associations in known protein complexes (11) based on integrating, in stepwise fashion, increasing amounts of orthogonal genomic information.  We start by considering associations that can be found from expression correlations over the cell cycle; then we incorporate those derived from a second but different microarray experiment, giving responses to knockouts (12). Finally, we add associations predicted from genomic measurements of essentiality and localization (8,11,13). As we integrate more information, the total number of correctly identified interactions rises (especially for the union of the predicted associations) [HN11]. Simultaneously, the error rate decreases. Moreover, if we focus just on the intersection of the predicted associations, the error rate falls even more.
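
The first step, deriving candidate associations from expression correlations, might be sketched as follows. The gene names, expression profiles, and the 0.9 threshold are hypothetical, and the analysis behind the figure may have selected correlated pairs differently.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (neither may be constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def coexpression_links(profiles, threshold):
    """Predict an association for every gene pair correlated at or above threshold."""
    links = set()
    for g1, g2 in combinations(sorted(profiles), 2):
        if pearson(profiles[g1], profiles[g2]) >= threshold:
            links.add(frozenset((g1, g2)))
    return links


# Hypothetical expression timecourses (arbitrary units).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 2.0, 1.0],
    "geneB": [2.0, 4.1, 6.0, 3.9, 2.0],
    "geneC": [3.0, 2.0, 1.0, 2.0, 3.0],
}

links = coexpression_links(profiles, threshold=0.9)
print(sorted(sorted(link) for link in links))
```

Link sets produced this way from different experiments can then be combined by union or intersection and scored against the known complexes.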

 

A major future challenge will be devising uniform frameworks for integrating information, both from high-throughput and traditional biochemical approaches. One aspect of this will be developing better databases for storing and querying heterogeneous information. In particular, databases will need to be more precise in their treatment of errors and also interface better with journals. Another aspect will be developing data-mining approaches to operate on these databases, integrating many different genomic features into results pertinent to biological function. Genomic features can be of very different character (from hundreds of "Booleans" for interactions to tens of thousands of real-number vectors for expression timecourses), and a central issue in integration is determining how to weight each feature relative to the others.  In this regard, some machine-learning techniques, such as Bayesian networks and decision trees, are quite powerful, while others (e.g. support-vector machines) are more problematic.

 

Finally, we will also need to come up with a more systematic definition of gene function, the ultimate aim of proteomic investigation.  To many scientists, what constitutes "function" is a natural-language phrase or name, often in non-systematic terminology -- e.g. "ATPase" or "suppressor of white apricot" [HN12].  This approach is sufficient for single-molecule work but does not scale to the genomic level.  More systematic attempts have been made to place proteins within a hierarchy of standard functional categories or to connect them into overlapping networks of varying types of association (11,14) [HN13].  These networks can obviously include protein-protein interactions, the subject of Tong et al.'s work. More broadly, they can include pathways, regulatory systems, and signaling cascades. How far are we able to go with this approach?  Perhaps in the future the systematic combination of networks may provide for a truly rigorous definition of protein function [HN14].

 

 

Overlapping Nets.

(Top) Two different extremes in integrating interactomes.  On the left, the combined network is the union of those with low false-positive but high false-negative rates, whereas the combined network on the right is the intersection of ones with low false-negative but high false-positive rates.  Circles represent proteins; links, interactions; and dotted lines, known associations. (Thicker links indicate lower false-positive rates.) More effective rules for combining networks than union and intersection take into account the different error rates associated with each link type. (Bottom) How integrating progressively more orthogonal information identifies more associations (5). From the known complexes in yeast there are 8,250 protein-protein associations (11). The y-axis shows the percentage of these identified by disparate genomic data (i.e. coverage). The x-axis shows the progressive addition of genomic data. The first two bars represent the protein associations with the most significant expression correlation in two different microarray sets (12). The next two represent adding the associations predicted because both proteins were similarly essential for cell survival or had similar localization (8,13). The shading on the bars roughly indicates false-positive rates throughout the integration. While it is reasonable that associated components of complexes will have correlated expression and similar localization and "essentiality", this is only weakly predictive, generating many spurious positives. Consequently, the "weak-links" case on the top right applies, and one can see from the shading how intersection lowers the error rate.

 

References

 

1. A. Tong et al., Science Vol(?), page(?) (2002). (With interactions available from www.binddb.org.)

2. D. Greenbaum et al., Genome Res 11, 1463 (2001). R. Jansen, M. Gerstein, Nucleic Acids Res 28, 1481 (2000).

3. B. A. Cohen et al., Nat Genet 26, 183 (2000). A. Drawid et al., Trends Genet 16, 426 (2000).

4. P. Tamayo et al., Proc Natl Acad Sci U S A 96, 2907 (1999). M. B. Eisen et al., Proc Natl Acad Sci U S A 95, 14863 (1998). S. Tavazoie et al., Nat Genet 22, 281 (1999). M. Gerstein, R. Jansen, Curr Opin Struct Biol 10, 574 (2000). M. Brown et al., Proc Natl Acad Sci U S A 97, 262 (2000).

5. J. Qian et al., J Mol Biol 314, 1053 (2001). R. Jansen et al., Genome Res 12, 37 (2002). (These contain details of the approach in the figure and calculations, with more at genecensus.org/integrate/interactions.)

6. H. Ge et al., Nat Genet 29, 482 (2001). F. Roth et al., Nat Biotechnol 16, 939 (1998). S. Gygi et al., Mol Cell Biol 19, 1720 (1999). B. Futcher et al., Mol Cell Biol 19, 7357 (1999). A. Brazma et al., Genome Res 8, 1202 (1998).

7. E. Marcotte et al., Science 285, 751 (1999). E. Marcotte et al., Nature 402, 83 (1999).

8. A. Drawid, M. Gerstein, J Mol Biol 301, 1059 (2000).

9. T. Ito et al., Proc Natl Acad Sci U S A 98, 4569 (2001). P. Uetz et al., Nature 403, 623 (2000).

10. B. Schwikowski et al., Nat Biotechnol 18, 1257 (2000).

11. H. W. Mewes et al., Nucleic Acids Res 28, 37-40 (2000).

12. R. J. Cho et al., Mol Cell 2, 65-73 (1998). T. R. Hughes et al., Cell 102, 109-26 (2000).

13. E. A. Winzeler et al., Science 285, 901-6 (1999). P. Ross-Macdonald et al., Nature 402, 413-8 (1999).

14. D. Eisenberg et al., Nature 405, 823 (2000). M. Ashburner et al., Nat Genet 25, 25 (2000).

 

General Hypernotes

Focusing on Functional genomics

Michael B. Eisen at Berkeley, microarray analysis.

Church Lab at Harvard University, quantitative whole genome and proteome measures in computational modeling of regulatory and enzymatic networks.

Michael Bittner at NHGRI, hybridization-based analytical tools.

Tim Hughes at the University of Toronto, microarray expression analysis.

Lincoln Stein Laboratory at Cold Spring Harbor Laboratory, various software for computational biology and genome bioinformatics.

Focusing on Structural genomics

J. Andrew McCammon at UCSD, theoretical chemistry and biochemistry.

Benoit Roux at Cornell, structure and function of ion channels.

Harel Weinstein at Mount Sinai School of Medicine, molecular recognition and signal transduction.

Focusing on the Development of Computational algorithms

Mark Borodovsky at Georgia Institute of Technology, machine learning algorithms and biomolecular structure-function relationships

B. S. Weir, Director of the Bioinformatics Research Center at North Carolina State University, statistical methodology for the interpretation of genetic data.

Gary Churchill at the Jackson Laboratory, Statistical Genetics.

Imran Shah at Department of Computer Science & Engineering, University of Colorado at Denver, bioinformatic methods for integrating complex biological data.

Michael A. Newton, Department of Statistics and Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, empirical Bayesian methods in expression analysis.

Steven Salzberg, director of Bioinformatics, The Institute for Genomic Research, microbial genomics and machine learning.

Casimir Kulikowski at Department of Computer Science, Rutgers, knowledge-based problem solving methods, clustering, and predictive data mining for bioinformatics.

Mona Singh at MIT, machine learning and algorithms in computational biology.

Tamar Schlick at MIT, Mathematical Biology.

Kevin Karplus at the UCSC compbio group, computational biology.

The Brutlag Bioinformatics Group at Biochemistry Department, Stanford University, prediction of structure and function from primary sequence.

Steven Benner at University of Florida, bioinformatics tools.

Chris Burge at MIT, RNA splicing.

Gary Stormo at Department of Genetics at Washington University School of Medicine, protein DNA interactions, RNA structure prediction

Research centers

The Program in Computational Biology at Johns Hopkins University.

Eric Lander, Genome Center Director at Whitehead Institute

George Weinstock, co-director of Human Genome Sequencing Center at Baylor College of Medicine, functional genomics and informatics.

Genomics Resources (particularly for yeast)

Courses and educational resources relevant to Bioinformatics & Genomics

General

Computational genomics

Bioinformatics tools

Sequence search and alignment

Data Mining Tools

Other

Numbered Hypernotes

  1. As of December 2001, the GenBank repository of nucleic acid sequences contained over 800 whole genome sequences. Structural Genomics Initiatives aim at large-scale determination of protein structures. Functional genomics aims to elucidate the function of all the proteins in a genome, typically through mRNA expression, protein interaction assays, gene disruption or proteome microarrays.
  2. The Biomolecular Interaction Network Database (BIND) is a database designed to store full descriptions of interactions, molecular complexes and pathways. As of December 2001, it stored records of 5940 interactions, 54 complexes and 7 pathways. The C. elegans interactome project led by Marc Vidal at Harvard University aims to generate a comprehensive protein-protein interaction map for C. elegans. AxCell's ProChart(TM) database is currently one of the most comprehensive databases of human protein interactions. The Proteomic Pathway Project at BIOCARTA provides dynamic graphic models for gene interaction. The Kohn Molecular Interaction Maps, designed by Kurt W. Kohn, are molecular interaction maps of the mammalian cell cycle control and DNA repair systems and are amenable to further development.
  3. The yeast two-hybrid (Y2H) system was first established by Fields and Song (1989). Brent's lab at Harvard University has developed a variant of the system, and Elledge's lab at Baylor College of Medicine has contributed substantially to this technology.
  4. The SH3 domain is one of the best-characterized members of the growing family of protein-interaction modules. The basic fold contains five anti-parallel β-strands packed to form two perpendicular β-sheets. It is catalogued as entry IPR001452 in InterPro.
  5. Standard methods for expression data include hierarchical clustering, k-means clustering, self-organizing maps and principal component analysis. Inge Jonassen's web site provides references on these methods. Expression analysis software has been developed using various combinations of these methods, such as Cluster and TreeView, EPCLUS, CLEAVER, etc. One new combination method is local clustering, which finds time-shifted and inverted relationships. Expression databases include GEO (NCBI Gene Expression Omnibus), ExpressDB (Harvard), GeneX (NCGR), Stanford Microarray Database and ArrayExpress (EBI). A comprehensive list of links to expression databases and software is provided at the European Bioinformatics Institute (EBI).
  6. This estimate is built from a simple Bayesian network model of three binary-type data sources, with random noise. An example of binary-type data is protein-protein interactions ("YES, there is an interaction" or "NO, there isn't"). We give a detailed discussion of the calculation, with a spreadsheet, at genecensus.org/integrate/interactions.
  7. There have been efforts to correlate expression data with transcriptional regulation, functional categories, protein folds and families, and protein abundance; to map between transcriptome and interactome; to integrate sequence and expression data to predict subcellular localization; and to combine whole-genome data to infer functional linkages of proteins. GenMAPP at UCSF was designed to visualize gene expression data on maps representing biological pathways. Biomolecular Relations in Information Transmission and Expression (BRITE) is a database of binary relations involving genes and proteins, including interactions that underlie the KEGG pathway diagrams, protein-protein interactions by Y2H systems, sequence similarity relations by SSEARCH and expression similarity relations by microarray gene expression profiles.
  8. A myriad of terms suffixed with "ome" have been coined to define the varied cellular populations and subpopulations, ranging from the most well-known, Genome, which first appeared in PubMed in 1932, to newly coined terms that have yet to gain wide recognition, such as Foldome or Transportome.
  9. Eisenberg et al. defined functional protein networks combining experimentally determined interactions from the DIP with those computationally predicted from protein phylogenetic profiles and expression data. The union of various datasets was counted, while higher reliability was given to interactions predicted by two or more methods. The Allfuse Database developed by the Computational Genomics Group at EBI aims to find functional associations of proteins in complete genomes through gene fusion.
  10. Schwikowski et al. obtained a single large network from the union of recorded interactions in MIPS and the Database of Interacting Proteins (DIP) and data from Y2H experiments carried out by Ito et al. and Uetz et al., and thereby computed a map of interactions between functional groups.
  11. Further details of this graph are available at genecensus.org/integrate/interactions (alternate site). Four datasets were integrated: (i) Yeast Cell Cycle Data (timecourse expression data); (ii) Rosetta knockout data at ExpressDB (non-timecourse expression data); (iii) essentiality data, which come from MIPS catalogs and (in conditional form) from Michael Snyder's phenotype-profiling transposon experiments; (iv) localization data, which come from a localization website (set #4) that merges data from MIPS, Swissprot, and the Snyder lab. YPD also has extensive localization data.
  12. The names given to genes by their discoverers may not be related to their function and can sometimes be misleading.
  13. Hierarchical systems to represent protein function for single organisms include the MIPS functional classification catalog (yeast), GenProtEC (E. coli), FlyBase (Drosophila) and EGAD (human ESTs). Gene Ontology is a common source in which functional classifications for multiple organisms are merged. The Enzyme system classifies enzyme function. MetaCyc: Metabolic Encyclopedia, WIT and KEGG describe gene function in terms of metabolic pathways and reactions. Protein function can also be inferred from interaction maps available at DIP, the Pronet Database and the Biomolecular Interaction Network Database.
  14. Many groups are currently working on inferring and defining protein function. Some representative sites: Steven E. Brenner at Berkeley; Janet Thornton at UCL, a pioneer in structure-function relationships; an integrated function site under construction at Yale; Barry Honig at Columbia; Andrey Rzhetsky and Bill Noble (with Paul Pavlidis) at the Columbia Genome Center; David Eisenberg at UCLA; and Eugene V. Koonin at the NCBI.