Science -- Jansen et al. 302 (5644): 449

		This article appears in the following Subject Collections: Cell Biology

A Bayesian Networks Approach for Predicting Protein-Protein Interactions from Genomic Data

Ronald Jansen,¹* Haiyuan Yu,¹ Dov Greenbaum,¹ Yuval Kluger,¹ Nevan J. Krogan,⁴ Sambath Chung,¹,² Andrew Emili,⁴ Michael Snyder,² Jack F. Greenblatt,⁴ Mark Gerstein¹,³

We have developed an approach using Bayesian networks to predict protein-protein interactions genome-wide in yeast. Our method naturally weights and combines into reliable predictions genomic features only weakly associated with interaction (e.g., messenger RNA coexpression, coessentiality, and colocalization). In addition to de novo predictions, it can integrate often noisy, experimental interaction data sets. We observe that at given levels of sensitivity, our predictions are more accurate than the existing high-throughput experimental data sets. We validate our predictions with TAP (tandem affinity purification) tagging experiments. Our analysis, which gives a comprehensive view of yeast interactions, is available at genecensus.org/intint.

¹ Department of Molecular Biophysics and Biochemistry, Yale University, 266 Whitney Avenue, Post Office Box 208114, New Haven, CT 06520, USA.
² Department of Molecular, Cellular and Developmental Biology, Yale University, 266 Whitney Avenue, Post Office Box 208114, New Haven, CT 06520, USA.
³ Department of Computer Science, Yale University, 266 Whitney Avenue, Post Office Box 208114, New Haven, CT 06520, USA.
⁴ Banting and Best Department of Medical Research, Department of Molecular and Medical Research, University of Toronto, Toronto, M5G 1L6, Ontario, Canada.

* Present address: Computational Biology Center, Memorial Sloan-Kettering Cancer Center, 307 West 63rd Street, New York, NY 10021, USA.

To whom correspondence should be addressed. E-mail: mark.gerstein@yale.edu

Many fundamental biological processes involve protein-protein interactions, and comprehensively identifying them is important to systematically defining their cellular role. New experimental and computational methods have vastly increased the number of known or putative interactions, cataloged in databases (1–7). Much genomic information also relates to interactions indirectly: Interacting proteins are often significantly coexpressed (as shown by microarrays) and colocalized to the same subcellular compartment (8, 9).

Unfortunately, interaction data sets are often incomplete and contradictory (10–12). In the context of genome-wide analyses, these inaccuracies are greatly magnified because the protein pairs that do not interact (negatives) far outnumber those that do (positives). For instance, in yeast, the 6000 proteins allow for 18 million potential interactions, but the estimated number of actual interactions is <100,000 (10, 13, 14). Thus, even reliable techniques can generate many false positives when applied genome-wide. This is similar to a diagnostic with a 1% false-positive rate for a rare disease occurring in 0.1% of the population, which would roughly produce one true positive for every 10 false ones. Further information is necessary.
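The base-rate arithmetic behind this analogy can be checked directly. The sketch below is illustrative only: the variable names are ours, and it assumes a diagnostic that detects every true case, which the text does not specify.

```python
from math import comb

# Number of unordered protein pairs among roughly 6000 yeast proteins.
n_proteins = 6000
n_pairs = comb(n_proteins, 2)  # about 18 million

# Diagnostic analogy: disease prevalence 0.1%, false-positive rate 1%.
# Assumption (not stated in the text): the test detects every true case.
prevalence = 0.001
false_positive_rate = 0.01
sensitivity = 1.0

true_pos = prevalence * sensitivity
false_pos = (1 - prevalence) * false_positive_rate
tp_per_fp = true_pos / false_pos  # close to 0.1: one true positive per ten false

print(f"{n_pairs:,} possible pairs; TP/FP = {tp_per_fp:.3f}")
```

With a less sensitive test the ratio would be lower still, so the one-in-ten figure is an upper bound under these assumptions.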

Consequently, when evaluating protein-protein interactions, one needs to integrate evidence from many different sources (15–17). Here, we propose a Bayesian approach for integrating interaction information that allows for the probabilistic combination of multiple data sets and demonstrate its application to yeast (18). Our approach can be used for combining noisy interaction data sets and for predicting interactions de novo, from other genomic information. The basic idea is to assess each source of evidence for interactions by comparing it against samples of known positives and negatives ("gold-standards"), yielding a statistical reliability. Then, extrapolating genome-wide, we predict the chance of possible interactions for every protein pair by combining each independent evidence source according to its reliability. We verified our predictions by comparing them against existing experimental interaction data (not in the gold-standard) as well as new TAP (tandem affinity purification) tagging experiments.

Among the many possible machine-learning approaches that could be applied to predicting interactions (ranging from simple unions and intersections of data sets to neural networks, decision trees, and support-vector machines), Bayesian networks have several advantages (19): They allow for combining highly dissimilar types of data (i.e., numerical and categorical), converting them to a common probabilistic framework, without unnecessary simplification; they readily accommodate missing data; and they naturally weight each information source according to its reliability. In contrast to "black-box" predictors, Bayesian networks are readily interpretable as they represent conditional probability relationships among information sources.

The gold-standard data set on which we train ("parameterize") the Bayesian network should ideally be (i) independent from the data sources serving as evidence, (ii) sufficiently large for reliable statistics, and (iii) free of systematic bias. We used the MIPS (Munich Information Center for Protein Sequences) complexes catalog as the gold-standard for positives (6). This hand-curated list of protein complexes is based on the literature [8250 pairs in our filtered version (19)]. A negatives gold-standard is harder to define, but essential for successful training. Thus, we synthesized negatives from lists of proteins in separate subcellular compartments (9). Our positive and negative gold-standards satisfy the first two criteria and provide a good practical solution for the third. Hence, our goal, precisely defined, was to predict whether two proteins are in the same complex, not whether they necessarily had direct physical contact.
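One way to picture the synthesis of negatives is to pair proteins whose annotated compartments differ. The sketch below is a minimal illustration under that reading; the localization table, protein names, and the function `synthesize_negatives` are hypothetical, and it assumes each protein has exactly one compartment, which real localization data need not satisfy.

```python
from itertools import combinations

# Hypothetical localization table (protein -> compartment); toy data only.
localization = {
    "P1": "nucleus",
    "P2": "nucleus",
    "P3": "mitochondrion",
    "P4": "cytoplasm",
}

def synthesize_negatives(localization):
    """Return protein pairs whose members lie in different compartments."""
    return [
        (a, b)
        for a, b in combinations(sorted(localization), 2)
        if localization[a] != localization[b]
    ]

negatives = synthesize_negatives(localization)
```

Pairs within the same compartment (here P1 and P2) are simply left unlabeled rather than treated as positives.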

As a measure of reliability, the overlap of information sources (i.e., "interaction data sets," which could either be noisy experimental data or sets of genomic features) with the gold-standards can be expressed in terms of a "likelihood ratio." For example, consider a genomic feature f expressed in binary terms (i.e., "present" or "absent"). The likelihood ratio L(f) is then defined as the fraction of gold-standard positives having feature f divided by the fraction of negatives having f. For two features f₁ and f₂ with uncorrelated evidence, the likelihood ratio of the combined evidence is simply the product L(f₁, f₂) = L(f₁)L(f₂). For correlated evidence, L(f₁, f₂) cannot be factorized in this way. Bayesian networks are a formal representation of such relationships between features. The combined likelihood ratio is proportional to the estimated odds that two proteins are in the same complex, given multiple sources of information.
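The definition of L(f) and the product rule for uncorrelated features translate into a few lines of code. This is a sketch with invented counts; the function name `likelihood_ratio` and all numbers are ours, chosen only to make the arithmetic easy to follow.

```python
def likelihood_ratio(pos_with_f, n_pos, neg_with_f, n_neg):
    """L(f) = fraction of positives with f / fraction of negatives with f."""
    return (pos_with_f / n_pos) / (neg_with_f / n_neg)

# Toy gold-standard counts, invented for illustration.
L_f1 = likelihood_ratio(pos_with_f=400, n_pos=1000, neg_with_f=200, n_neg=10000)
L_f2 = likelihood_ratio(pos_with_f=300, n_pos=1000, neg_with_f=500, n_neg=10000)

# For uncorrelated evidence the combined ratio is the product of the two.
L_combined = L_f1 * L_f2

print(f"L(f1) = {L_f1:.1f}, L(f2) = {L_f2:.1f}, combined = {L_combined:.1f}")
```

Two features that are individually modest (ratios of about 20 and 6 here) combine into a much stronger one, which is the effect the method relies on.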

We predict a protein pair as positive if its combined likelihood ratio exceeds a particular cutoff (L > L_cut) (negative otherwise). To get an overall assessment of how the prediction performs, we segmented the gold-standard into separate training and testing sets (using a sevenfold cross-validation protocol). Then we evaluated the number of true- (TP) and false-positive (FP) predictions in the testing set. Finally, we applied the Bayesian network beyond the testing set, computing likelihood ratios for all possible protein pairs in the genome.
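The evaluation step can be sketched as follows. The code only shows how a gold-standard might be dealt into seven folds and how TP and FP are counted on one held-out fold; training on the other six folds is omitted, and the helper names, the shuffling scheme, and the toy scores are assumptions of ours rather than the protocol in the supporting material.

```python
import random

def k_fold_indices(n_items, k=7, seed=0):
    """Shuffle item indices and deal them into k roughly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def count_tp_fp(scored_pairs, l_cut):
    """Count true and false positives among (likelihood_ratio, is_positive) pairs.

    A pair is predicted positive when its likelihood ratio exceeds l_cut.
    """
    tp = sum(1 for lr, is_pos in scored_pairs if lr > l_cut and is_pos)
    fp = sum(1 for lr, is_pos in scored_pairs if lr > l_cut and not is_pos)
    return tp, fp

# Toy held-out fold: (likelihood ratio, gold-standard label).
test_fold = [(900.0, True), (750.0, True), (650.0, False), (40.0, True), (2.0, False)]
tp, fp = count_tp_fp(test_fold, l_cut=600)
folds = k_fold_indices(n_items=70, k=7)
```

Summing the TP and FP counts over all seven held-out folds would give one overall TP/FP value per cutoff.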

Figure 1 schematically shows the information sources and results of our calculations. We term the results "probabilistic interactomes" (PIs), in which each protein pair is associated with a probability measure for being in the same complex (i.e., likelihood ratio L). Our procedure not only allows combining existing experimental interaction data sets (resulting in a PI-experimental or "PIE"), but also the de novo prediction of protein complexes from genomic data sets (when the input data are not interaction data sets per se, resulting in a PI-predicted or "PIP").

Fig. 1. The information sources integrated in our analysis and their comparison with each other. (A) The three different types of data used: (i) Interaction data from high-throughput experiments. These comprise large-scale two-hybrid screens (Y2H) (1, 2) and in vivo pull-down experiments (3, 4). (ii) Other genomic features. We considered expression data, biological function of proteins (from Gene Ontology biological process and the MIPS functional catalog), and data about whether proteins are essential (6, 19–22). (iii) Gold-standards of known interactions and noninteracting protein pairs. (The MIPS functional catalog differs from the MIPS complexes catalog used for the gold-standard.) (B) Combination of data sets into probabilistic interactomes. (C) Comparison of the probabilistic interactomes with the gold-standards and our new experimental data. Numbers next to the arrows indicate which figures refer to these various comparisons.

We combined four interaction data sets from high-throughput experiments into the PIE (1–4) (Fig. 1B). The PIE represents a transformation of the individual binary-valued interaction sets into a data set where every protein pair is weighted according to the likelihood that it exists within a complex.

We computed the PIP from several genomic data sources: the correlation of mRNA amounts in two expression data sets (one with temporal profiles during the cell cycle, one of expression levels under 300 cellular conditions), two sets of information on biological function, and information about whether proteins are essential for survival (6, 20–22). Although none of these information sources are interaction data per se, they contain information weakly associated with interaction: Two subunits of the same protein complex often have coregulated mRNA expression and similar biological functions and are more likely to be both essential or nonessential (8).
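To make the pair-level features concrete, the sketch below computes one numerical feature (a correlation between two expression profiles) and one categorical feature (joint essentiality). The text does not name the correlation measure, so the use of the Pearson coefficient is an assumption, as are the function names, the class labels, and the toy profiles.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def essentiality_class(essential_a, essential_b):
    """Categorical pair feature: 'EE', 'NN', or 'EN' (mixed)."""
    if essential_a and essential_b:
        return "EE"
    if not essential_a and not essential_b:
        return "NN"
    return "EN"

# Toy expression profiles for two proteins (invented values).
profile_a = [1.0, 2.0, 3.0, 4.0]
profile_b = [2.1, 3.9, 6.2, 7.8]
r = pearson(profile_a, profile_b)
```

A numerical feature such as `r` would still need to be binned before a likelihood ratio can be estimated for each bin; the binning scheme is not described in this text.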

For computing the PIE and the PIP, we used two different types of Bayesian networks: a "naïve" network for the PIP and a fully connected one for the PIE (19). The naïve network is simpler to compute but requires information sources with essentially uncorrelated evidence. In contrast, the fully connected Bayesian network accommodates correlated evidence, which is the case for the four experimental interaction data sets.
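For binary data sets, one reading of the fully connected case is that a likelihood ratio is estimated for every joint combination of observations instead of multiplying per-data-set ratios. The sketch below illustrates that idea with two toy data sets; the function name, the handling of states with no negatives, and all counts are assumptions, not the procedure from the supporting material.

```python
from collections import Counter

def joint_likelihood_ratios(pos_states, neg_states):
    """Likelihood ratio for each joint state of several binary data sets.

    Each state is a tuple such as (1, 0): one entry per data set, 1 if the
    pair was reported by that data set. Estimating one ratio per joint
    state makes no independence assumption between data sets.
    """
    pos_counts, neg_counts = Counter(pos_states), Counter(neg_states)
    n_pos, n_neg = len(pos_states), len(neg_states)
    ratios = {}
    for state in pos_counts.keys() | neg_counts.keys():
        if neg_counts[state] == 0:
            continue  # ratio undefined without negatives; skipped in this sketch
        ratios[state] = (pos_counts[state] / n_pos) / (neg_counts[state] / n_neg)
    return ratios

# Toy gold-standard observations for two data sets (invented).
pos_states = [(1, 1)] * 6 + [(1, 0)] * 2 + [(0, 0)] * 2
neg_states = [(1, 1)] * 1 + [(1, 0)] * 9 + [(0, 0)] * 90
ratios = joint_likelihood_ratios(pos_states, neg_states)
```

With four binary data sets there are 16 joint states, which is small enough to estimate directly; the number of states doubles with each added data set, which is why the naïve factorization is attractive when it is justified.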

Finally, we combined the PIP, PIE, and gold-standard into a total PI (PIT), which represents our most comprehensive view of the known and putative protein complexes in yeast (23). Because the PIP and PIE data provide essentially uncorrelated evidence for protein-protein interactions, we chose a naïve network to construct the PIT.

Figure 1C gives an overview of how we compared the PIP, PIE, gold-standard, and our new experiments. In particular, Fig. 2 shows the performance of the integration resulting in the PIP and PIE. When tested against the gold-standard, we observed that the ratio of true to false positives (TP/FP) increases monotonically with L_cut, confirming L as an appropriate measure of the odds of a real interaction. Conservatively estimated, protein pairs with L > 600 have a better than 50% chance of being in the same complex, suggesting L_cut = 600 as a useful threshold (19). Unless otherwise noted, we use this throughout our analysis. It gives 9897 predicted interactions from the PIP and 163 from the PIE. In contrast, likelihood ratios derived from single genomic features (e.g., mRNA coexpression) or from individual interaction experiments (e.g., the Ho data set) did not exceed the cutoff when used alone, with TP/FP values far below 1. This demonstrates that information sources that, taken alone, are only weak predictors of interactions can yield reliable predictions when combined.
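The link between a likelihood-ratio cutoff and a "better than 50% chance" follows from Bayes' rule: posterior odds equal prior odds times L. The sketch below works through that conversion. The prior odds of 1/600 are an assumed value, picked here so that L = 600 is the break-even point; the estimate actually used is described in the supporting material (19), not in this text.

```python
def posterior_probability(likelihood_ratio, prior_odds):
    """Convert a likelihood ratio to a posterior probability via Bayes' rule."""
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)

# Assumed prior odds of interaction, for illustration only.
ASSUMED_PRIOR_ODDS = 1 / 600

p_at_cut = posterior_probability(600, ASSUMED_PRIOR_ODDS)   # break-even, about 0.5
p_above = posterior_probability(6000, ASSUMED_PRIOR_ODDS)   # well above 0.5
```

Under this assumption any pair with L above 600 has posterior odds above 1, that is, a probability above one half.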

Fig. 2. Comparison of PIP and PIE with each other and with the individual information sources. (A) The TP/FP ratio as a function of L_cut for the PIP and the individual data from which it was computed. The ratio is computed as follows:

$$\frac{TP}{FP}(L_{cut}) = \frac{\sum_{L > L_{cut}} \mathrm{pos}(L)}{\sum_{L > L_{cut}} \mathrm{neg}(L)}$$

where pos(L) and neg(L) are the number of positives and negatives in the gold-standard with a given likelihood ratio L. The vertical line indicates our standard threshold L_cut = 600. (B) The same plot as in (A), but for the PIE. (C) Comparison of TP/FP ratios between the PIP and PIE. The abscissa represents the sensitivity of the probabilistic interactomes. The gray area indicates the gain of sensitivity of the PIP over the PIE for equal TP/FP ratios. The arrow shows the difference in sensitivity at TP/FP = 0.3. At this level, the PIP contains 183,295 protein pairs, of which 6179 are gold-standard positives (75% sensitivity), whereas the PIE contains 31,511 protein pairs and 1758 gold-standard positives among these (21% sensitivity). This difference in sensitivity between PIE and PIP illustrates the value of the de novo prediction. It also reflects, to some degree, that the experiments were done only on subsets of the genome and may have been measuring different types of interactions than the complexes' gold-standard, which we used to parameterize the PIP. The white circles show the performance of a voting procedure in which each of the four genomic features (from which we computed the PIP) contributed an additive vote. There are four possible outcomes in the additive voting procedure, depending on how many data sets contribute a positive vote (19).

The PIP had a higher sensitivity than the PIE for comparable TP/FP ratios (Fig. 2C). ("Sensitivity" measures coverage and is defined as TP/P, where P is the number of gold-standard positives.) Specifically, the sensitivity of the PIP is 27% at our cutoff. This may seem low, but compares favorably with the PIE, which had a sensitivity of less than 1%. This means that we can predict, at comparable error levels, more complex interactions de novo than are present in the high-throughput experimental interaction data sets.
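As a quick consistency check, the sensitivities quoted in the Fig. 2 legend at TP/FP = 0.3 can be recomputed from the counts given there. The sketch assumes that P is the 8250 filtered gold-standard positive pairs mentioned earlier; the function name is ours.

```python
def sensitivity(true_positives, gold_positives):
    """Sensitivity (coverage) = TP / P."""
    return true_positives / gold_positives

GOLD_POSITIVES = 8250  # filtered gold-standard positive pairs quoted in the text

# Gold-standard positives recovered at TP/FP = 0.3, as quoted in the Fig. 2 legend.
pip_sensitivity = sensitivity(6179, GOLD_POSITIVES)
pie_sensitivity = sensitivity(1758, GOLD_POSITIVES)

print(f"PIP: {pip_sensitivity:.0%}, PIE: {pie_sensitivity:.0%}")
```

Both values round to the percentages stated in the legend (75% and 21%), which supports reading P as 8250.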

One might ask whether simpler voting procedures can match the performance of more complicated machine-learning methods such as Bayesian networks. To test this hypothesis, we compared the PIP with a voting procedure where each of the four genomic features contributes an additive vote toward positive classification. We found that the Bayesian network achieved greater sensitivity for comparable TP/FP ratios (Fig. 2C) (19).
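A plausible reading of the additive voting baseline is that each feature casts a yes or no vote and a pair is called positive when enough features agree. The sketch below follows that reading; the exact rule is given in the supporting material (19), so the function name, the thresholds, and the example votes are assumptions.

```python
def vote_predict(feature_votes, min_votes):
    """Predict positive when at least min_votes binary features vote yes."""
    return sum(feature_votes) >= min_votes

# Four binary votes for one hypothetical protein pair, one per genomic feature.
votes = (1, 1, 0, 1)

# Each threshold from 1 to 4 gives one operating point of the voting classifier.
predictions = {k: vote_predict(votes, k) for k in range(1, 5)}
```

Unlike the likelihood-ratio combination, every vote counts equally here regardless of how reliable its feature is, which is one reason such a scheme might be expected to underperform.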

Figure 3 shows parts of the PIP and PIE graphs and how these compare with the gold-standard and our new experiments. First, to test whether the thresholded PIP was biased toward certain complexes, we looked at the distribution of predictions among gold-standard positives (Fig. 3A); they were roughly equally apportioned among the different complexes, suggesting a lack of bias.

Fig. 3. Representations of the thresholded PIP (de novo prediction) compared with different data sets. (A) The complete set of gold-standard positives and their overlap with the PIP. The PIP (green) covers 27% of the gold-standard positives (yellow). (B) A graph of the largest complexes in the PIP, i.e., only those proteins in the thresholded PIP having ≥20 links. (Left) Overlapping gold-standard positives are shown in green, PIE links in blue, and overlaps with both the PIE and gold-standard positives in black. (Right) Overlapping gold-standard negatives are shown in red. Regions with many red links indicate potential false-positive predictions. (C) Three PIP complexes that we partially verified by TAP-tagging. Each complex contains the proteins linked to a central protein (gray) after thresholding the PIP at L_cut = 300. Interactions verified by our TAP-tagging are shown in dark blue and PIE links in light blue; gray links indicate where TAP-tagging overlapped with PIE links.

We have thus far treated all interactions as independent. However, the joint distribution of interactions in the PIs can help identify large complexes: An ideal complex should be a "clique" in an interaction graph (i.e., a subgraph with N(N – 1)/2 links between N proteins). Although this rarely happens in practice, because of incorrect or missing links, large complexes tend to have many interconnections within them, whereas false-positive links to outside proteins tend to occur randomly, without a coherent pattern (Fig. 4).
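How close a group of proteins comes to a clique can be expressed as the fraction of the N(N – 1)/2 possible links that are actually present. The sketch below computes that fraction for a toy graph; the measure, the function name `link_density`, and the example links are illustrative choices of ours and are not taken from the analysis itself.

```python
from itertools import combinations

def link_density(proteins, links):
    """Fraction of the N(N - 1)/2 possible links present among the proteins."""
    n = len(proteins)
    possible = n * (n - 1) // 2
    if possible == 0:
        return 0.0
    present = sum(
        1 for a, b in combinations(sorted(proteins), 2)
        if frozenset((a, b)) in links
    )
    return present / possible

# Toy undirected graph; links stored as frozensets so order does not matter.
links = {frozenset(p) for p in [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]}

clique_density = link_density({"A", "B", "C"}, links)       # all 3 possible links present
loose_density = link_density({"A", "B", "C", "D"}, links)   # 4 of 6 possible links present
```

A density of 1 corresponds to an ideal clique; adding the loosely attached protein D lowers the density, mirroring the incoherent pattern expected of false-positive links.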

Fig. 4. TP/FP for subsets of the thresholded PIP that only include proteins with a minimum number of links. Requiring a minimum number of links isolates large complexes in the thresholded PIP graph (Fig. 3B). Increasing the minimum number of links raises TP/FP by preserving the interactions among proteins in large complexes, while filtering out false-positive interactions with heterogeneous groups of proteins outside the complexes.
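The minimum-link filter described in this legend can be sketched as a degree filter on the interaction graph. The code below counts links per protein once in the full graph and keeps only links between proteins that meet the threshold; whether the original analysis recomputed link counts after filtering is not stated, so that choice, like the function name and toy graph, is an assumption.

```python
from collections import defaultdict

def filter_by_min_links(links, min_links):
    """Keep links whose two proteins each have at least min_links links.

    Degrees are counted once in the full graph, not recomputed after filtering.
    """
    degree = defaultdict(int)
    for a, b in links:
        degree[a] += 1
        degree[b] += 1
    return [(a, b) for a, b in links if degree[a] >= min_links and degree[b] >= min_links]

# Toy graph: A, B, C form a triangle; D hangs off C by a single link.
links = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
core = filter_by_min_links(links, min_links=2)
```

In the toy graph the filter keeps the triangle and drops the stray link to D, the same qualitative effect the legend attributes to raising the minimum number of links.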

Figure 3B shows parts of the thresholded PIP that are restricted to proteins with ≥20 links (23), highlighting large complexes. Some predicted complexes overlap with the gold-standard positives (cytoplasmic ribosome) or the PIE (exosome, RNA polymerase I, 26S proteasome). Comparison with the gold-standard negatives showed where the PIP likely produced false complexes. Many protein associations only appear in the PIP and thus potentially represent new interactions and complexes. An interesting example is the mitochondrial ribosome; it has appreciable overlap with both gold-standard positives and the PIE and contains plausible, newly predicted interactions with three proteins (19).

To further test the predictions in the PIP, we conducted TAP-tagging experiments, in which a protein expressed at its normal intracellular concentration ("bait") is tagged and used to "pull down" endogenous protein complexes. We picked 98 proteins as TAP-tagging baits. These produced 424 experimental interactions overlapping with the PIP thresholded at L_cut = 300. (Of these, 185, in turn, overlapped with gold-standard positives, and 16 with negatives, highlighting the reliability of our experiments.)

Figure 3C shows three examples of the overlap between the PIP and TAP-tagging. We predicted that the putative DEAD-box RNA helicase Dbp3 interacts with three other RNA helicases (Hca4, Mak5, and Dbp7), with proteins implicated in ribosomal RNA (rRNA) metabolism (e.g., Nop2, Rrp5, Mak5, and components of RNA polymerase I), and with Nsr1, the yeast homolog of mammalian Nucleolin and a GAR domain–containing protein (24). When Dbp3 was TAP-tagged and purified, we found previously unknown interactions with Nsr1, Hca4, and Nop1, connecting Dbp3 with known rRNA-processing proteins. Further purifications with TAP-tagged versions of Mak5, Rrp5, Dbp7, Dbp3, Nsr1, Hca4, and Nop2 verified the physical association.

The nucleosome, a fundamental unit within chromatin, provides a second example of overlap. It is composed of eight histones (two H2A, two H2B, two H3, and two H4), which can block RNA polymerase II progression. This blockage is relieved upon interaction with the FACT complex (also known as SPN or yFACT), which consists of Spt16 and Pob3 in yeast. Mammalian Pob3 has a high mobility group (HMG) domain for interaction with histones; however, yeast Pob3 lacks this domain. Instead, the HMG protein Nhp6 (with two virtually identical isoforms, Nhp6A and Nhp6B) binds histones (25–27). [Nhp6 also binds DNA in competition with the nucleosome (28).] Our thresholded PIP and experimental data document a specific interaction between Nhp6A and Hhf1 (H4), pinpointing the contact between the nucleosome and Nhp6 to the H3-H4 heterodimer (Hhf1 and Hht1). This is plausible; because Nhp6 has been shown not to influence nucleosome reassembly (29), it is unlikely that it binds with the H2A-H2B dimer, which needs to reassociate with the nucleosome after binding FACT.

The replication complex, a third experimental validation of the PIP, assembles and disassembles from transiently interacting subcomplexes (e.g., MCM proteins, ORC, and polymerases) throughout the cell cycle (8, 30). Our predicted and experimentally verified interactions connect it, probably transiently, to another subcomplex, replication factor A (RFA, composed of Rfa1, Rfa2, and Rfa3). Specifically, we predicted and verified interactions between RFA and two proteins associated with other replication subcomplexes: Rfa2 with Top2 (a component of the nuclear synaptonemal complex) and Rfa1 with Pri2 (DNA polymerase α–primase subunit).

Finally, we predicted and verified by TAP-tagging that two proteins involved in translation elongation (Tef2 and Eft2) interact. This is plausible given that protein elongation is mediated by three factors in yeast: EF-1α (Tef1, Tef2), EF-2 (Eft1, Eft2), and EF-3 (Hef3, Yef3); most other eukaryotes lack EF-3. Previous experimental data suggest an interaction between yeast EF-1α and EF-3 (31). An interaction between EF-1α and EF-2 had not been demonstrated, although this is reasonable given their similar roles in elongation and their overlapping binding sites on the ribosome (32).

In summary, we have developed a Bayesian approach for integrating weakly predictive genomic features into reliable predictions of protein-protein interactions. Our de novo prediction of complexes replicated interactions found in the gold-standard positives and PIE. In addition, we confirmed several of our predictions with new experiments. The accuracy of the PIP was comparable to that of the PIE while simultaneously achieving greater coverage.

Our procedure lends itself naturally to the addition of more features, possibly further improving results. We anticipate that protein-protein interactions in organisms other than yeast can be explored in similar ways.

References and Notes

1. P. Uetz et al., Nature 403, 623 (2000).

2. T. Ito et al., Proc. Natl. Acad. Sci. U.S.A. 98, 4569 (2001).

3. A. C. Gavin et al., Nature 415, 141 (2002).

4. Y. Ho et al., Nature 415, 180 (2002).

5. I. Xenarios et al., Nucleic Acids Res. 30, 303 (2002).

6. H. W. Mewes et al., Nucleic Acids Res. 30, 31 (2002).

7. G. D. Bader et al., Nucleic Acids Res. 29, 242 (2001).

8. R. Jansen, D. Greenbaum, M. Gerstein, Genome Res. 12, 37 (2002).

9. A. Kumar et al., Genes Dev. 16, 707 (2002).

10. C. von Mering et al., Nature 417, 399 (2002).

11. A. M. Deane, L. Salwinski, I. Xenarios, D. Eisenberg, Mol. Cell. Proteomics 1, 349 (2002).

12. A. M. Edwards et al., Trends Genet. 18, 529 (2002).

13. G. D. Bader, C. W. Hogue, Nature Biotechnol. 20, 991 (2002).

14. A. Kumar, M. Snyder, Nature 415, 123 (2002).

15. E. M. Marcotte, M. Pellegrini, M. J. Thompson, T. O. Yeates, D. Eisenberg, Nature 402, 83 (1999).

16. M. Steffen, A. Petti, J. Aach, P. D'Haeseleer, G. Church, BMC Bioinformatics 3, 34 (2002).

17. R. Jansen, N. Lan, J. Qian, M. Gerstein, J. Struct. Funct. Genomics 2, 71 (2002).

18. A. Drawid, M. Gerstein, J. Mol. Biol. 301, 1059 (2000).

19. Materials and methods are available as supporting material on Science Online.

20. T. R. Hughes et al., Cell 102, 109 (2000).

21. R. J. Cho et al., Mol. Cell 2, 65 (1998).

22. M. Ashburner et al., Nature Genet. 25, 25 (2000).

23. See http://genecensus.org/intint.

24. I. P. Girard et al., EMBO J. 11, 673 (1992).

25. N. K. Brewster, G. C. Johnston, R. A. Singer, Mol. Cell. Biol. 21, 3491 (2001).

26. A. A. Travers, EMBO Rep. 4, 131 (2003).

27. T. Formosa et al., Genetics 162, 1557 (2002).

28. Y. Yu, P. Eriksson, L. T. Bhoite, D. J. Stillman, Mol. Cell. Biol. 23, 1910 (2003).

29. R. C. Bash, J. M. Vargason, S. Cornejo, P. S. Ho, D. Lohr, J. Biol. Chem. 276, 861 (2001).

30. O. M. Aparicio, D. M. Weinstein, S. P. Bell, Cell 91, 59 (1997).

31. M. Anand, K. Chakraburtty, M. J. Marton, A. G. Hinnebusch, T. G. Kinzy, J. Biol. Chem. 278, 6985 (2003).

32. O. Kovalchuke, R. Kambampati, E. Pladies, K. Chakraburtty, Eur. J. Biochem. 258, 986 (1998).

33. We thank C. Sander and G. Bader for critical discussions.

Supporting Online Material

www.sciencemag.org/cgi/content/full/302/5644/449/DC1

Materials and Methods

Figs. S1 to S3

Tables S1 and S2

References

29 May 2003; accepted 29 August 2003
10.1126/science.1087361
Include this information when citing this paper.

Volume 302, Number 5644, Issue of 17 Oct 2003, pp. 449-453.
Copyright © 2003 by The American Association for the Advancement of Science. All rights reserved.