A Bayesian Networks Approach for Predicting Protein-Protein
Interactions from Genomic Data
Ronald Jansen,1* Haiyuan Yu,1 Dov Greenbaum,1 Yuval Kluger,1
Nevan J. Krogan,4 Sambath Chung,1,2 Andrew Emili,4
Michael Snyder,2 Jack F. Greenblatt,4 Mark Gerstein1,3
We have developed an approach using Bayesian networks to
predict protein-protein interactions genome-wide in
yeast. Our method naturally weights and combines into
reliable predictions genomic features only weakly
associated with interaction (e.g., messenger
RNA coexpression, coessentiality, and colocalization). In
addition to de novo predictions, it can integrate often
noisy, experimental interaction data sets. We observe
that at given levels of sensitivity, our predictions are
more accurate than the existing high-throughput
experimental data sets. We validate our predictions with
TAP (tandem affinity purification) tagging experiments.
Our analysis, which gives a comprehensive view of yeast
interactions, is available at genecensus.org/intint.
1 Department of Molecular Biophysics and Biochemistry,
Yale University, 266 Whitney Avenue, Post Office Box 208114, New
Haven, CT 06520, USA. 2 Department of Molecular,
Cellular and Developmental Biology, Yale University, 266 Whitney
Avenue, Post Office Box 208114, New Haven, CT 06520,
USA. 3 Department of Computer Science, Yale
University, 266 Whitney Avenue, Post Office Box 208114, New Haven,
CT 06520, USA. 4 Banting and Best Department of
Medical Research, Department of Molecular and Medical Genetics,
University of Toronto, Toronto, M5G 1L6, Ontario, Canada.
* Present address:
Computational Biology Center, Memorial Sloan-Kettering
Cancer Center, 307 West 63rd Street, New York, NY 10021,
USA.
To whom correspondence should be addressed. E-mail:
mark.gerstein@yale.edu
Many fundamental biological processes involve protein-protein
interactions, and comprehensively identifying them is
important to systematically defining their cellular role.
New experimental and computational methods have vastly
increased the number of known or putative interactions,
cataloged in databases (1–7).
Much genomic information also relates to interactions
indirectly: Interacting proteins are often significantly
coexpressed (as shown by microarrays) and colocalized to
the same subcellular compartment (8,
9).
Unfortunately, interaction data sets are often incomplete
and contradictory (10–12).
In the context of genome-wide analyses, these
inaccuracies are greatly magnified because the protein
pairs that do not interact (negatives) far outnumber those
that do (positives). For instance, in yeast, the 6000 proteins allow
for 18 million
potential interactions, but the estimated number of
actual interactions is <100,000 (10,
13,
14).
Thus, even reliable techniques can generate many false
positives when applied genome-wide. This is similar to a
diagnostic with a 1% false-positive rate for a rare
disease occurring in 0.1% of the population, which would
roughly produce one true positive for every 10 false
ones. Further information is necessary.
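The arithmetic behind this analogy, and behind the count of possible protein pairs, can be checked with a short sketch. The function name is hypothetical, and a sensitivity of 1.0 is an assumption, since the text gives only the prevalence and the false-positive rate.

```python
from math import comb


def true_per_false_positive(prevalence, sensitivity, false_positive_rate):
    """Expected number of true positives per false positive."""
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * false_positive_rate
    return true_positives / false_positives


# Diagnostic analogy: 0.1% prevalence and a 1% false-positive rate.
# A sensitivity of 1.0 is an assumption; the text does not specify it.
diagnostic_ratio = true_per_false_positive(0.001, 1.0, 0.01)

# Unordered pairs among roughly 6000 yeast proteins.
possible_pairs = comb(6000, 2)

print(f"true positives per false positive: {diagnostic_ratio:.2f}")
print(f"possible protein pairs: {possible_pairs:,}")
```

The ratio comes out at about 0.1, that is, one true positive for every 10 false ones, and the pair count is just under 18 million.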
Consequently, when evaluating protein-protein interactions,
one needs to integrate evidence from many different
sources (15–17).
Here, we propose a Bayesian approach for integrating
interaction information that allows for the probabilistic
combination of multiple data sets and demonstrate its
application to yeast (18).
Our approach can be used for combining noisy interaction
data sets and for predicting interactions de novo, from
other genomic information. The basic idea is to assess
each source of evidence for interactions by comparing it
against samples of known positives and negatives
("gold-standards"), yielding a statistical reliability.
Then, extrapolating genome-wide, we predict the chance of
possible interactions for every protein pair by combining
each independent evidence source according to its
reliability. We verified our predictions by comparing
them against existing experimental interaction data (not
in the gold-standard) as well as new TAP (tandem affinity
purification) tagging experiments.
Among the many possible machine-learning approaches that
could be applied to predicting interactions (ranging from
simple unions and intersections of data sets to neural
networks, decision trees, and support-vector machines),
Bayesian networks have several advantages (19):
They allow for combining highly dissimilar types of data
(i.e., numerical and categorical), converting them to a
common probabilistic framework, without unnecessary
simplification; they readily accommodate missing data; and
they naturally weight each information source according
to its reliability. In contrast to "black-box"
predictors, Bayesian networks are readily interpretable
as they represent conditional probability relationships
among information sources.
The gold-standard data set on which we train
("parameterize") the Bayesian network should ideally be
(i) independent from the data sources serving as
evidence, (ii) sufficiently large for reliable
statistics, and (iii) free of systematic bias. We used
the MIPS (Munich Information Center for Protein Sequences)
complexes catalog as the gold-standard for positives (6).
This hand-curated list of protein complexes is based on
the literature [8250 pairs in our filtered version (19)].
A negatives gold-standard is harder to define, but
essential for successful training. Thus, we synthesized
negatives from lists of proteins in separate subcellular
compartments (9).
Our positive and negative gold-standards satisfy the
first two criteria and provide a good practical solution
for the third. Hence, our goal, precisely defined, was to
predict whether two proteins are in the same complex, not
whether they necessarily had direct physical contact.
As a measure of reliability, the overlap of information
sources (i.e., "interaction data sets," which could
either be noisy experimental data or sets of genomic
features) with the gold-standards can be expressed in
terms of a "likelihood ratio." For example, consider a
genomic feature f expressed in binary terms (i.e.,
"present" or "absent"). The likelihood ratio
L(f) is then defined as the fraction of
gold-standard positives having feature f divided
by the fraction of negatives having f. For two features
f1 and f2 with uncorrelated evidence, the likelihood ratio of
the combined evidence is simply the product
L(f1, f2) = L(f1) L(f2).
For correlated evidence, L(f1, f2) cannot be factorized in
this way.
Bayesian networks are a formal representation of such
relationships between features. The combined likelihood
ratio is proportional to the estimated odds that two
proteins are in the same complex, given multiple sources
of information.
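As a minimal sketch of these definitions, the following computes the likelihood ratio of a binary feature from gold-standard counts and combines two features with the product rule. All counts and function names are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from math import prod


def likelihood_ratio(pos_with_f, n_pos, neg_with_f, n_neg):
    """L(f): fraction of positives with f over fraction of negatives with f."""
    return (pos_with_f / n_pos) / (neg_with_f / n_neg)


def combine_uncorrelated(ratios):
    """Product rule, valid only when the features carry uncorrelated evidence."""
    return prod(ratios)


# Hypothetical counts, for illustration only.
l_f1 = likelihood_ratio(pos_with_f=400, n_pos=1000, neg_with_f=20, n_neg=1000)
l_f2 = likelihood_ratio(pos_with_f=300, n_pos=1000, neg_with_f=100, n_neg=1000)
l_combined = combine_uncorrelated([l_f1, l_f2])

print(l_f1, l_f2, l_combined)
```

With these made-up counts, the first feature is 20 times and the second 3 times more common among positives than negatives, so a pair showing both has a combined ratio of about 60.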
We predict a protein pair as positive if its combined
likelihood ratio exceeds a particular cutoff (L
> Lcut) (negative otherwise). To get
an overall assessment of how the prediction performs, we
segmented the gold-standard into separate training and testing
sets (using a sevenfold cross-validation protocol). Then
we evaluated the number of true- (TP) and
false-positive (FP) predictions in the testing
set. Finally, we applied the Bayesian network beyond the
testing set, computing likelihood ratios for all possible
protein pairs in the genome.
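The evaluation step can be sketched as follows for a single held-out test fold. The scores are made up, and the cross-validation loop itself is omitted; the function name and data layout are assumptions for illustration.

```python
def evaluate_cutoff(scored_pairs, l_cut):
    """Count TP and FP among gold-standard pairs predicted positive (L > l_cut).

    scored_pairs is a list of (likelihood_ratio, is_gold_positive) tuples.
    """
    tp = sum(1 for lr, is_pos in scored_pairs if is_pos and lr > l_cut)
    fp = sum(1 for lr, is_pos in scored_pairs if not is_pos and lr > l_cut)
    n_pos = sum(1 for _, is_pos in scored_pairs if is_pos)
    sensitivity = tp / n_pos if n_pos else 0.0
    return tp, fp, sensitivity


# Hypothetical held-out test fold; the scores are made up for illustration.
test_fold = [
    (1200.0, True), (900.0, True), (700.0, True), (400.0, True),
    (650.0, False), (50.0, False), (5.0, False), (0.5, False),
]
tp, fp, sensitivity = evaluate_cutoff(test_fold, l_cut=600.0)
print(tp, fp, sensitivity)
```

Raising `l_cut` trades sensitivity for a higher TP/FP ratio, which is the trade-off the cutoff is meant to control.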
Figure
1 schematically shows the information sources and results
of our calculations. We term the results "probabilistic
interactomes" (PIs), in which each protein pair is
associated with a probability measure for being in the
same complex (i.e., likelihood ratio L). Our
procedure not only allows combining existing experimental
interaction data sets (resulting in a PI-experimental or
"PIE"), but also the de novo prediction of protein
complexes from genomic data sets (when the input data are
not interaction data sets per se, resulting in a
PI-predicted or "PIP").
Fig. 1. The information sources
integrated in our analysis and their comparison with each
other. (A) The three different types of data used: (i)
Interaction data from high-throughput experiments. These
comprise large-scale two-hybrid screens (Y2H) (1,
2)
and in vivo pull-down experiments (3,
4).
(ii) Other genomic features. We considered expression data,
biological function of proteins (from Gene Ontology biological
process and the MIPS functional catalog), and data about
whether proteins are essential (6,
19–22).
(iii) Gold-standards of known interactions and noninteracting
protein pairs. (The MIPS functional catalog differs from the
MIPS complexes catalog used for the gold-standard.) (B)
Combination of data sets into probabilistic interactomes.
(C) Comparison of the probabilistic interactomes with
the gold-standards and our new experimental data. Numbers next
to the arrows indicate which figures refer to these various
comparisons.
We combined four interaction data sets from high-throughput
experiments into the PIE (1–4)
(Fig.
1B). The PIE represents a transformation of the
individual binary-valued interaction sets into a data set
where every protein pair is weighted according to the
likelihood that it exists within a complex.
We computed the PIP from several genomic data sources: the
correlation of mRNA amounts in two expression data sets
(one with temporal profiles during the cell cycle, one of
expression levels under 300 cellular conditions), two
sets of information on biological function, and
information about whether proteins are essential for
survival (6,
20–22).
Although none of these information sources are
interaction data per se, they contain information weakly
associated with interaction: Two subunits of the same
protein complex often have coregulated mRNA expression and
similar biological functions and are more likely to be
both essential or nonessential (8).
For computing the PIE and the PIP, we used two different
types of Bayesian networks: a "naïve" network for the PIP
and a fully connected one for the PIE (19).
The naïve network is simpler to compute but requires
information sources with essentially uncorrelated
evidence. In contrast, the fully connected Bayesian
network accommodates correlated evidence, which is the
case for the four experimental interaction data sets.
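One way to read the distinction is that a fully connected network estimates a likelihood ratio for each joint combination of evidence values rather than multiplying per-source ratios. The sketch below illustrates this for two binary data sets. It is an illustration under that assumption, not the authors' implementation, and all counts are hypothetical.

```python
from collections import Counter


def joint_likelihood_ratios(pos_vectors, neg_vectors):
    """Likelihood ratio for every observed combination of binary evidence."""
    pos_counts = Counter(pos_vectors)
    neg_counts = Counter(neg_vectors)
    ratios = {}
    for combo in set(pos_counts) | set(neg_counts):
        pos_fraction = pos_counts[combo] / len(pos_vectors)
        neg_fraction = neg_counts[combo] / len(neg_vectors)
        # A combination never seen among negatives gives an unbounded ratio;
        # real data would likely need smoothing or binning here.
        ratios[combo] = pos_fraction / neg_fraction if neg_fraction else float("inf")
    return ratios


# Hypothetical gold-standard pairs scored by two correlated data sets.
# Each tuple is (found in data set 1, found in data set 2).
positives = [(1, 1)] * 6 + [(1, 0)] * 2 + [(0, 1)] * 1 + [(0, 0)] * 1
negatives = [(1, 1)] * 1 + [(1, 0)] * 9 + [(0, 1)] * 10 + [(0, 0)] * 80

joint = joint_likelihood_ratios(positives, negatives)

# Naive product for the same evidence, from the marginal ratios.
l1 = (8 / 10) / (10 / 100)   # data set 1 positive
l2 = (7 / 10) / (11 / 100)   # data set 2 positive
naive = l1 * l2

print(joint[(1, 1)], naive)
```

In this toy example the joint estimate for a pair found in both data sets (about 60) differs from the naive product (about 51), because the two sources are not independent. With four binary data sets, the joint table has at most 16 entries.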
Finally, we combined the PIP, PIE, and gold-standard into a
total PI (PIT), which represents our most comprehensive
view of the known and putative protein complexes in yeast
(23).
Because the PIP and PIE data provide essentially
uncorrelated evidence for protein-protein interactions,
we chose a naïve network to construct the PIT.
Figure
1C gives an overview of how we compared the PIP, PIE,
gold-standard, and our new experiments. In particular, Fig.
2 shows the performance of the integration resulting
in the PIP and PIE. When tested against the
gold-standard, we observed that the ratio of true to
false positives (TP/FP) increases
monotonically with Lcut, confirming L
as an appropriate measure of the odds of a real
interaction. Conservatively estimated, protein pairs with
L > 600 have a better than 50% chance of being
in the same complex, suggesting Lcut = 600 as a
useful threshold (19).
Unless otherwise noted, we use this throughout our
analysis. It gives 9897 predicted interactions from the
PIP and 163 from the PIE. In contrast, likelihood ratios
derived from single genomic features (e.g., mRNA
coexpression) or from individual interaction experiments
(e.g., the Ho data set) did not exceed the cutoff when
used alone, with TP/FP values far below 1.
This demonstrates that information sources that, taken
alone, are only weak predictors of interactions can yield
reliable predictions when combined.
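The link between the cutoff and a 50% chance follows from Bayes' rule in odds form: posterior odds equal prior odds times the likelihood ratio. The sketch below uses prior odds of 1/600, which is an assumption rather than a figure stated in this section. It corresponds to about 30,000 true pairs among 18 million, which is compatible with the bound of <100,000 quoted earlier.

```python
def posterior_probability(likelihood_ratio, prior_odds):
    """Convert a likelihood ratio into a posterior probability via odds."""
    posterior_odds = likelihood_ratio * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


# Assumed prior odds that a random protein pair is in the same complex.
ASSUMED_PRIOR_ODDS = 1.0 / 600.0

for l_value in (60.0, 600.0, 6000.0):
    p = posterior_probability(l_value, ASSUMED_PRIOR_ODDS)
    print(f"L = {l_value:>6.0f}  ->  P(same complex) = {p:.3f}")
```

Under this assumed prior, L = 600 gives even odds, and any pair above the cutoff has a better than 50% chance of being a true positive.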
Fig. 2. Comparison of PIP and
PIE with each other and with the individual information
sources. (A) The TP/FP ratio as a
function of Lcut for the PIP and the
individual data from which it was computed. The ratio is
computed as follows:
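$$\frac{TP}{FP}(L_{cut}) = \frac{\sum_{L > L_{cut}} \mathrm{pos}(L)}{\sum_{L > L_{cut}} \mathrm{neg}(L)}$$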
where pos(L) and neg(L) are the number of
positives and negatives in the gold-standard with a given
likelihood ratio L. The vertical line indicates our
standard threshold Lcut = 600. (B)
The same plot as in (A), but for the PIE. (C)
Comparison of TP/FP ratios between the PIP and
PIE. The abscissa represents the sensitivity of the
probabilistic interactomes. The gray area indicates the gain
of sensitivity of the PIP over the PIE for equal
TP/FP ratios. The arrow shows the difference in
sensitivity at TP/FP = 0.3. At this level, the
PIP contains 183,295 protein pairs, of which 6179 are
gold-standard positives (75% sensitivity), whereas the PIE
contains 31,511 protein pairs and 1758 gold-standard positives
among these (21% sensitivity). This difference in sensitivity
between PIE and PIP illustrates the value of the de novo
prediction. It also reflects, to some degree, that the
experiments were done only on subsets of the genome and may
have been measuring different types of interactions than the
complexes' gold-standard, which we used to parameterize the
PIP. The white circles show the performance of a voting
procedure in which each of the four genomic features (from
which we computed the PIP) contributed an additive vote. There
are four possible outcomes in the additive voting procedure,
depending on how many data sets contribute a positive vote (19).
The PIP had a higher sensitivity than the PIE for comparable
TP/FP ratios (Fig.
2C). ("Sensitivity" measures coverage and is defined
as TP/P, where P is the number of gold-standard
positives.) Specifically, the sensitivity of the PIP is
27% at our
cutoff. This may seem low, but compares favorably with
the PIE, which had a sensitivity of less than 1%. This
means that we can predict, at comparable error levels,
more complex interactions de novo than are present in the
high-throughput experimental interaction data sets.
One might ask whether simpler voting procedures can match
the performance of more complicated machine-learning
methods such as Bayesian networks. To test this
hypothesis, we compared the PIP with a voting procedure
where each of the four genomic features contributes an
additive vote toward positive classification. We found
that the Bayesian network achieved greater sensitivity
for comparable TP/FP ratios (Fig.
2C) (19).
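A small sketch shows why the two procedures can disagree: voting counts every feature equally, whereas the naive Bayesian combination weights each feature by its likelihood ratio. The ratios below are hypothetical, and the exact voting rule used in the analysis is described only in the supporting material, so this is an illustration rather than a reproduction.

```python
from math import prod

# Hypothetical (L if present, L if absent) for four binary features.
FEATURE_RATIOS = [(50.0, 0.8), (2.0, 0.9), (2.0, 0.9), (1.5, 0.95)]


def vote_count(features):
    """Additive voting: each present feature contributes one vote."""
    return sum(1 for present in features if present)


def naive_bayes_ratio(features, feature_ratios=FEATURE_RATIOS):
    """Naive Bayesian combination: product of per-feature likelihood ratios."""
    return prod(
        present_l if present else absent_l
        for present, (present_l, absent_l) in zip(features, feature_ratios)
    )


pair_a = (True, False, False, False)   # one strongly predictive feature
pair_b = (False, True, True, False)    # two weakly predictive features

print(vote_count(pair_a), naive_bayes_ratio(pair_a))
print(vote_count(pair_b), naive_bayes_ratio(pair_b))
```

Voting ranks the second pair higher (two votes against one), while the weighted combination ranks the first pair higher (about 38 against about 3), because its single feature is far more informative.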
Figure
3 shows parts of the PIP and PIE graphs and how these
compare with the gold-standard and our new experiments.
First, to test whether the thresholded PIP was biased
toward certain complexes, we looked at the distribution
of predictions among gold-standard positives (Fig.
3A); they were roughly equally apportioned among the
different complexes, suggesting a lack of bias.
Fig. 3. Representations of the
thresholded PIP (de novo prediction) compared with different
data sets. (A) The complete set of gold-standard
positives and their overlap with the PIP. The PIP (green)
covers 27% of the gold-standard positives (yellow). (B)
A graph of the largest complexes in the PIP, i.e., only those
proteins in the thresholded PIP having at least 20 links. (Left)
Overlapping gold-standard positives are shown in green, PIE
links in blue, and overlaps with both the PIE and
gold-standard positives in black. (Right) Overlapping
gold-standard negatives are shown in red. Regions with many
red links indicate potential false-positive predictions.
(C) Three PIP complexes that we partially verified by
TAP-tagging. Each complex contains the proteins linked to a
central protein (gray) after thresholding the PIP at
Lcut = 300. Interactions verified by our
TAP-tagging are shown in dark blue and PIE links in light
blue; gray links indicate where TAP-tagging overlapped with
PIE links.
We have thus far treated all interactions as independent.
However, the joint distribution of interactions in the
PIs can help identify large complexes: An ideal complex
should be a "clique" in an interaction graph (i.e., a
subgraph with N(N – 1)/2 links between
N proteins). Although this rarely happens in practice,
because of incorrect or missing links, large complexes
tend to have many interconnections within them, whereas
false-positive links to outside proteins tend to occur
randomly, without a coherent pattern (Fig.
4).
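The clique count and the idea of isolating densely connected proteins can be sketched as follows. The graph is hypothetical, and the filter is a single pass over the original link counts; whether link counts should be recomputed after filtering is not specified here, so that choice is an assumption.

```python
from collections import Counter


def clique_link_count(n_proteins):
    """Number of links in an ideal complex (clique) of n_proteins."""
    return n_proteins * (n_proteins - 1) // 2


def filter_by_min_links(edges, min_links):
    """Keep links whose two proteins each have at least min_links links."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return [(a, b) for a, b in edges
            if degree[a] >= min_links and degree[b] >= min_links]


# Hypothetical graph: a 4-protein clique plus one stray link to "X".
clique = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D")]
edges = clique + [("D", "X")]
kept = filter_by_min_links(edges, min_links=3)

print(clique_link_count(4), kept)
```

The stray link to "X" is dropped because "X" has only one link, while all six links of the clique survive.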
Fig. 4. TP/FP for
subsets of the thresholded PIP that only include proteins with
a minimum number of links. Requiring a minimum number of links
isolates large complexes in the thresholded PIP graph (Fig.
3B). Increasing the minimum number of links raises
TP/FP by preserving the interactions among
proteins in large complexes, while filtering out
false-positive interactions with heterogeneous groups of
proteins outside the complexes.
Figure
3B shows parts of the thresholded PIP that are restricted
to proteins with at least 20 links (23),
highlighting large complexes. Some predicted complexes
overlap with the gold-standard positives (cytoplasmic
ribosome) or the PIE (exosome, RNA polymerase I,
26S proteasome). Comparison with the gold-standard
negatives showed where the PIP likely produced false
complexes. Many protein associations only appear in the
PIP and thus potentially represent new interactions and
complexes. An interesting example is the mitochondrial
ribosome; it has appreciable overlap with both
gold-standard positives and the PIE and contains
plausible, newly predicted interactions with three
proteins (19).
To further test the predictions in the PIP, we conducted
TAP-tagging experiments, in which a protein expressed at
its normal intracellular concentration ("bait") is tagged
and used to "pull down" endogenous protein complexes. We
picked 98 proteins as TAP-tagging baits. These produced
424 experimental interactions overlapping with the PIP
thresholded at Lcut = 300. (Of these, 185, in
turn, overlapped with gold-standard positives, and 16
with negatives, highlighting the reliability of our
experiments.)
Figure
3C shows three examples of the overlap between the PIP
and TAP-tagging. We predicted that the putative DEAD-box
RNA helicase Dbp3 interacts with three other RNA
helicases (Hca4, Mak5, and Dbp7), with proteins
implicated in ribosomal RNA (rRNA) metabolism (e.g.,
Nop2, Rrp5, Mak5, and components of RNA polymerase I),
and with Nsr1, the yeast homolog of mammalian Nucleolin
and a GAR domain–containing protein (24).
When Dbp3 was TAP-tagged and purified, we found
previously unknown interactions with Nsr1, Hca4, and
Nop1, connecting Dbp3 with known rRNA-processing
proteins. Further purifications with TAP-tagged versions
of Mak5, Rrp5, Dbp7, Dbp3, Nsr1, Hca4, and Nop2 verified
the physical association.
The nucleosome, a fundamental unit within chromatin,
provides a second example of overlap. It is composed of
eight histones (two H2A, two H2B, two H3, and two H4),
which can block RNA polymerase II progression. This
blockage is relieved upon interaction with the FACT
complex (also known as SPN or yFACT), which consists of
Spt16 and Pob3 in yeast. Mammalian Pob3 has a high mobility
group (HMG) domain for interaction with histones; however,
yeast Pob3 lacks this domain. Instead, the HMG protein
Nhp6 (with two virtually identical isoforms, Nhp6A and
Nhp6B) binds histones (25–27).
[Nhp6 also binds DNA in competition with the nucleosome
(28).]
Our thresholded PIP and experimental data document a
specific interaction between Nhp6A and Hhf1 (H4),
pinpointing the contact between the nucleosome and Nhp6 to
the H3-H4 heterodimer (Hhf1 and Hht1). This is plausible;
because Nhp6 has been shown not to influence nucleosome
reassembly (29),
it is unlikely that it binds with the H2A-H2B dimer, which
needs to reassociate with the nucleosome after binding
FACT.
The replication complex, a third experimental validation of
the PIP, assembles and disassembles from transiently
interacting subcomplexes (e.g., MCM proteins, ORC, and
polymerases) throughout the cell cycle (8,
30).
Our predicted and experimentally verified interactions
connect it, probably transiently, to another subcomplex,
replication factor A (RFA, composed of Rfa1, Rfa2, and
Rfa3). Specifically, we predicted and verified
interactions between RFA and two proteins associated with
other replication subcomplexes: Rfa2 with Top2 (a
component of the nuclear synaptonemal complex) and Rfa1
with Pri2 (DNA polymerase α–primase subunit).
Finally, we predicted and verified by TAP-tagging that two
proteins involved in translation elongation (Tef2 and
Eft2) interact. This is plausible given that protein
elongation is mediated by three factors in yeast:
EF-1 (Tef1,
Tef2), EF-2 (Eft1, Eft2), and EF-3 (Hef3, Yef3); most
other eukaryotes lack EF-3. Previous experimental data
suggest an interaction between yeast EF-1 and EF-3 (31).
An interaction between EF-1 and EF-2 had not
been demonstrated, although this is reasonable given their
similar roles in elongation and their overlapping binding
sites on the ribosome (32).
In summary, we have developed a Bayesian approach for
integrating weakly predictive genomic features into
reliable predictions of protein-protein interactions. Our
de novo prediction of complexes replicated interactions
found in the gold-standard positives and PIE. In
addition, we confirmed several of our predictions with
new experiments. The accuracy of the PIP was comparable
to that of the PIE while simultaneously achieving greater
coverage.
Our procedure lends itself naturally to the addition of more
features, possibly further improving results. We
anticipate that protein-protein interactions in organisms
other than yeast can be explored in similar ways.
References and Notes
1. P. Uetz et al., Nature 403, 623 (2000).
2. T. Ito et al., Proc. Natl. Acad. Sci. U.S.A. 98, 4569 (2001).
3. A. C. Gavin et al., Nature 415, 141 (2002).
4. Y. Ho et al., Nature 415, 180 (2002).
5. I. Xenarios et al., Nucleic Acids Res. 30, 303 (2002).
6. H. W. Mewes et al., Nucleic Acids Res. 30, 31 (2002).
7. G. D. Bader et al., Nucleic Acids Res. 29, 242 (2001).
8. R. Jansen, D. Greenbaum, M. Gerstein, Genome Res. 12, 37 (2002).
9. A. Kumar et al., Genes Dev. 16, 707 (2002).
10. C. von Mering et al., Nature 417, 399 (2002).
11. C. M. Deane, L. Salwinski, I. Xenarios, D. Eisenberg, Mol. Cell. Proteomics 1, 349 (2002).
12. A. M. Edwards et al., Trends Genet. 18, 529 (2002).
13. G. D. Bader, C. W. Hogue, Nature Biotechnol. 20, 991 (2002).
14. A. Kumar, M. Snyder, Nature 415, 123 (2002).
15. E. M. Marcotte, M. Pellegrini, M. J. Thompson, T. O. Yeates, D. Eisenberg, Nature 402, 83 (1999).
16. M. Steffen, A. Petti, J. Aach, P. D'Haeseleer, G. Church, BMC Bioinformatics 3, 34 (2002).
17. R. Jansen, N. Lan, J. Qian, M. Gerstein, J. Struct. Funct. Genomics 2, 71 (2002).
18. A. Drawid, M. Gerstein, J. Mol. Biol. 301, 1059 (2000).
19. Materials and methods are available as supporting material on Science Online.
20. T. R. Hughes et al., Cell 102, 109 (2000).
21. R. J. Cho et al., Mol. Cell 2, 65 (1998).
22. M. Ashburner et al., Nature Genet. 25, 25 (2000).
23. See http://genecensus.org/intint.
24. J. P. Girard et al., EMBO J. 11, 673 (1992).
25. N. K. Brewster, G. C. Johnston, R. A. Singer, Mol. Cell. Biol. 21, 3491 (2001).
26. A. A. Travers, EMBO Rep. 4, 131 (2003).
27. T. Formosa et al., Genetics 162, 1557 (2002).
28. Y. Yu, P. Eriksson, L. T. Bhoite, D. J. Stillman, Mol. Cell. Biol. 23, 1910 (2003).
29. R. C. Bash, J. M. Vargason, S. Cornejo, P. S. Ho, D. Lohr, J. Biol. Chem. 276, 861 (2001).
30. O. M. Aparicio, D. M. Weinstein, S. P. Bell, Cell 91, 59 (1997).
31. M. Anand, K. Chakraburtty, M. J. Marton, A. G. Hinnebusch, T. G. Kinzy, J. Biol. Chem. 278, 6985 (2003).
32. O. Kovalchuke, R. Kambampati, E. Pladies, K. Chakraburtty, Eur. J. Biochem. 258, 986 (1998).
33. We thank C. Sander and G. Bader for critical discussions.
Supporting Online Material
www.sciencemag.org/cgi/content/full/302/5644/449/DC1
Materials and Methods
Figs. S1 to S3
Tables S1 and S2
References
29 May 2003; accepted 29 August
2003 10.1126/science.1087361 Include this information when
citing this paper.
Volume 302, Number 5644, Issue of 17 Oct 2003,
pp. 449-453. Copyright © 2003 by The American Association for the
Advancement of Science. All rights reserved.