Bioinformatics | Pages |
© |
Abstract Motivation: Transitive sequence matching expands the scope of sequence comparison by re-running the results of a given query against the databank as a new query. This sometimes results in the initial query sequence (Q) being related to a final match (M) indirectly, through a third, `intermediate' sequence (Q -> I -> M). This approach has often been suggested as providing greater sensitivity in sequence comparison; however, it has not yet been possible to gauge its improvement precisely.
Results: Here, this improvement is comprehensively measured by seeing what fraction of the known structural relationships transitive sequence matching can uncover beyond that found by normal pairwise comparison (i.e. direct linkage). The structural relationships are taken from a well-characterized test set, the scop classification of protein structure. Specifically, 2055 known structural similarities (called `pairs') between distantly related proteins constitute the basic test set. To make the measurement of transitive matching properly, special data sets, called `baseline sets', are derived from this. They consist of pairs of sequences that have a clear structural relationship that cannot be found by normal sequence comparison (i.e. they cannot be directly linked). Specifically, using standard sequence comparison protocols (FASTA with an e-value cut-off of 0.001), it is found that the baseline set consists of 1742 pairs. A third intermediate sequence can link 86 of these indirectly (5%), where this third sequence is drawn from the entire, current universe of protein sequences. The number of false positives is minimal. Furthermore, when one considers only the relationships within the test set that correspond to a close structural alignment, the coverage increases considerably. In particular, 862 of the baseline set pairs fit to better than 2.6 Å RMS, and transitive matching can find 62 of these (9%).
Availability: All the test data, including precise similarity values calculated from structural alignment, are available in tabular format over the Web from http://bioinfo.mbb.yale.edu/align
Contact: Mark.Gerstein@yale.edu
Transitive sequence matching is an approach taken toward improving sequence comparison. It entails taking the matches found after running a sequence comparison and then re-running them as new queries against the databank. The resulting matches consist of many of the previous matches plus (hopefully) some new ones. These new matches are, in turn, related back to the initial query only indirectly through an intermediate sequence (see TIL in Figure 1). This whole process can be repeated again, iteratively, with these new matches.
Data
Sequences with known structure were taken from the Protein Databank (PDB) (Bernstein et al., 1977). Fold definitions were taken from scop, Version 1.32 (May 1996) (Brenner, 1996; Murzin et al., 1996; Hubbard et al., 1997). Only the superfamily pairs, as opposed to fold pairs, were used as these have a much clearer structural relationship. It is the intention of the creators of scop that the superfamily pairs represent an evolutionary relationship between proteins with no appreciable sequence similarity, i.e. link proteins that are true homologues (Murzin et al., 1995; Hubbard, 1997). However, this is necessarily speculative, and all one can know for certain is that these pairs have a close structural relationship. Furthermore, the scop pairs have been extensively checked by both manual and automatic methods (Gerstein and Levitt, 1998), and are believed not to contain any false positives.
Based on the scop pairs, Brenner et al. (1995, 1998) clustered the PDB into 905 representative sequences at 40% identity (domains split between different chains are omitted from this count), making a list denoted pdb40d, which is distributed through the scop website (http://scop.mrc-lmb.cam.ac.uk/scop). The clustering employed a single-linkage approach similar to that in Hobohm et al. (1992, 1994), i.e. `select until done'. For the indirect sequence matching, pdb40d was compared against the 142 737 total sequences in the OWL composite databank (Version 27.1) (Bleasby et al., 1994). Low-complexity sequences were filtered out of OWL using the SEG program (Wootton and Federhen, 1993).
The overall analysis was greatly expedited by using a simple relational database implemented using DBM and perl5 (Wall et al., 1996). A number of detailed tables relevant to this paper will be made available over the Internet at http://bioinfo.mbb.yale.edu/align.
Sequence comparison
All sequence matching was with the FASTA program (Version 2.0) (Lipman and Pearson, 1985; Pearson and Lipman, 1988; Pearson, 1996, 1998; Pearson et al., 1997) with a k-tup value of 1. This program was chosen for a number of reasons.
Structure comparison
Structure matching was by the iterative dynamic programming method from Gerstein and Levitt (1996, 1998). This method aligns protein sequences on the basis of direct comparison of the corresponding three-dimensional structures. Two numbers characterize the alignment: the number of residues aligned (N) and the RMS deviation in C[alpha] positions after these atoms are fit onto each other (RMS). Since an alignment with a higher RMS value can be more significant than one with a lower RMS if there are more residues included in the first alignment, Gerstein and Levitt (1998) define a scaled RMS: RMS = 225 RMS/(N + 135). For an approximately average match of 90 residues, the scaled RMS is nearly the same as RMS (both quantities agree to within 10% for N between 70 and 110 residues). The distribution of scaled RMS values has a median value of 2.65 Å (with a mean of 2.68 Å and a SD of 0.87 Å), so 2.6 Å marks the approximate halfway point in the range of values and a reasonable division point. Levitt and Gerstein (1998), furthermore, show that this scaled RMS threshold corresponds approximately to a structural similarity P value of 0.01.
Overlap on intermediate sequence
In the transitive matching procedure, two sequences corresponding to structures (denoted Q for the query and M for the match) are linked through an intermediate sequence I. One has to take care that the region of match of Q on I overlaps with that of I on M. The criterion for overlap used here was quite conservative: the overlap region of Q on I must share at least 60 residues with that of I on M. Other less stringent criteria were tried. These tend to increase the number of matches, both true and false, but not to affect the results greatly, as long as they were reasonable.
As a practical matter, one can deal with the overlap when performing a search as follows. One runs the query against the whole databank, finding a number of direct matches Ii. Then the precise matching regions of each Ii are `cut out' and this is re-run against the databank again, producing matches Mi, j, which are then indirectly linked back to the original query Q. While simple and elegant, this procedure has the effect of changing the scores linking Ii and Mi, j relative to those found in an all-versus-all of the databank since the length of the intermediate sequence Ii is different when it is matched by Q or used as the query to find Mi, j.
The test data: sets of structural relationships
The analysis begins with the 8330 domains in the PDB indexed by the current version of scop. These are clustered into 905 representative sequences at a 40% identity level in the publicly distributed PDB40D dataset (see Methods). About 400 000 pairs can be formed from these representative sequences (i.e. 408 156 = 970 × 969/2).
Test set 1. By definition, all these pairs have a distant (`twilight-zone') level of sequence similarity. However, according to the scop classification (Murzin et al., 1995), 2055 (~0.5%) of them have a significant structural relationship, in being joined to one of 171 structural superfamilies (see Methods). These 2055 form the first set of `scop pairs'. They are not distributed equally amongst the scop superfamilies, with one superfamily (the Rossmann fold) containing 231 pairs and 70 others with just a single pair (Table 2). Furthermore, not all the structural relationships in these pairs are of equal weight, so it is worthwhile to consider two further `selections' based on somewhat closer structural relationships.
Test set 2. Because determining the structural similarity of short sequences is particularly problematic, one can exclude sequences of <60 amino acids. This gives 783 representative sequences, 305 371 possible pairs and 1801 structural relationships, which constitute a selection of 2055 original scop pairs. Gerstein and Levitt (1996, 1998) constructed structural alignments for all the preceding 1801 structural similarities of full-length sequences. These alignments allow one to determine the precise degree of structural similarity by calculating an RMS value from fitting the aligned atoms. There are 862 pairs that align with a scaled RMS of <2.6 Å (see Methods), and these form a second test set of scop pairs. They are (roughly) the more structurally similar half of the scop pairs.
Table 1.
FASTA e-value cut-off |
Type of pair | Pairs in total: test sets |
Linked directly |
Pairs remaining: baseline sets |
Linked indirectly within dataset |
Linked indirectly via OWL sequence |
1.0E-05 | TPs | 2055 | 220 11% | 1835 | 19 1.0% | 67 3.7% |
Low-RMS TPs | 862 | 162 19% | 700 | 15 2.1% | 62 8.9% | |
FPs | 0 | 0 | 12 | |||
1.0E-04 | TPs | 2055 | 271 13% | 1784 | 28 1.6% | 73 4.1% |
Low-RMS TPs | 862 | 198 23% | 664 | 17 2.6% | 64 9.6% | |
FPs | 1 | 0 | 13 | |||
1.0E-03 | TPs | 2055 | 313 15% | 1742 | 23 1.3% | 86 4.9% |
Low-RMS TPs | 862 | 219 25% | 643 | 18 2.8% | 74 12% | |
FPs | 3 | 1 | 13 |
Table 2.
Pairs per superfamily |
No. of super-families | Total no. of scop pairs |
No. linked directly |
Frac. linked directly |
No. of indirect or dir. links |
Frac. linked ind. or dir. |
231 | 1 | 231 | 21 | 9% | 26 | 11% |
171 | 1 | 171 | 5 | 3% | 9 | 5% |
153 | 1 | 153 | 3 | 2% | 3 | 2% |
120 | 2 | 240 | 22 | 9% | 47 | 20% |
91 | 1 | 91 | 36 | 40% | 57 | 63% |
78 | 1 | 78 | 4 | 5% | 4 | 5% |
55 | 3 | 165 | 12 | 7% | 12 | 7% |
36 | 3 | 108 | 22 | 20% | 28 | 26% |
28 | 8 | 224 | 21 | 9% | 24 | 11% |
21 | 6 | 126 | 45 | 36% | 53 | 42% |
15 | 4 | 60 | 7 | 12% | 8 | 13% |
10 | 11 | 110 | 15 | 14% | 16 | 15% |
6 | 17 | 102 | 29 | 28% | 33 | 32% |
3 | 42 | 126 | 42 | 33% | 46 | 37% |
1 | 70 | 70 | 29 | 41% | 33 | 47% |
Total | 171 | 2055 | 313 | 15% | 399 | 19% |
Direct linkage: the baseline for comparison
To measure the effectiveness of transitive matching, one can look at how many of the scop pairs in each of the three sets indirect linkage can find, relative to the number of false positives. However, before doing this, one has to determine how many pairs normal sequence comparison can find, in order to establish a baseline upon which transitive matching can improve. Here, pairs found by normal sequence comparison are called direct linkages since they involve no intermediate sequences (TDL; Figure 1).
This is essentially what Brenner et al. (1995, 1998; Brenner, 1996) did in asking how many of these real structural relationships could be found using the FASTA and BLASTP programs with their probabilistic scoring schemes. Here, these results are reproduced for the three specific sets of scop pairs discussed above. As Brenner et al. found, at a practical e-value threshold of 0.001, FASTA can find `direct linkages' for ~15% of the relationships in the first set of the scop pairs with only a few false positives.
Notice that when one considers just the 862 pairs with a close structural alignment (test set 2), it is possible to find a higher fraction of the pairs (25%). This last result is reasonable, given the established relationship between divergence in structure and sequence (Chothia and Lesk, 1986, 1987; Chothia and Gerstein, 1997). That is, the pairs with greater structural similarity are expected to have more sequence similarity.
As shown in Table 2 and 3, the fraction of pairs found varies considerably amongst the scop superfamilies, with a larger fraction of pairs found in the smaller superfamilies.
Work on direct linkage provides a necessary background against which to examine indirect linkage. Here, the idea is to find out how many additional structurally related pairs can be found by considering a third, intermediate sequence linked to both. These indirect linkages, both true and false, are illustrated schematically in Figure 1. To find them, it was necessary to construct more sets of scop pairs, the same as those described previously, but now with all the pairs found by direct matching removed. These are called baseline sets.
Baseline set 1-3. This consists of 1742 pairs taken from the 2055 pairs in test set 1 with direct matches removed. It involves 697 sequences in total. It is based on a FASTA e-value cut-off of 10e-3. If this cut-off is changed, obviously the number of pairs will change, so one also has baseline sets 1-4 and 1-5 for cut-offs of 10e-4 and 10e-5, and so on (see Table 1).Indirect linkage: the improvement over the baseline
By definition, each of the sequences in the baseline sets has no sequence similarity to any other sequence within the same set. Consequently, it is now readily possible to gauge the improvement provided by transitive matching: any new pairs found constitute the improvement. A transitive match could, in principle, be through the sequences within the baseline sets, i.e. if pair AB and pair BC exist in set 1-3, but not pair AC, there would be an indirect link between A and C. As shown in Table 1, this occurs, but not that frequently. For instance, for baseline set 1-3, one can find 23 of the 1742 pairs.
One can find more transitive matches by considering the entire population of protein sequences as candidate `intermediate sequences'. Specifically, one can run the sequences in each of the baseline sets against the OWL composite databank (which contains all currently known protein sequences) and determine whether any of the homologues found in OWL linked a scop pair in the baseline sets. Used in this way, the sequences in the baseline sets are better thought of as cluster representatives for whole families than individual sequences. Each indirect link made between them is effectively between all the members of two distant families.
The results are summarized in Table 1. For an e-value threshold of 0.001 (baseline set 1-3), one can find 86 of the 1742 baseline pairs through indirect linkage (5%), with 13 false positives. This means that using both direct and indirect linkage, sequence comparison with FASTA can find about a fifth of the scop pairs (399 of 2055 in test set 1) with 16 total false positives, about one for every 25 true positives.
On the 862 closely aligned scop pairs (test set 2), the coverage improves significantly. In particular, transitive matching can find 74 of the 643 pairs in baseline set 2-3 (12%).
As shown in Table 2, the fraction of extra pairs found by transitive matching varies somewhat among the 171 scop superfamilies, with the larger families having a greater degree of improvement relative to the smaller ones. This is perhaps because direct matching was more successful with the smaller superfamilies and because the larger superfamilies are potentially associated with a larger and more diversified collection of intermediate sequences. For instance, indirect plus direct linkage can find 30% of the 120 pairs in the `FAD/NAD(P)-binding domain' superfamily (representative identifier d2tpra2), whereas direct matching can only find 10% (Table 3). Likewise, for the globin superfamily (d3sdha_), the comparable statistics are 63% of 91 pairs found, improving on 40%. In contrast, for the 70 scop superfamilies containing only a single pair, indirect plus direct linkage finds 47% of pairs, only a small improvement over the 41% found by direct matching alone.
Table 3.
Superfamily ID | No. of scop pairs |
No. of direct links |
Frac. linked directly |
No. of indirect or dir. links |
Frac. linked ind. or dir. |
d1pmy__ | 55 | 4 | 7% | 4 | 7% |
d1r69__ | 28 | 3 | 11% | 3 | 11% |
d1tssa1 | 28 | 0 | 0% | 0 | 0% |
d1yrnb_ | 55 | 8 | 15% | 8 | 15% |
d2ebn__ | 120 | 10 | 8% | 11 | 9% |
d2olba_ | 28 | 1 | 4% | 1 | 4% |
d2pgd_2 | 231 | 21 | 9% | 26 | 11% |
d2tgf__ | 28 | 13 | 46% | 13 | 46% |
d2tpra2 | 120 | 12 | 10% | 36 | 30% |
d2trxa_ | 36 | 4 | 11% | 4 | 11% |
d2yhx_2 | 28 | 0 | 0% | 1 | 4% |
d3cd4_2 | 153 | 3 | 2% | 3 | 2% |
d3dpva_ | 171 | 5 | 3% | 9 | 5% |
d3hhrb2 | 28 | 1 | 4% | 1 | 4% |
d3inkc_ | 55 | 0 | 0% | 0 | 0% |
d3sdha_ | 91 | 36 | 40% | 57 | 63% |
d3tgl__ | 28 | 1 | 4% | 2 | 7% |
d4icb__ | 36 | 8 | 22% | 14 | 39% |
d5cytr_ | 28 | 2 | 7% | 3 | 11% |
d5p21__ | 78 | 4 | 5% | 4 | 5% |
d5znf__ | 36 | 10 | 28% | 10 | 28% |
Qualifications
The specific `improvement' values quoted here for the effect of transitive sequence matching are intended to be representative of the performance of the method on a comprehensive data set using reasonable parameters. They are, nevertheless, contingent upon the selection of proteins in the test set (the scop classification), the particular comparison baselines established (i.e. an e-value cut-off of 0.001 for the direct linkage baseline) and the precise criteria for overlap (as discussed in Methods). These parameters have been selected in a reasonable fashion to exclude highly similar sequences and give a sense of how indirect sequence matching performs near the margin, in the `twilight zone'. The scop data set (Murzin et al., 1995), in particular, is a popular and well-documented set of similarities. It has been validated by both automatic and manual methods (Gerstein and Levitt, 1997), and should give the most comprehensive possible indication of how indirect matching performs on the whole range of known similarities, rather than just on specific families.
Moreover, while the exact improvement values quoted here may change somewhat with different choices for test data, baselines and overlap criteria, with any reasonable choices, transitive matching will be able to find additional real pairs without generating many false positives. That is, while the absolute values may change, the relative improvement will remain. This is shown to some degree in Table 1, where the improvement statistics for a variety of baselines are collected, e.g. 0.0001 and 0.00001. (One could develop this table further to build up a complete analysis of coverage versus error rate.) Furthermore, the entire test set and related data files (including all the precise similarity values from structural alignment) are available over the Web, thus enabling the analysis to be easily repeated with any parameters of one's choice.
The results reported here show that transitive sequence matching is an effective technique for improving the sensitivity of standard sequence comparison methods in searching for structural similarities. Specifically, one is able to find considerably more of the known structural similarities in a `gold-standard' set of test data (scop) by combining indirect linkage via an intermediate sequence with direct linkage than by direct matching alone. Moreover, there are few false positives. One can intuitively rationalize the success of transitive sequence matching as this approach makes use of the (presumably) more diversified outliers of a cluster in a search, instead of searching with the centroid (as is the case, for instance, for profiles).
The effectiveness of transitive sequence matching was measured here for the FASTA program, but a similar analysis could easily be carried out for the other popular sequence comparison approaches, such as profiles and HMMs (Bowie et al., 1991; Johnson et al., 1993; Eddy et al., 1994; Krogh et al., 1994). In fact, such analyses have recently been performed successfully by other groups (C.Chothia and J.Park, personal communication). It is expected that careful measurement of the effectiveness of sequence comparison methods for detecting structural similarities will allow these methods to be used as a baseline for assessing more elaborate fold recognition methods such as threading (Jones et al., 1992; Bryant and Lawrence, 1993; Jones and Thornton, 1996).
The authors of scop (Alexey G.Murzin, Bartlett G.Ailey, Steven E.Brenner, Tim J.P.Hubbard and Cyrus Chothia) are acknowledged for supplying the pdb40d dataset (via website) and suggestions on the manuscript. Further thanks go to C.Chothia and J.Park for communicating similar work to that done here prior to publication, to S.Brenner for providing a copy of his thesis on CD-ROM, to H.Hegyi and R.Rambo for carefully reading the manuscript; and to M.Levitt for suggestions on overlap criteria. The NSF is thanked for support (grant DBI-9723182).