Measurement of the effectiveness of transitive sequence comparison, through a third `intermediate' sequence

Bioinformatics

Pages 707-714


Mark Gerstein

Department of Molecular Biophysics and Biochemistry, 266 Whitney Avenue, Yale University, PO Box 208114, New Haven, CT 06520, USA

Received on December 1, 1997; revised on July 10, 1998; accepted on July 11, 1998

Abstract

Motivation: Transitive sequence matching expands the scope of sequence comparison by re-running the matches found for a given query against the databank as new queries. This sometimes results in the initial query sequence (Q) being related to a final match (M) only indirectly, through a third, `intermediate' sequence (Q -> I -> M). This approach has often been suggested as a way to achieve greater sensitivity in sequence comparison; however, it has not yet been possible to gauge its improvement precisely.
Results: Here, this improvement is comprehensively measured by seeing what fraction of the known structural relationships transitive sequence matching can uncover beyond that found by normal pairwise comparison (i.e. direct linkage). The structural relationships are taken from a well-characterized test set, the scop classification of protein structure. Specifically, 2055 known structural similarities (called `pairs') between distantly related proteins constitute the basic test set. To make the measurement of transitive matching properly, special data sets, called `baseline sets', are derived from this. They consist of pairs of sequences that have a clear structural relationship that cannot be found by normal sequence comparison (i.e. they cannot be directly linked). Specifically, using standard sequence comparison protocols (FASTA with an e-value cut-off of 0.001), it is found that the baseline set consists of 1742 pairs. A third intermediate sequence can link 86 of these indirectly (5%), where this third sequence is drawn from the entire, current universe of protein sequences. The number of false positives is minimal. Furthermore, when one considers only the relationships within the test set that correspond to a close structural alignment, the coverage increases considerably. In particular, 862 of the baseline set pairs fit to better than 2.6 Å RMS, and transitive matching can find 62 of these (9%).
Availability: All the test data, including precise similarity values calculated from structural alignment, are available in tabular format over the Web from http://bioinfo.mbb.yale.edu/align
Contact: Mark.Gerstein@yale.edu

Introduction

Transitive sequence matching is an approach to improving the sensitivity of sequence comparison. It entails taking the matches found by a sequence comparison and re-running each of them as a new query against the databank. The resulting matches consist of many of the previous matches plus, ideally, some new ones. These new matches are related back to the initial query only indirectly, through an intermediate sequence (see TIL in Figure 1). The whole process can be repeated iteratively with the new matches.
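The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the protocol used in the paper: the `search` callable is a hypothetical stand-in for a pairwise comparison program run with a fixed significance cut-off, and the names `transitive_search`, `toy_hits`, and `rounds` are assumptions introduced only for this example.

```python
from typing import Callable, Dict, List, Set

# Hypothetical interface: sequence id -> ids of its significant matches.
Search = Callable[[str], Set[str]]


def transitive_search(query: str, search: Search, rounds: int = 1) -> Dict[str, List[str]]:
    """Map each sequence reached from `query` to the chain of links used.

    A chain [Q, M] is a direct match; a chain [Q, I, M] is an indirect
    match through the intermediate sequence I.
    """
    chains: Dict[str, List[str]] = {query: [query]}
    frontier = [query]
    for _ in range(rounds + 1):  # pass 0 is the ordinary direct search
        next_frontier = []
        for seq in frontier:
            for hit in sorted(search(seq)):
                if hit not in chains:  # keep the first (shortest) chain found
                    chains[hit] = chains[seq] + [hit]
                    next_frontier.append(hit)
        frontier = next_frontier
    del chains[query]
    return chains


# Toy databank: Q matches I, and I matches M, but Q does not match M directly.
toy_hits = {"Q": {"I"}, "I": {"Q", "M"}, "M": {"I"}}
result = transitive_search("Q", lambda s: toy_hits.get(s, set()), rounds=1)
direct_only = transitive_search("Q", lambda s: toy_hits.get(s, set()), rounds=0)
print(result)
```

With `rounds=0` the sketch reduces to ordinary pairwise comparison, so the difference between the two results is exactly the set of matches gained through an intermediate.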

FASTA e-value cut-off	Type of pair	Pairs in total: test sets	Linked directly	Pairs remaining: baseline sets	Linked indirectly within dataset	Linked indirectly via OWL sequence
1.0E-05	TPs	2055	220 (11%)	1835	19 (1.0%)	67 (3.7%)
	Low-RMS TPs	862	162 (19%)	700	15 (2.1%)	62 (8.9%)
	FPs		0		0	12
1.0E-04	TPs	2055	271 (13%)	1784	28 (1.6%)	73 (4.1%)
	Low-RMS TPs	862	198 (23%)	664	17 (2.6%)	64 (9.6%)
	FPs		1		0	13
1.0E-03	TPs	2055	313 (15%)	1742	23 (1.3%)	86 (4.9%)
	Low-RMS TPs	862	219 (25%)	643	18 (2.8%)	74 (12%)
	FPs		3		1	13
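
The baseline-set sizes and the indirect-link percentages in the table above can be reproduced with simple arithmetic. The sketch below assumes, as the numbers suggest, that each baseline set is the test set minus the directly linked pairs, and that the indirect percentages are taken relative to the baseline set rather than the full test set. The function and variable names are illustrative only.

```python
# Counts copied from the TP rows of the table:
# (pairs in total, linked directly, linked indirectly via OWL sequence).
tp_rows = {
    "1.0E-05": (2055, 220, 67),
    "1.0E-04": (2055, 271, 73),
    "1.0E-03": (2055, 313, 86),
}


def baseline_and_coverage(total, direct, indirect):
    """Return the baseline-set size and the percentage of it linked indirectly."""
    baseline = total - direct              # pairs that direct comparison misses
    coverage = 100.0 * indirect / baseline
    return baseline, coverage


summary = {cutoff: baseline_and_coverage(*counts) for cutoff, counts in tp_rows.items()}
for cutoff, (baseline, coverage) in summary.items():
    print(f"{cutoff}: baseline = {baseline}, indirect coverage = {coverage:.1f}%")
```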

Pairs per superfamily	No. of superfamilies	Total no. of scop pairs	No. linked directly	Fraction linked directly	No. linked directly or indirectly	Fraction linked directly or indirectly
231	1	231	21	9%	26	11%
171	1	171	5	3%	9	5%
153	1	153	3	2%	3	2%
120	2	240	22	9%	47	20%
91	1	91	36	40%	57	63%
78	1	78	4	5%	4	5%
55	3	165	12	7%	12	7%
36	3	108	22	20%	28	26%
28	8	224	21	9%	24	11%
21	6	126	45	36%	53	42%
15	4	60	7	12%	8	13%
10	11	110	15	14%	16	15%
6	17	102	29	28%	33	32%
3	42	126	42	33%	46	37%
1	70	70	29	41%	33	47%

Total	171	2055	313	15%	399	19%
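
The totals in the table above can be checked mechanically. The sketch below also tests one observation that is not stated in the text and should be read as an assumption: every value in the `Pairs per superfamily' column has the form n(n-1)/2, which is what one would expect if each superfamily with n members contributes all of its possible pairs. The helper name `members_for_pairs` is hypothetical.

```python
from math import comb

# Rows copied from the table:
# (pairs per superfamily, no. of superfamilies, linked directly, linked directly or indirectly).
rows = [
    (231, 1, 21, 26), (171, 1, 5, 9), (153, 1, 3, 3), (120, 2, 22, 47),
    (91, 1, 36, 57), (78, 1, 4, 4), (55, 3, 12, 12), (36, 3, 22, 28),
    (28, 8, 21, 24), (21, 6, 45, 53), (15, 4, 7, 8), (10, 11, 15, 16),
    (6, 17, 29, 33), (3, 42, 42, 46), (1, 70, 29, 33),
]

total_superfamilies = sum(n for _, n, _, _ in rows)
total_pairs = sum(pairs * n for pairs, n, _, _ in rows)
total_direct = sum(direct for _, _, direct, _ in rows)
total_linked = sum(linked for _, _, _, linked in rows)


def members_for_pairs(pairs):
    """Return n such that n choose 2 equals `pairs`, or None if no such n exists."""
    n = 2
    while comb(n, 2) < pairs:
        n += 1
    return n if comb(n, 2) == pairs else None


members = {pairs: members_for_pairs(pairs) for pairs, _, _, _ in rows}
print(total_superfamilies, total_pairs, total_direct, total_linked)
```

The difference between the last two totals is the number of pairs gained through an intermediate sequence at this cut-off.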

Superfamily ID	No. of scop pairs	No. linked directly	Fraction linked directly	No. linked directly or indirectly	Fraction linked directly or indirectly
d1pmy__	55	4	7%	4	7%
d1r69__	28	3	11%	3	11%
d1tssa1	28	0	0%	0	0%
d1yrnb_	55	8	15%	8	15%
d2ebn__	120	10	8%	11	9%
d2olba_	28	1	4%	1	4%
d2pgd_2	231	21	9%	26	11%
d2tgf__	28	13	46%	13	46%
d2tpra2	120	12	10%	36	30%
d2trxa_	36	4	11%	4	11%
d2yhx_2	28	0	0%	1	4%
d3cd4_2	153	3	2%	3	2%
d3dpva_	171	5	3%	9	5%
d3hhrb2	28	1	4%	1	4%
d3inkc_	55	0	0%	0	0%
d3sdha_	91	36	40%	57	63%
d3tgl__	28	1	4%	2	7%
d4icb__	36	8	22%	14	39%
d5cytr_	28	2	7%	3	11%
d5p21__	78	4	5%	4	5%
d5znf__	36	10	28%	10	28%