Genome Research -- Hegyi and Gerstein 11 (10): 1632

LETTER
Annotation Transfer for Genomics: Measuring Functional Divergence in Multi-Domain Proteins

Hedi Hegyi and Mark Gerstein¹

Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA

	ABSTRACT

Annotation transfer is a principal process in genome annotation. It involves "transferring" structural and functional annotation to uncharacterized open reading frames (ORFs) in a newly completed genome from experimentally characterized proteins similar in sequence. To prevent errors in genome annotation, it is important that this process be robust and statistically well-characterized, especially with regard to how it depends on the degree of sequence similarity. Previously, we and others have analyzed annotation transfer in single-domain proteins. Multi-domain proteins, which make up the bulk of the ORFs in eukaryotic genomes, present more complex issues in functional conservation. Here we present a large-scale survey of annotation transfer in these proteins, using scop superfamilies to define domain folds and a thesaurus based on SWISS-PROT keywords to define functional categories. Our survey reveals that multi-domain proteins have significantly less functional conservation than single-domain ones, except when they share the exact same combination of domain folds. In particular, we find that for multi-domain proteins, approximate function can be accurately transferred with only 35% certainty for pairs of proteins sharing one structural superfamily. In contrast, this value is 67% for pairs of single-domain proteins sharing the same structural superfamily. On the other hand, if two multi-domain proteins contain the same combination of two structural superfamilies, the probability of their sharing the same function increases to 80%; in the case of complete coverage along the full length of both proteins, this value increases further to > 90%. Moreover, we found that only 70 of the current total of 455 structural superfamilies are found in both single- and multi-domain proteins, and only 14 of these were associated with the same function in both categories of proteins. We also investigated the degree to which function could be transferred between pairs of multi-domain proteins with respect to the degree of sequence similarity between them, finding that functional divergence at a given amount of sequence similarity is always about two-fold greater for pairs of multi-domain proteins (sharing similarity over a single domain) in comparison to pairs of single-domain ones, though the overall shape of the relationship is quite similar. Further information is available at http://partslist.org/func or http://bioinfo.mbb.yale.edu/partslist/func.

	INTRODUCTION

The ultimate goal of the genome projects is to determine the structure and function of all the newly identified gene products. Fundamentally, this will be carried out via annotation transfer, transferring the structural and functional annotation from an experimentally characterized protein (as in a model organism such as Escherichia coli) to a predicted protein in a newly sequenced genome that shares similarity in sequence. The degree of annotation transferred will depend on the degree of sequence similarity. This process is shown schematically in Figure 1. In this paper, we aim to address this major question in bioinformatics, specifically focusing on multi-domain proteins, as they make up the bulk of the proteome in eukaryotic organisms (Gerstein 1998).

Figure 1 Schematic illustrating annotation transfer. This figure illustrates the process of annotation transfer for a group of hypothetical TIM barrel proteins. The leftmost panel represents sequence comparisons between idealized barrel domains from a number of organisms. The next panel shows analogous results for structural comparison, and the panel after that, functional comparison. The rightmost panel represents sequence comparisons between idealized multi-domain proteins that match over a single domain, the subject of much of this paper.

Our work is a direct outgrowth of two previous analyses of ours that concentrated on single-domain proteins. In an earlier paper, we found that the different structural classes of the scop classification system have different propensities to carry out certain types of function (Hegyi and Gerstein 1999). In particular, while the alpha/beta folds were disproportionately associated with enzymes and all-alpha and small folds with non-enzymes, the alpha + beta structures had an equal tendency for both enzymatic and non-enzymatic functions. Wilson et al. (2000) compared a large number of protein domains to one another in a pair-wise fashion with respect to similarities in sequence, structure, and function. Using a hybrid functional classification scheme merging the ENZYME and FlyBase systems (Gelbart et al. 1997; Bairoch 2000), they found that precise function is not conserved below 30-40% identity, although the broad functional class is usually preserved for sequence identities as low as 20-25%, given that the sequences have the same fold. Their survey also reinforced the previously established general exponential relationship between structural and sequence similarity (Chothia and Lesk 1986).

Other Work on Establishing Relationships between Sequence, Structure, and Function

Several other groups have studied the relationship between sequence, structure, and function in detail, attempting to determine the extent to which functional transference between matching proteins is feasible (Shah and Hunter 1997; Martin et al. 1998; Thornton et al. 1999, 2000; Zhang et al. 1999; Shapiro and Harris 2000; Todd et al. 2001). Orengo et al. (1999) analyzed protein families in the CATH database and concluded that > 96% of the folds in the PDB are associated with a single homologous family. By investigating enzymatic folds they also found that more than 95% of homologous families show either single or closely related functions. Pawlowski et al. (2000) studied the relationship between sequence and functional similarity in the twilight zone of 10%-15% sequence similarity and found a clear correlation between the two, with functional similarity based on the E.C. classification of enzymes.

Russell et al. (1997) analyzed binding sites in proteins with similar 3D structures and estimated that 90% of new remote homologs have common binding sites and similar functions. Eisenstein et al. (2000) evaluated the first results from the structural genomics projects and found that in many instances the protein structure itself offers an important clue to its biological function. Stawiski et al. (2000) found that function could be predicted rather successfully for just the proteases. Devos and Valencia (2000) presented a critical view of function transference between similar sequences, highlighting the limitations of this process due to errors in databases and the inherent complexity of the relationship between protein sequence-structure and function that does not allow "simplistic interpretations." They also found that binding sites are the least conserved features between related proteins, while the catalytic activity of enzymes is the most conserved one.

Multi-Domain Proteins with Divergent Functions: How Common?

Most of these previous investigations focused on single-domain proteins or did not distinguish between single- and multi-domain ones. It is not clear how the multi-domain proteins with various functions behave with respect to functional conservation; namely, whether they are more or less conserved than their single-domain counterparts. In particular, as shown in Figure 1, if one multi-domain protein shares a single domain fold with another one, it is not clear the degree to which the functional conservation of these proteins is constrained by the shared part, and to what degree it is influenced by other domains that are not shared.

Specific groups of proteins that have the same combination of structural domains but dramatically different functions illustrate this situation. One example is the combination of the SH3-domain (scop superfamily identifier 2.24.2) and the P-loop containing NTP hydrolase (3.29.1). While in higher organisms this combination is associated with presynaptic and tumor suppressor functions (SWISS-PROT names SP02_HUMAN and DLGI_DROME, respectively), in the lower Dictyostelium it was found in myosin (MYSP_DICDI). Another example is the combination of the FAD/NAD(P)-binding superfamily and FAD-linked reductases C-terminal superfamily (3.4.1 and 4.12.1 superfamilies, respectively). In one group of proteins they appear in enzymes of the oxidoreductase group (e.g., OXDA_CAEEL or PHHY_PSEAE), while in another they are found in a dissociation inhibitor (e.g., GDIA_HUMAN). It should be noted that the proteins are not covered completely by the structural matches, so it is quite possible that the rest of them contain totally different domains that are responsible for the dramatically different functions. However, do these two examples show a rather rare or a more frequent phenomenon? How often do multi-domain proteins, sharing the same structural domain composition, differ in their functions?

In this paper, we attempt to provide a comprehensive answer to this question. This is particularly timely given that most of the unknown proteins in eukaryotic genomes are multi-domain. We use the same approach as in our previous analyses, comparing the sequences of the structural domains in scop to those of SWISS-PROT using BLASTP. We focus on the functional divergence of single- and multi-domain proteins, extending previous investigations of single-domain proteins. Also, in comparison to previous work, we focus more on non-enzymatic functions and scop structural superfamilies, instead of folds.

	RESULTS

Our Approach to Functional and Structural Assignment

We used the BLASTP program (version 2.0) (Altschul et al. 1997) to identify the scop 1.39 (Murzin et al. 1995) structural domains in SWISS-PROT (version 37) (Bairoch and Apweiler 2000) with an e-value threshold of 10⁻⁴. We removed the hypothetical and fragment proteins. This resulted in two sets of proteins.

Single-Domain

Of the single-domain matches, only those that were almost completely covered with a match to a single structural domain were selected. (The maximum number of uncovered residues was set at 70 with an additional condition that a maximum of 40 residues on the N-terminal end and 30 residues on the C-terminus were allowed to be uncovered.) These criteria resulted in 1818 single-domain proteins being selected from SWISS-PROT.
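The coverage criteria for single-domain proteins can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `DomainMatch` and `passes_single_domain_coverage` are hypothetical, match coordinates are taken to be 1-based and inclusive, and the sketch does not reproduce the authors' actual filtering scripts.

```python
from typing import NamedTuple

# Thresholds as stated in the text above.
MAX_UNCOVERED_TOTAL = 70   # total residues not covered by the domain match
MAX_UNCOVERED_N_TERM = 40  # uncovered residues allowed before the match
MAX_UNCOVERED_C_TERM = 30  # uncovered residues allowed after the match


class DomainMatch(NamedTuple):
    """Hypothetical record: position of one structural-domain match."""
    start: int  # first matched residue (1-based, inclusive; an assumption)
    end: int    # last matched residue (1-based, inclusive; an assumption)


def passes_single_domain_coverage(protein_length: int, match: DomainMatch) -> bool:
    """Return True if the protein is almost completely covered by the match."""
    uncovered_n = match.start - 1
    uncovered_c = protein_length - match.end
    # With the stated limits (40 + 30 = 70) the total check is implied by the
    # two terminal checks; it is kept so each stated condition is explicit.
    return (
        uncovered_n <= MAX_UNCOVERED_N_TERM
        and uncovered_c <= MAX_UNCOVERED_C_TERM
        and uncovered_n + uncovered_c <= MAX_UNCOVERED_TOTAL
    )
```

For example, a 200-residue protein matched over residues 41 to 170 leaves exactly 40 and 30 residues uncovered at the two ends and would be accepted, whereas a match over residues 42 to 170 would be rejected.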

Multi-Domain

We selected 4763 multi-domain proteins from SWISS-PROT. All of these matched (in different locations) at least two domains of known structure belonging to different scop superfamilies (see schematic in Figure 1). We also selected a subset of these proteins that have almost their entire length covered by matches with structural domains (allowing again a maximum of 70 uncovered residues). This selection resulted in 2829 proteins being selected from SWISS-PROT. (In all cases, duplicate matches were removed, i.e., a protein at a certain location matches only one structural domain.)

We set out to compare these two sets of proteins for functional divergence. As previously, we divided functions into enzyme and non-enzyme (Hegyi and Gerstein 1999). Enzymatic functions were classified by the EC system (Bairoch 2000). Comparisons of enzymatic functions were treated the same way as in our earlier analyses, that is, if they differ in the first three components of their respective EC numbers, they were considered different. This implied that our analysis dealt with a total of 112 enzymatic functions. Non-enzymatic functions were classified into 508 different categories based on a simple thesaurus we assembled of synonymous keywords drawn from SWISS-PROT description lines. In addition, we created 49 categories for functions that have an enzymatic component but which are not part of the EC system. This gave us a total of 669 functions (112 + 508 + 49). (The list of all the functional categories is described further in Table 2 below, and also can be found on the Web at http://bioinfo.mbb.yale.edu/partslist/func or http://partslist.org/func.)
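The rule for comparing enzymatic functions (two enzymes count as different if their EC numbers differ anywhere in the first three components) can be illustrated with a short Python sketch. The function names `ec_prefix` and `same_enzymatic_function` are hypothetical, and the handling of an optional "EC" prefix is an assumption made for convenience.

```python
from typing import Tuple


def ec_prefix(ec_number: str, depth: int = 3) -> Tuple[str, ...]:
    """Return the first `depth` components of an EC number such as '1.1.1.27'."""
    text = ec_number.strip()
    # Tolerate an optional leading 'EC' label (an assumption, not from the text).
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    return tuple(text.split(".")[:depth])


def same_enzymatic_function(ec_a: str, ec_b: str) -> bool:
    """Two enzymes share a function if their first three EC components agree."""
    return ec_prefix(ec_a) == ec_prefix(ec_b)
```

Under this rule `1.1.1.1` and `1.1.1.27` fall into the same functional category, while `1.1.1.1` and `1.1.2.1` do not. Incomplete EC numbers (for instance ones containing `-`) are compared literally in this sketch.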

Overall Distribution of the Matches

Figure 2 shows the most commonly observed multi-domain combinations in a set of recently sequenced genomes. The occurrences of further combinations are available from the Web site. Clearly, the distribution is very skewed, with certain combinations, such as 3.29-2.32 and 2.29-4.61, tending to predominate.

Figure 2 Distribution of multi-domain combinations amongst the genomes. The figure shows the occurrence of multi-domain fold combinations in a number of genomes, indicating its great variability. Each row indicates a particular combination of scop fold pairs (using scop 1.39), where a fold pair is defined as two distinct folds occurring in tandem in a protein. Each column represents a different genome, using the four-letter codes in the PartsList system (Qian et al. 2001): Aaeo, Aquifex aeolicus; Aful, Archaeoglobus fulgidus; Bbur, Borrelia burgdorferi; Bsub, Bacillus subtilis; Cele, Caenorhabditis elegans; Cpne, Chlamydia pneumoniae; Ctra, Chlamydia trachomatis; Ecol, Escherichia coli; Hinf, Haemophilus influenzae Rd; Hpyl, Helicobacter pylori; Mthe, Methanobacterium thermoautotrophicum; Mjan, Methanococcus jannaschii; Mtub, Mycobacterium tuberculosis; Mgen, Mycoplasma genitalium; Mpne, Mycoplasma pneumoniae; Phor, Pyrococcus horikoshii; Rpro, Rickettsia prowazekii; Scer, Saccharomyces cerevisiae; Syne, Synechocystis sp.; Tpal, Treponema pallidum. The numbers in each intersection cell indicate the number of times the fold pairs occur in a genome. Only the 20 most common fold pair combinations are shown here; the remainder are shown on the Web site (http://partslist.org/func). If a cell is greater than 6, it is shaded black; between 3 and 6, gray; and below 3, white. The blank spaces show instances in which one of the pairs does not occur in the organism at all (indicated by a value of -1 in the data table on the Web site). The fold assignments are done in a fashion consistent with those in PartsList and associated systems (Gerstein 1997; Lin et al. 2000; Drawid et al. 2001; Harrison et al. 2001; Qian et al. 2001).

Figure 3 shows the overall distribution of the single-domain and multi-domain matches in the different structural classes. The distribution of matches between enzymes and non-enzymes in multi-domain proteins largely agrees with that in the single-domain proteins. The multi-domain matches follow the overall tendency of the alpha/beta folds to be associated with enzymes to a larger extent and the all-alpha and small folds with non-enzymes. However, the values for the multi-domain matches are generally less extreme than for single-domains; for example, the 10-fold difference between single-domain alpha/beta enzymes and non-enzymes decreases to about twofold in multi-domain proteins. Another significant difference is the reduction in the number of multi-domain non-enzymes in the all-beta and alpha + beta structural classes compared to the single-domain matches. Altogether, there are more enzymes than non-enzymes among the multi-domain proteins (2805 enzymes vs. 1958 non-enzymes) whereas for single-domain proteins, the opposite is true (850 enzymes vs. 968 non-enzymes).

Figure 3 Distribution of proteins amongst broad structural and functional classes; the distribution of the matches among the seven structural and two functional classes in single- and multi-domain proteins. The single-domain and multi-domain matches each total 100%, independently of each other. The horizontal axis indicates the seven scop classes, which are (from 1 to 7): all-alpha, all-beta, alpha/beta, alpha + beta, multi-domain, membrane, and small protein.

Table 1 summarizes the distribution of superfamilies and superfamily combinations among the major functional classes, i.e., whether they have only enzymatic, only non-enzymatic, or both enzymatic and non-enzymatic functionality. Altogether, 215 superfamilies were found in single-domain proteins and 310 in multi-domain ones. As 70 superfamilies were found in both, altogether 455 distinct structural superfamilies matched a SWISS-PROT protein with our required coverage criteria (described above). Similarly, we apportioned the 281 superfamily combinations observed in multi-domain proteins amongst different broad functional categories.

Table 1. Functional Distribution of Single-domain, Multi-domain Superfamilies, and Multi-domain Combinations

In single-domain proteins there are about as many superfamilies with exclusively enzymatic functionality as there are those with exclusively non-enzymatic functions (82 vs. 78). In contrast, in multi-domain proteins this ratio increases to almost threefold (135 vs. 56). This agrees with the notion that most enzymes are multi-domain. Another difference between single- and multi-domain proteins appears in the ratio of superfamilies with a single function compared to multifunctional ones. As is apparent from Table 1, about a quarter of the superfamilies matched single-domain proteins with different functions (55 of 215), whereas in the multi-domain proteins, this ratio increased to more than a third (119 of 310).

Single-Domain Proteins

Table 2 lists the two functionally most diverse structural superfamilies in single-domain proteins with some representative functions. The most diverse superfamily, the 3.38.1 Thioredoxin-like, has 11 different functions associated with it, most of them with an oxidoreductase mechanism. For instance, THIO_BPT4 is a small disulphide-containing thioredoxin that serves as a general disulphide oxidoreductase, while TDX2_BRUMA is almost twice as long (199 aa) and serves as a thiol-specific antioxidant that acts against sulfur-containing radicals. Another interesting example of functional diversity is provided by the Scorpion toxin-like superfamily (7.3.6). While BRAZ_PENBA is a small protein that is known to be 2000 times sweeter than sucrose, the other members of the superfamily are associated with different host-defense mechanisms. In insects the superfamily possesses antifungal activity (DMYC_DROME) or acts as a toxin (SCX5_BUTEU). Interestingly, in plants it can also act as an antifungal (AF2B_SINAL) or as an inhibitor of insect alpha-amylases (SIA1_SORBI). It appears that many single-domain proteins are toxins or allergens, or are related in other ways to a host-defense response.

Table 2. Most Versatile Single-Domain Superfamilies

Based on the data we can also determine the probability that two single-domain proteins that match domains in the same superfamily category also carry out the same function. Using Bayes' theorem:

$$P(F \mid S) = \frac{P(F)\,P(S \mid F)}{P(F)\,P(S \mid F) + P(\sim F)\,P(S \mid \sim F)} \qquad (1)$$

where S is the event that two proteins share the same superfamily, F is the event that two proteins have the same function, and ∼F is the event that two proteins do not have the same function. Rearranging and simplifying the equation we get:

$$P(F \mid S) = \frac{1}{1 + N(S, \sim F)/N(S, F)} \qquad (2)$$

where N is the number of times that the two events in the parentheses occur together in our database of 1818 single-domain proteins. This results in

$$P(F \mid S) = \frac{1}{1 + 8501/12516} = 68\%.$$

That is, the probability that two single-domain proteins that have the same superfamily structure have the same function(whether enzymatic or not) is about 2/3.
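The step from equation (1) to equation (2) is not spelled out in the text. One way to fill it in, assuming that joint probabilities are estimated by relative pair counts (the symbol N_tot, the total number of protein pairs considered, is introduced here only for illustration), is:

```latex
\begin{align*}
P(F \mid S)
  &= \frac{P(F)\,P(S \mid F)}{P(F)\,P(S \mid F) + P(\sim F)\,P(S \mid \sim F)} \\
  &= \frac{P(S, F)}{P(S, F) + P(S, \sim F)} \\
  &\approx \frac{N(S, F)/N_{\mathrm{tot}}}
               {N(S, F)/N_{\mathrm{tot}} + N(S, \sim F)/N_{\mathrm{tot}}} \\
  &= \frac{1}{1 + N(S, \sim F)/N(S, F)}
\end{align*}
```

The second line uses the product rule, P(F) P(S | F) = P(S, F), and the last line divides numerator and denominator by N(S, F), so the total number of pairs cancels and only the two co-occurrence counts are needed.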

Multi-Domain Proteins

Table 3 lists the combinations of superfamilies that have been associated with the greatest number of different functions in multi-domain proteins, with representative entries in SWISS-PROT. The combination with the greatest number of different functions is that of 1.95.1 and 7.33.1. Although it has twice as many different functions as the most diverse superfamily in the single-domain proteins (22 vs. 11, respectively), careful examination reveals that all the proteins in this category are DNA-binding and most of them act as hormone receptors.

Table 3. Most Versatile Superfamily Combinations in Multi-Domain Proteins

The second entry listed in the table is the combination of the 3.4.1 and 4.48.1 superfamilies associated with the FAD/NAD(P)-linked reductases. It is an all-enzymatic combination and always carries out an oxidoreductase function. All the proteins in this category are completely covered by matches with these two superfamilies. The 1.78.1-2.1.1 hemocyanin-immunoglobulin combination seems also to be fairly conserved; although the proteins in this category are called by eight different names, most of them turn out to be extracellular larval storage proteins, except for the copper-containing oxygen carrier hemocyanin itself (HCY_PALVU).

Following the same logic, we can also determine the probability that two proteins that have the same superfamily combination share the same function, viz:

$$P(F \mid S) = \frac{1}{1 + 32242/134230} = 81\%$$

This means that we have significantly greater certainty in determining the function of a multi-domain protein with a particular superfamily combination than that of a single-domain protein containing a particular superfamily. We also determined a similar probability for those proteins that have an almost complete coverage with exactly the same type and number of superfamilies, following each other in the same order. The probability that the functions are the same in this case was 91%, a considerably higher value than above. However, if two multi-domain proteins share only a single superfamily, the probability that they share the same function drops to only 35%! This greater functional certainty from sharing a combination of superfamilies rather than just one is also reflected in Table 1. While one-fourth of the single-domain proteins and one-third of singularly matching superfamilies in multi-domain proteins have multiple functions, only about one-fifth of the multi-domain combinations possess multiple functions (60 of 281). It is also clear from the data that domains in larger proteins often lose their original function and no longer have an autonomous function.
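The probability for proteins sharing a superfamily combination follows directly from the two pair counts quoted above. A minimal Python sketch of that arithmetic is given here; the function name `prob_same_function` is hypothetical, and the counts are the ones printed in the text for multi-domain proteins sharing the same superfamily combination.

```python
def prob_same_function(n_same_function: int, n_diff_function: int) -> float:
    """P(F | S) = 1 / (1 + N(S, ~F) / N(S, F)) for pairs sharing a superfamily.

    n_same_function is N(S, F): pairs sharing structure and function.
    n_diff_function is N(S, ~F): pairs sharing structure but not function.
    """
    if n_same_function <= 0:
        raise ValueError("N(S, F) must be positive for the ratio to be defined")
    return 1.0 / (1.0 + n_diff_function / n_same_function)


# Counts quoted in the text for the same superfamily combination.
p_combination = prob_same_function(n_same_function=134230, n_diff_function=32242)
print(f"{p_combination:.0%}")  # prints "81%"
```

Applied to the single-domain counts as printed earlier (8501 and 12516), the same function returns roughly 0.60 rather than the quoted 68%, so those printed counts may not be the exact ones behind the quoted figure.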

Seventy Common Superfamilies and Their Functions Compared in Single-Domain and Multi-Domain Proteins

As mentioned above, of the 455 superfamilies in our analysis, only 70 occur in both single- and multi-domain proteins. Even more surprising is the small number of structural superfamilies (14) that have the same function in both single- and multi-domain proteins. These are listed in Table 4; 12 of them have enzymatic function, supporting the notion that enzymes are more conserved during evolution than non-enzymes. The two non-enzymatic superfamilies are the 4.29.1 ribosomal superfamily and the 5.4.1 superfamily in penicillin-binding proteins.

Table 4. Superfamilies With the Same Function in Single- and Multi-Domain Proteins as Determined from Their Keyword Combination or First Three Components of Their EC Numbers

Table 5 presents several examples of the converse situation, shared superfamilies that have different functions in single- and multi-domain proteins. Comparing parts A and B of the table highlights the fact that although both superfamilies in a multi-domain protein are often present in single-domain form as well, the functions in the different settings are only vaguely related. One example is the combination of the lipocalin superfamily (2.45.1) with that of the BPTI-like or Kunitz inhibitor (7.7.1), which in higher organisms forms a complex protein called alpha-1-microglobulin (AMBP_RAT). Another interesting example is the combination of the 2.5.1 Cupredoxin (occurring in the single-domain blue-copper protein, SOXE_SULAC) and the 6.5.1 Membrane all-alpha (single-domain representative: BACT_HALVA, a sensory rhodopsin) superfamilies into a component of the respiratory chain, cytochrome C oxidase II (COOX_ZOOAN). All these examples demonstrate the evolutionary advantage of a domain fusion event, which creates a function that is more complex than either of the components.

Table 5. Examples of Superfamilies Present in Both Single- and Multi-Domain Proteins, Carrying out Different Functions

Multifunctionality vs. Sequence Similarity

Previously, we presented a variety of graphs that show how the probability that two domains would share the same function varied with respect to sequence similarity (Hegyi and Gerstein 1999; Wilson et al. 2000). Figure 4 shows a similar graph with the calculations extended to multi-domain proteins. The figure shows that the functional divergence of a single domain in multi-domain proteins dramatically increases, more than twofold, compared to the single-domain ones. This reinforces our findings above, based only on superfamily content, that the certainty with which we can predict the function of a protein based on its sequence similarity with a domain in another multi-domain protein is considerably less than for a comparable single-domain situation.
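The quantity plotted against sequence similarity can be read as the fraction of matching domains associated with more than one function at a given e-value threshold. The Python sketch below illustrates that reading on toy data; the record layout, the function name `multifunction_fraction`, and the example values are all hypothetical, and the authors' exact procedure may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical record: (domain identifier, function label, BLAST e-value).
Match = Tuple[str, str, float]


def multifunction_fraction(matches: Iterable[Match], exponent: int) -> float:
    """Fraction of matched domains with more than one function at e <= 10^-exponent."""
    cutoff = 10.0 ** (-exponent)
    functions_by_domain: Dict[str, Set[str]] = defaultdict(set)
    for domain_id, function_label, e_value in matches:
        if e_value <= cutoff:
            functions_by_domain[domain_id].add(function_label)
    if not functions_by_domain:
        return 0.0  # no domain has a match at this threshold
    n_multi = sum(1 for funcs in functions_by_domain.values() if len(funcs) > 1)
    return n_multi / len(functions_by_domain)


# Toy data, invented purely for illustration.
toy_matches = [
    ("d1", "kinase", 1e-30),
    ("d1", "phosphatase", 1e-6),
    ("d2", "toxin", 1e-12),
    ("d2", "toxin", 1e-5),
]
loose = multifunction_fraction(toy_matches, exponent=4)    # 0.5
strict = multifunction_fraction(toy_matches, exponent=10)  # 0.0
```

In the toy data, tightening the threshold removes the weaker match of `d1`, so the fraction of multifunctional domains drops, which mirrors the general trend that divergence decreases as the required sequence similarity increases.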

	DISCUSSION

Here we built on our previous studies on the relationship between protein structure and function to develop new results related to multi-domain proteins. Throughout the paper, we focused on superfamilies instead of folds, as the members of a superfamily are presumably of common evolutionary origin (Murzin et al. 1995).

We found that the 4763 multi-domain and 1818 single-domain proteins that met our selection criteria have about the same distribution of structural classes, with more enzymatic functions associated with the alpha/beta structural classes and more non-enzymatic ones with the all-alpha and small classes. We identified more than three times as many multi-domain proteins that were enzymes than single-domain ones (2805 and 850, respectively) and, conversely, about twice as many multi-domain proteins as single-domain ones that were non-enzymes (1958 vs. 968).

We focused on the functional divergence of the two groups and found that about a quarter of the superfamilies in single-domain proteins are associated with multiple functions, whereas only about a fifth of the multi-domain superfamily combinations are. Therefore, we can conclude that a combination of specific superfamilies results in a more specific functional assignment for a particular protein. However, about one-third of the superfamilies in the multi-domain proteins were associated with multiple functions, underlining the lesser autonomy of a domain function in multi-domain proteins.

This latter finding was also supported by the difference in functional divergences between the two groups of proteins based on particular sequence similarities between the domains and SWISS-PROT proteins. As is shown in Figure 4, the average functional divergence of a single domain is much larger (more than twofold) in multi-domain proteins than in single-domain ones.

Figure 4 Divergence in function with respect to sequence similarity. Relative number of matching domains with multiple functions, as a function of the e-value threshold. Diamonds represent single-domain proteins, squares multi-domain ones (matching just for a single domain). The first value on the X-axis starts at 4 (corresponding to an e-value = 10⁻⁴).

We also found that only 70 of a total of 455 superfamilies are shared between the multi-domain and single-domain proteins and only a small fraction (14) share their functions. This was rather surprising to us, and should be taken into consideration in functional characterization and annotation of new gene products. When the functions were related in single- and multi-domain proteins, we could observe an increasing functional complexity with the appearance of large multi-domain proteins.

Altogether, with the recent sequencing of the human genome and the genomes of other model organisms, we hope that this work can contribute to the successful annotation of the individual gene products, and will help to avoid some pitfalls associated with the functional characterization of large, complex proteins.

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.

	FOOTNOTES

¹ Corresponding author.

E-MAIL Mark.Gerstein@yale.edu

Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.183801.

	REFERENCES

Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.
Bairoch, A. 2000. The ENZYME database in 2000. Nucleic Acids Res. 28: 304-305.
Bairoch, A. and Apweiler, R. 2000. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28: 45-48.
Chothia, C. and Lesk, A. M. 1986. The relation between the divergence of sequence and structure in proteins. EMBO J. 5: 823-826.
Devos, D. and Valencia, A. 2000. Practical limits of function prediction. Proteins 41: 98-107.
Drawid, A. and Gerstein, M. 2000. A Bayesian system integrating expression data with sequence patterns for localizing proteins: Comprehensive application to the yeast genome. J. Mol. Biol. 301: 1059-1075.
Eisenstein, E., Gilliland, G. L., Herzberg, O., Moult, J., Orban, J., Poljak, R. J., Banerjei, L., Richardson, D., and Howard, A. J. 2000. Biological function made crystal clear - annotation of hypothetical proteins via structural genomics. Curr. Opin. Biotechnol. 11: 25-30.
Gelbart, W. M., Crosby, M., Matthews, B., Rindone, W. P., Chillemi, J., Russo Twombly, S., Emmert, D., Ashburner, M., Drysdale, R. A. 1997. FlyBase: A Drosophila database. The FlyBase consortium. Nucleic Acids Res. 25: 63-66.
Gerstein, M. 1997. A structural census of genomes: Comparing bacterial, eukaryotic, and archaeal genomes in terms of protein structure. J. Mol. Biol. 274: 562-576.
Gerstein, M. 1998. How representative are the known structures of the proteins in a complete genome? A comprehensive structural census. Fold Des. 3: 497-512.
Harrison, P., Echols, N., and Gerstein, M. 2001. Digging for dead genes: An analysis of the characteristics of the pseudogene population in the C. elegans genome. Nucleic Acids Res. 29: 818-830.
Hegyi, H. and Gerstein, M. 1999. The relationship between protein structure and function: A comprehensive survey with application to the yeast genome. J. Mol. Biol. 288: 147-164.
Lin, J. and Gerstein, M. 2000. Whole-genome trees based on the occurrence of folds and orthologs: Implications for comparing genomes on different levels. Genome Res. 10: 808-818.
Martin, A. C., Orengo, C. A., Hutchinson, E. G., Jones, S., Karmirantzou, M., Laskowski, R. A., Mitchell, J. B., Taroni, C., and Thornton, J. M. 1998. Protein folds and functions. Structure 6: 875-884.
Murzin, A., Brenner, S. E., Hubbard, T., and Chothia, C. 1995. SCOP: A structural classification of proteins for the investigation of sequences and structures. J. Mol. Biol. 247: 536-540.
Orengo, C. A., Pearl, F. M., Bray, J. E., Todd, A. E., Martin, A. C., Lo Conte, L., and Thornton, J. M. 1999. The CATH Database provides insights into protein structure/function relationships. Nucleic Acids Res. 27: 275-279.
Pawlowski, K., Jaroszewski, L., Rychlewski, L., and Godzik, A. 2000. Sensitive sequence comparison as protein function predictor. Pac. Symp. Biocomput. 42-53.
Pearson, W. R. 1994. Using the FASTA program to search protein and DNA sequence databases. Methods Mol. Biol. 25: 365-389.
Qian, J., Stenger, B., Wilson, C., Lin, J., Jansen, R., Krebs, W., Alexandrov, V., Echols, N., Teichmann, S., Park, J. 2001. PartsList: A web-based system for dynamically ranking protein folds based on disparate attributes, including whole-genome expression and interaction information. Nucleic Acids Res. 29: 1750-1764.
Russell, R. B., Saqi, M. A., Sayle, R. A., Bates, P. A., and Sternberg, M. J. 1997. Recognition of analogous and homologous protein folds: Analysis of sequence and structure conservation. J. Mol. Biol. 269: 423-439.
Shah, I. and Hunter, L. 1997. Predicting enzyme function from sequence: A systematic appraisal. Proc. Int. Conf. Intell. Syst. Mol. Biol. 5: 276-283.
Shapiro, L. and Harris, T. 2000. Finding function through structural genomics. Curr. Opin. Biotechnol. 11: 31-35.
Stawiski, E. W., Baucom, A. E., Lohr, S. C., and Gregoret, L. M. 2000. Predicting protein function from structure: Unique structural features of proteases. Proc. Natl. Acad. Sci. 97: 3954-3958.
Thornton, J. M., Orengo, C. A., Todd, A. E., and Pearl, F. M. 1999. Protein folds, functions and evolution. J. Mol. Biol. 293: 333-342.
Thornton, J. M., Todd, A. E., Milburn, D., Borkakoti, N., and Orengo, C. A. 2000. From structure to function: Approaches and limitations. Nat. Struct. Biol. 7 Suppl: 991-994.
Todd, A. E., Orengo, C. A., and Thornton, J. M. 2001. Evolution of function in protein superfamilies, from a structural perspective. J. Mol. Biol. 307: 1113-1143.
Wilson, C. A., Kreychman, J., and Gerstein, M. 2000. Assessing annotation transfer for genomics: Quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J. Mol. Biol. 297: 233-249.
Zhang, B., Rychlewski, L., Pawlowski, K., Fetrow, J. S., Skolnick, J., and Godzik, A. 1999. From fold predictions to function predictions: Automation of functional site conservation analysis for functional genome predictions. Protein Sci. 8: 1104-1115.

Received February 7, 2001; accepted in revised form June 19, 2001.
