The Relationship between 
Protein Structure and Function: 
a Comprehensive Survey 
with Application to the Yeast Genome

Hedi Hegyi

&

Mark Gerstein

Department of Molecular Biophysics & Biochemistry
266 Whitney Avenue, Yale University
PO Box 208114, New Haven, CT 06520
(203) 432-6105, FAX (203) 432-5175
Mark.Gerstein@yale.edu


(Version ff225rev sent to the Journal of Molecular Biology)


ABSTRACT

For most proteins in the genome databases, function is predicted via sequence comparison. In spite 
of the popularity of this approach, the extent to which it can be reliably applied is unknown. We 
address this issue by systematically investigating the relationship between protein function and 
structure. We focus initially on enzymes classified by the Enzyme Commission (EC) and relate 
these to structurally classified proteins in the SCOP database. We find that the major SCOP fold 
classes have different propensities to carry out certain broad categories of functions. For instance, 
alpha/beta folds are disproportionately associated with enzymes, especially transferases and 
hydrolases, and all-alpha and small folds with non-enzymes, while alpha+beta folds have an equal 
tendency either way. These observations for the database overall are largely true for specific 
genomes. We focus, in particular, on yeast, analyzing it with many classifications in addition to 
SCOP and EC (i.e. COGs, CATH, MIPS), and find clear tendencies for fold-function association, 
across a broad spectrum of functions. Analysis with the COGs scheme also suggests that the 
functions of the most ancient proteins are more evenly distributed among different structural classes 
than those of more modern ones. For the database overall, we identify both most versatile functions, 
i.e. those that are associated with the most folds, and most versatile folds, associated with the most 
functions. The two most versatile enzymatic functions (hydro-lyases and O-glycosyl glucosidases) 
are associated with 7 folds each. The five most versatile folds (TIM-barrel, Rossmann, ferredoxin, 
alpha-beta hydrolase, and P-loop NTP hydrolase) are all mixed alpha-beta structures. They stand 
out as generic scaffolds, accommodating from 6 to as many as 16 functions (for the exceptional 
TIM-barrel). At the conclusion of our analysis we are able to construct a graph giving the chance 
that a functional annotation can be reliably transferred at different degrees of sequence and 
structural similarity.  Supplemental information is available from 
http://bioinfo.mbb.yale.edu/genome/foldfunc. 

INTRODUCTION 

The Problem of Determining Function from Sequence

An ultimate goal of genome analysis is to determine the biological function of all the gene products 
in a genome. However, the function of only a minor fraction of proteins has been studied 
experimentally, and, typically, prediction of function is based on sequence similarity with proteins 
of known function. That is, functional annotation is transferred based on similarity. Unfortunately, 
the relationship between sequence similarity and functional similarity is not straightforward. This 
has been commented on in numerous reviews (Bork & Koonin, 1998; Karp, 1998). Karp (1998), in 
particular, has noted that transferring incorrect functional information threatens to progressively 
corrupt genome databases, as incorrect annotations accumulate and are in turn used as the basis 
for further annotations.

It is known that sequence similarity does confer structural similarity. Moreover, there is a well-
established quantitative relationship between the extent of similarity in sequence and that in 
structure. First investigated by Chothia & Lesk, the similarity between the structures of two proteins 
(in terms of RMS) appears to be a monotonic function of their sequence similarity (Chothia & Lesk, 
1986). This fact is often exploited when two sequences are declared related, based on a database 
search by programs such as BLAST or FastA (Altschul et al., 1997; Pearson, 1996). Often, the only 
common element in two distantly related protein sequences is their underlying structures, or folds. 

Transitivity requires that the well-established relationship between sequence and structure and the 
more indefinite one between sequence and function imply an indefinite relationship between 
structure and function. Several recent papers have highlighted this, analyzing individual protein 
superfamilies with a single fold but diverse functions. Examples include the aldo-keto reductases, a 
large hydrolase superfamily, and the thiol protein esterases. The latter include the eye-lens and 
corneal crystallins, a remarkable example of functional divergence (Bork & Eisenberg, 1998; Bork 
et al., 1994; Cooper et al., 1993; Koonin & Tatusov, 1994; Seery et al., 1998).

There are also many classic examples of the converse: the same function achieved by proteins with 
completely different folds. For instance, even though mammalian chymotrypsin and bacterial 
subtilisin have different folds, they both function as serine proteases and have the same Ser-Asp-
His catalytic triad. Other examples include sugar kinases, anti-freeze glycoproteins, and lysyl-tRNA 
synthetases (Bork et al., 1993; Chen et al., 1997; Doolittle, 1994; Ibba et al., 1997a; Ibba et al., 
1997b).

Figure 1 shows well-known examples of each of these two basic situations: same fold but different 
function (divergent evolution) and same function but different fold (convergent evolution).

Protein Classification Systems

The rapid growth in the number of protein sequences and 3D structures has made it practical and 
advantageous to classify proteins into families and more elaborate hierarchical systems. Proteins are 
grouped together on the basis of structural similarities in the FSSP (Holm & Sander, 1998), CATH 
(Orengo et al., 1997), and SCOP databases (Murzin et al., 1995). SCOP is based on the judgments 
of a human expert; FSSP, on automatic methods; and CATH, on a mixture of both. Other databases 
collect proteins on the basis of sequence similarities to one another -- e.g. PROSITE, SBASE, 
Pfam, BLOCKS, PRINTS and ProDom (Attwood et al., 1998; Bairoch et al., 1997; Corpet et al., 
1998; Fabian et al., 1997; Henikoff et al., 1998; Sonnhammer et al., 1997). Several collections 
contain information about proteins from a functional point of view. Some of these focus on 
particular organisms - e.g. the MIPS functional catalogue and YPD for yeast (Mewes et al., 1997; 
Hodges et al., 1998) and EcoCyc and GenProtEC for E. coli (Karp et al., 1998; Riley, 1997).  
Others focus on particular functional aspects in multiple organisms - e.g. the WIT and KEGG 
databases which focus on metabolism and pathways (Selkov et al., 1997; Ogata et al., 1999), the 
ENZYME database which focuses obviously enough on enzymes (Bairoch, 1996), and the COGs 
system which focuses on proteins conserved over phylogenetically distinct species (Tatusov et al., 
1997). The ENZYME database, in particular, contains all the enzyme reactions that have an “EC 
number” assigned in accordance with the International Nomenclature Committee and is cross-
referenced with Swissprot (Bairoch, 1996; Bairoch & Apweiler, 1998; Barrett, 1997). 

Our approach: Systematic Comparison of Proteins Classified by Structure with those 
Classified by Function

One of the most valuable operations one can perform on these individual classification systems is to 
cross-reference and cross-tabulate them, seeing how they overlap. We perform such an analysis here 
by systematically interrelating the SCOP, Swissprot and ENZYME databases (Bairoch, 1996; 
Bairoch & Apweiler, 1998; Murzin et al., 1995). For yeast we also have used the MIPS yeast 
functional catalogue, CATH, and COGs in our analysis. This enables us to investigate the 
relationship between protein function and structure in a comprehensive statistical fashion. In 
particular, we investigated the functional aspects of both divergent and convergent evolution, 
exploring cases where a structure gains a dramatically different biochemical function and finding 
instances of similar enzymatic functions performed by unrelated structures. 

We concentrated on single-domain Swissprot proteins with significant sequence similarity to one of 
the SCOP structural domains. Since most of these proteins have a single assigned function, 
comparing them to individual structural domains, which can have only one assigned fold, allowed 
us to establish a one-to-one relationship between structure and function.

Recent Related Work

This work follows up on several recent papers on the relationship between protein structure and 
function. In particular, Martin et al. studied the relationship between enzyme function and the 
CATH fold classification (Martin et al., 1998). They concluded that functional class (expressed by 
top-level EC numbers) is not related to fold, since a few specific residues, not the whole fold, 
determine enzyme function. Russell also focused on specific sidechain patterns, arguing that these 
could be used to predict protein function (Russell, 1998). In a similar fashion, Russell et al. 
identified structurally similar “supersites” in superfolds (Russell et al., 1998). They estimated that 
the proportion of homologues with different binding sites -- and therefore with different functions -- 
is around 10%. In a novel approach, using machine learning techniques, des Jardins et al. predict 
purely from the sequence whether a given protein is an enzyme and also the enzyme class to which 
it belongs (des Jardins et al., 1997).

Our work is also motivated by recent work looking at whether or not organisms are characterized by 
unique protein folds (Frishman & Mewes, 1997; Gerstein, 1997; Gerstein & Hegyi, 1998; Gerstein 
& Levitt, 1997; Gerstein, 1998a,b). If function is closely associated with fold (in a one-to-one 
sense), one would think that when a new function arose in evolution, nature would have to invent a 
new fold. Conversely, if fold and function are only weakly coupled, one would expect to see a more 
uniform distribution of folds amongst organisms and a high incidence of convergent evolution. In 
fact, a recent paper on microbial genome analysis claims that functional convergence is quite 
common (Koonin & Galperin, 1997). Another related paper systematically searched Swissprot for 
all such cases of what is termed “analogous” enzymes (Galperin et al., 1998). 

Our work is also motivated by the recent work on protein design and engineering, which aims to 
rationally change a protein function -- for instance, to engineer a reporter function into a binding 
protein (Hellinga, 1997; Hellinga, 1998; Marvin et al., 1997).

RESULTS 

Overview of the 8937 Single-domain Matches

Our basic results were based on simple sequence comparisons between Swissprot and SCOP, the 
SCOP domain sequences being used as queries against Swissprot. We focused on 'mono-functional' 
single-domain matches in Swissprot, i.e. those single-domain proteins with only one annotated 
function. The detailed criteria used in the database searches are summarized in the Methods. 

Overall, a little more than a quarter of the proteins in Swissprot are enzymes, a similar fraction are 
of known structure, and about an eighth are both. (More precisely, of the 69113 analyzed proteins in 
Swissprot, 19995 are enzymes, 18317 are structural homologues, and 8205 are both.) About half of 
the fraction of Swissprot that matched known structures were “single domain” and about a third of 
these were enzymes (8937 and 3359, respectively, of 18317). We focus on these 8937 single-
domain matches here. Notice how these numbers also show how the known structures are 
significantly biased towards enzymes: 45% (8205 out of 18317) of all the structural homologues are 
enzymes versus 29% (19995 out of 69113) for all of Swissprot.
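The proportions quoted in this paragraph follow directly from the counts given in the text. As a minimal sketch of the arithmetic (the variable and function names are ours, chosen only for illustration):

```python
# Counts quoted in the text above.
total_swissprot = 69113        # analyzed Swissprot proteins
enzymes = 19995                # proteins annotated as enzymes
structural_homologues = 18317  # proteins matching a known structure
both = 8205                    # enzymes that are also structural homologues
single_domain = 8937           # single-domain structural matches
single_domain_enzymes = 3359   # enzymes among the single-domain matches


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


enzyme_share_overall = percent(enzymes, total_swissprot)
enzyme_share_structural = percent(both, structural_homologues)
single_domain_share = percent(single_domain, structural_homologues)
single_domain_enzyme_share = percent(single_domain_enzymes, single_domain)

print("enzymes in all of Swissprot (%):", enzyme_share_overall)
print("enzymes among structural homologues (%):", enzyme_share_structural)
print("single-domain among structural homologues (%):", single_domain_share)
print("enzymes among single-domain matches (%):", single_domain_enzyme_share)
```

The gap between the first two percentages is the bias of the known structures towards enzymes noted above.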

331 Observed Fold-function Combinations

Figure 2 gives an overview of how the matches are distributed amongst specific functions and folds. 
The single-domain matches include 229 of the 361 folds in SCOP 1.35 and 91 of the 207 3-
component enzyme categories in the ENZYME database (Bairoch, 1996). Each match combines a 
SCOP fold number on the structural side (columns in Figure 2) and a 3-component EC category on 
the functional side (rows), with all the non-enzymatic functions grouped together into a single 
category with the artificial “EC number” of 0.0.0 (shown in the first row in Figure 2). This results 
in a table where each cell represents a potential fold-function combination. The table contains a 
maximum of 21068 (=229 x 92) possible fold-function combinations (and a minimum of 229 
combinations, assuming only one function for every fold). We actually observe 331 of these 
combinations (1.6%, shown by the filled-in cells). 
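The size of this table can be checked with a few lines of Python. The sketch below also shows one way of reducing a full EC number to the 3-component category used for the rows; the helper `three_component_ec` is a hypothetical name introduced here, not part of any database tool.

```python
from typing import Optional

NON_ENZYME = "0.0.0"  # artificial "EC number" used in the text for non-enzymes


def three_component_ec(ec_number: Optional[str]) -> str:
    """Truncate an EC number such as '4.2.1.1' to its first three components.

    Proteins without an EC number are mapped to the artificial 0.0.0 category.
    """
    if ec_number is None:
        return NON_ENZYME
    return ".".join(ec_number.split(".")[:3])


n_folds = 229                 # SCOP folds with single-domain matches
n_enzyme_categories = 91      # 3-component EC categories with matches
n_functions = n_enzyme_categories + 1  # plus the single non-enzyme category

max_combinations = n_folds * n_functions  # every fold with every function
min_combinations = n_folds                # exactly one function per fold
observed_combinations = 331

occupancy_percent = round(100.0 * observed_combinations / max_combinations, 1)

print(three_component_ec("4.2.1.1"), max_combinations, occupancy_percent)
```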

Overall, more than half of the functions are associated with at least two different folds, while less 
than half of the folds with enzymatic activity have at least two functions (51 out of 91 and 53 out of 
128, respectively).

Summarizing the Fold-function Combinations by 42 Broad Structure-function Classes

As listed in Table 1, folds can be subdivided in 6 broad fold classes (e.g. all-alpha, all-beta, 
alpha/beta, etc.). Likewise, functions can be broken into 7 main classes -- non-enzymes plus six 
enzyme classes, e.g. oxidoreductase, transferase, etc. This gives rise to 42 (6x7) structure-function 
classes. The way the 21068 potential fold-function combinations are apportioned amongst the 42 
classes is shown in Table 2A.

Table 2B shows the way the 331 observed combinations were actually distributed amongst the 42 
classes. Comparing the number of possible combinations with that observed shows that the most 
densely populated region of the chart is the transferase, hydrolase and lyase functions in 
combination with the alpha/beta fold class. This notion is in accordance with the general view that 
the most ‘popular’ structures among enzymes fall into the alpha/beta class. In contrast, matches 
between small folds and enzymes are almost completely missing, except for five folds in the 
oxidoreductase category. There are also no all-alpha ligases and only one all-alpha isomerase. 

Tables 2C and 2D break down the 331 fold-function combinations in Table 2B into either just a 
number of folds or just a number of functions. That is, Table 2C lists the number of different folds 
associated with each of the 42 structure-function classes (corresponding to the non-zero columns in 
the relevant class in Figure 2). Table 2D does the same thing for functions (non-zero rows in Figure 
2). Comparing these tables back to the total number of combinations (Table 2A) reveals some 
interesting findings, keeping in mind that more functions than folds reveals probable divergence 
and that more folds than functions reveals probable convergence. For instance, the alpha/beta and 
alpha+beta fold classes contain similar numbers of folds, but the alpha/beta class has relatively 
more functions, perhaps reflecting a greater divergence. (Specifically, the alpha/beta class has 73 
folds and 56 functions, while the alpha+beta class has 67 folds but only 35 functions.)

Table 2E shows the number of matching Swissprot sequences (from the total of 69113) for each of 
the 42 structure-function classes. The most highly populated categories are the all-alpha non-
enzymes, where 683 of the 1940 matches come from globins, and the all-beta non-enzymes, where 
361 of the 1159 Swissprot sequences have matches with the immunoglobulin fold. These numbers 
are, obviously, affected by the biases in Swissprot. On the other hand, if we compare the total 
matches in Table 2E with the total combinations in Table 2B it is clear that the numbers do not 
directly correlate. For instance, fewer hydrolases in Swissprot have matches with alpha/beta folds 
than with alpha+beta folds (295 vs. 452), but the number of different combinations in the first case 
is 30, as opposed to only 18 in the second case. This suggests that our approach of counting 
combinations may not be as affected by the biases in the databanks as simply counting matches. 
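The distinction between counting matches and counting combinations can be illustrated with a small sketch. The fold labels and counts below are invented purely for illustration and are not taken from our tables.

```python
from collections import Counter

# One (fold, function) pair per matching sequence. Hypothetical data:
# "fold_A" stands in for a heavily sequenced family such as the globins.
sequence_matches = (
    [("fold_A", "0.0.0")] * 5
    + [("fold_B", "3.2.1"), ("fold_C", "3.1.3"), ("fold_D", "3.5.1")]
)

match_counts = Counter(sequence_matches)

n_matches = sum(match_counts.values())  # sensitive to over-represented families
n_combinations = len(match_counts)      # each distinct pair is counted once

print(n_matches, n_combinations)
```

However many further sequences of the over-represented family are added to the databank, they contribute only one fold-function combination, which is why the combination counts are expected to be less sensitive to sampling bias.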

Tables 2F and 2G give some rough indication of the statistical significance of the differences in the 
observed distribution of combinations. In Table 2F, using chi-squared statistics, we calculate for 
each individual structure class the chance that we could get the observed distribution of 
fold-function combinations over various functional classes if fold were not related to function. Then in 
Table 2G, we reverse the roles of fold and function, and calculate the statistics for each functional 
class. 
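As a rough illustration of this kind of test, the sketch below computes a Pearson chi-squared goodness-of-fit statistic for one class against the proportions expected if fold and function were unrelated. This is one plausible reading of the calculation, not a record of the exact procedure; the counts and proportions are hypothetical.

```python
from typing import Sequence, Tuple


def chi_squared_gof(observed: Sequence[float],
                    expected_fractions: Sequence[float]) -> Tuple[float, int]:
    """Pearson chi-squared statistic and degrees of freedom.

    observed: counts in each category for one class.
    expected_fractions: null-hypothesis proportions (must be > 0, sum to 1).
    """
    if len(observed) != len(expected_fractions):
        raise ValueError("observed and expected_fractions differ in length")
    total = sum(observed)
    statistic = 0.0
    for count, fraction in zip(observed, expected_fractions):
        expected = total * fraction
        statistic += (count - expected) ** 2 / expected
    degrees_of_freedom = len(observed) - 1
    return statistic, degrees_of_freedom


# Hypothetical example: three functional categories for one fold class.
observed_counts = [10, 20, 30]
null_fractions = [0.25, 0.25, 0.5]

statistic, dof = chi_squared_gof(observed_counts, null_fractions)
print(statistic, dof)
```

Converting the statistic into a P-value requires the chi-squared distribution; with SciPy available, `scipy.stats.chi2.sf(statistic, dof)` is one way to obtain it.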

Enzyme versus Non-enzyme Folds

On the coarsest level, function can be divided amongst enzymes and non-enzymes. Of the 229 folds 
present in Figure 2, 93 are associated only with enzymes and 101 are associated only with non-
enzymes. The remaining 35 folds were associated with both enzymatic and non-enzymatic activity. 
Finally, of the 93 purely enzymatic folds, 18 have multiple enzymatic functions.

Figure 3A shows a graphical view of the distribution of the different fold classes among these 
broadest functional categories. The distribution is far from uniform. The all-alpha fold class has 30 
non-enzymatic representatives, but only 12 purely enzymatic folds and 4 folds with “mixed” (both 
types of) functions. This implies that a protein with an all-alpha fold has a priori roughly twice the 
chance of having a non-enzymatic function over an enzymatic one. The all-beta fold class has 6 
enzymatic, 17 non-enzymatic and 13 “mixed” folds. In the alpha/beta class, 34 folds are associated 
only with enzymes and 5 folds only with non-enzymes, whereas in the alpha+beta class this ratio is 
more balanced --- 28 'purely' enzymatic folds versus 22 purely non-enzymatic ones. 
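The three-way division of folds used here can be sketched as follows. The function name and the example folds are hypothetical; only the all-alpha counts at the end are taken from the text.

```python
from typing import Dict, Set

NON_ENZYME = "0.0.0"  # artificial category for non-enzymatic functions


def fold_category(functions: Set[str]) -> str:
    """Classify a fold by the set of 3-component functions matched to it."""
    if not functions:
        raise ValueError("a fold needs at least one associated function")
    has_non_enzyme = NON_ENZYME in functions
    has_enzyme = any(f != NON_ENZYME for f in functions)
    if has_enzyme and has_non_enzyme:
        return "mixed"
    return "enzymatic" if has_enzyme else "non-enzymatic"


# Hypothetical fold-to-function assignments.
functions_by_fold: Dict[str, Set[str]] = {
    "fold_A": {"0.0.0"},
    "fold_B": {"3.2.1", "4.2.1"},
    "fold_C": {"0.0.0", "1.1.1"},
}
categories = {fold: fold_category(f) for fold, f in functions_by_fold.items()}
print(categories)

# All-alpha class, counts from the text: purely non-enzymatic folds versus
# folds with any enzymatic activity (purely enzymatic plus mixed).
all_alpha_ratio = 30 / (12 + 4)
print(all_alpha_ratio)
```

The final ratio of just under 2 is the basis for the "roughly twice" statement above.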


Restricting the Comparison to Individual Genomes 

Figure 3A applies to all of Swissprot. Figures 3B and C show the functional distribution of folds 
taking into account the matches only in two specific genomes, yeast and E. coli. Only a fraction of 
each genome could be taken into consideration for various reasons (156 proteins in yeast, 244 
proteins in E. coli), mostly due to the great number of enzymes having multiple domains in both 
yeast and E. coli. Chi-squared tests show that the fold distribution in yeast does not differ 
significantly from that in Swissprot and that the one in E. coli differs only slightly (P<0.25 and 
P<0.02, respectively). The main difference between Swissprot and E. coli is the larger fraction of 
alpha/beta enzymatic folds in the latter (26/49 versus 34/93 in Swissprot). There are also somewhat more 
non-enzymatic all-alpha and small folds in Swissprot than in the two genomes. This is principally due to 
the greater prevalence of globins, myosins, cytochromes, toxins, and hormones in Swissprot than in 
yeast and E. coli. Many of these, of course, are proteins usually associated with multicellular 
organisms. We did a preliminary version of the fold distribution for the worm C. elegans. As 
expected this distribution turns out to be similar to that of Swissprot (data not shown).

The Yeast Genome Viewed from Different Classification Schemes 

In Figure 4 we focus on the yeast genome in more detail, trying to see the effect that different 
classification schemes have on our results. Although the total number of counts for our statistics 
decreases, of course, when using just yeast rather than all of Swissprot, yeast provides a good reference 
frame to compare a number of classification schemes in as unbiased a fashion as possible. Also, 
yeast is one of the most comprehensively characterized organisms, and there are a number of 
functional classifications available exclusively for this organism. 

In part A we cross-tabulate the structure-function combinations in yeast using the SCOP and EC 
systems as we have done for all of Swissprot in Table 2B. The yeast distribution is fairly similar to 
that of Swissprot with the only major difference being somewhat more alpha/beta transferases and 
fewer alpha/beta hydrolases than expected. (A chi-squared test gives P<~0.05 for the two 
distributions to differ. If either the transferase or hydrolase difference is removed, P increases to 
~20%.) 

Part B shows structure-function combinations based on using the CATH structural classification 
(Orengo et al., 1997) instead of SCOP. For this sub-figure we mapped the SCOP classification of a 
yeast PDB match to its corresponding CATH classification and then cross-tabulated the structure–
function combinations in the various classes. Essentially, this subfigure shows the results of Martin 
et al. (1998) just for yeast. 

In subfigures C and D, which show a COGs versus SCOP cross tabulation, we achieve the opposite 
of subfigure B. We change the functional classification scheme but keep SCOP for classifying 
structures.  As was the case with the enzyme classification, but perhaps even more so, using COGs 
to classify function shows clearly that certain fold classes are associated with certain functions and 
vice versa. Most notably, whereas the functions associated with metabolism, which are mostly 
enzymes, are preferentially associated with the alpha/beta fold class, those associated with cellular 
processes (e.g. secretion) and information processing (e.g. transcription), show no such preference.  
They, in fact, show a marked preference for all-alpha structure. Small proteins are absent from most 
of the COGs classes, except one part of information processing and two in cellular processes. 

The COGs system classifies functions for those proteins that have clear orthologues in different 
species. Thus, conclusions based on using yeast COGs should be readily applicable to other 
genomes.  This point is highlighted in the next sub-figure “4D”, which shows a COGs versus SCOP 
classification for only the 110 COGs that are conserved across all the analyzed genomes (8) and all 
three kingdoms.  Thus, this sub-figure would appear exactly the same for E. coli, M. jannaschii or a 
number of other genomes.  It clearly shows how much more common the information processing 
proteins are among the most conserved and ancient proteins. Moreover, note how these most 
ancient proteins appear to have less of a preference for a particular structural class than the “more 
modern” metabolic ones.  This suggests that large–scale duplication of alpha/beta folds for use in 
metabolism is what gave rise to the stronger fold-function association in Figure 4C. 

Subfigure E shows another functional classification scheme, the MIPS Yeast functional catalogue 
(Mewes et al., 1997) (hereafter just referred to as "MIPS"). Unlike the COGs scheme, this has the 
advantage of being applicable to every yeast ORF. However, it has many more categories and about 
a third of the yeast ORFs are classified into multiple categories (sometimes five or more), making 
interpretation of the results a bit more ambiguous. 

The Most Versatile Folds and the Most Versatile Functions

Returning to considerations of all of Swissprot, Figure 5 lists the 16 most versatile folds. The top 5 
are the TIM-barrel, the alpha-beta hydrolase fold, the Rossmann fold, the P-loop containing NTP 
hydrolase fold, and the ferredoxin fold. Four of these are alpha/beta folds and one is alpha+beta. All 
five have non-enzymatic functions as well as 5 to 15 enzymatic ones. The most versatile folds 
include, in addition, four all-beta and two all-alpha folds. 

Figure 6 lists the 18 functions that have the most different folds associated with them, each having 
at least 3 associated folds. The most versatile functions are those of glycosidases and hydro-lyases 
(3.2.1 and 4.2.1), which are associated with seven different fold types each, recruited from at 
least three different fold classes. The next two most versatile functions, the phosphoric monoester 
hydrolases and the hydrolases acting on linear amides (3.1.3 and 3.5.1), are associated with six different 
fold types each. Most of the versatile functions are associated with folds in completely different 
fold classes. This suggests that these enzymes developed independently, providing many examples 
of convergent evolution. In contrast, only three functions, all oxidoreductases, are associated with 
folds in a single class (last three rows in Figure 6). These folds are all alpha/beta, namely the TIM-
barrel, Rossmann, and Flavodoxin folds. 
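Ranking folds and functions by versatility amounts to counting, for each fold, the distinct functions associated with it, and vice versa. A minimal sketch, assuming the observed combinations are available as (fold, function) pairs; the pairs below are placeholders, not data from Figures 5 and 6.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def rank_by_versatility(
    pairs: Iterable[Tuple[str, str]],
) -> Tuple[List[Tuple[str, int]], List[Tuple[str, int]]]:
    """Return (folds ranked by number of functions,
               functions ranked by number of folds)."""
    functions_by_fold: Dict[str, Set[str]] = defaultdict(set)
    folds_by_function: Dict[str, Set[str]] = defaultdict(set)
    for fold, function in pairs:
        functions_by_fold[fold].add(function)
        folds_by_function[function].add(fold)

    def ranked(table: Dict[str, Set[str]]) -> List[Tuple[str, int]]:
        # Most partners first; ties broken alphabetically for reproducibility.
        counts = [(name, len(partners)) for name, partners in table.items()]
        return sorted(counts, key=lambda item: (-item[1], item[0]))

    return ranked(functions_by_fold), ranked(folds_by_function)


# Placeholder fold-function combinations.
example_pairs = [
    ("fold_A", "3.2.1"), ("fold_A", "4.2.1"), ("fold_A", "0.0.0"),
    ("fold_B", "1.1.1"), ("fold_B", "4.2.1"),
    ("fold_C", "0.0.0"),
]
fold_ranking, function_ranking = rank_by_versatility(example_pairs)
print(fold_ranking)
print(function_ranking)
```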

Specific Functional Convergences involving Different Folds

Even on the level of specificity of 4-component EC-numbers, several enzymatic functions are 
performed by unrelated structures. Figure 1 shows a dramatic example, two different carbonic 
anhydrases with the same EC number 4.2.1.1 but with clearly different structures (Kisker et al., 
1996). Table 3 shows further examples in a more systematic fashion. Most of these occur in 
different evolutionary lineages. For instance, the all-alpha Vanadium chloroperoxidase occurs only 
in fungi, while the alpha/beta non-heme chloroperoxidase occurs only in prokaryotes. Another 
example is beta-glucanase.  It has as many as three different structural representations, from three 
different fold classes. While it has an all-beta structure in B. subtilis, it has an all-alpha variant in B. 
circulans, and an alpha/beta structure in tobacco. 

Specific Functional Divergences on Same Fold

Quite a number of SCOP domains each have sequence similarity with Swissprot proteins of 
different function. We separated these into cases in which the structural domain has similarity to 
proteins with different enzymatic functions only and those in which a domain shows homology to 
both enzymes and non-enzymes (Table 4A and 4B, respectively). Table 4A includes the 
well-known lactalbumin-lysozyme C similarity and the well-documented case of homology between an 
eye-lens structural protein and an enzyme (crystallin and glutathione S-transferase) (Cooper et al., 
1993; Qasba & Kumar, 1997). It includes several other notable divergences, such as the one 
between lysophospholipidase and galectin, and the one between an elastase and an antimicrobial 
protein (Morgan et al., 1991). Remarkably, of the seven domains in this table, three belong to the 
all-beta class. 

“Multifunctionality” versus e-value

Figure 7 shows how the number of “multifunctional” domains, i.e. domains with sequence 
similarity to proteins with different functions, varies as a function of the stringency of the match 
score threshold. We used a minimal version of SCOP in which the structures in PDB were clustered 
into 990 representative domains (see description in caption to Figure 6). The figure shows how the 
percentage of domains that have sequence similarity to proteins with different functions (in terms of 
three-component EC numbers) varies with sequence similarity. This decreases approximately 
monotonically as a function of the exponent of the e-value threshold. Interestingly, there is a 
breaking point around log (e-value) = -5, as the sharply decreasing number of functions slows down 
and the matches reach the level of biological significance. 
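The quantity plotted can be sketched as follows. The denominator used here (domains with at least one match passing the threshold) is an assumption made for the sketch, and the domain identifiers and e-values are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# (representative domain id, 3-component EC number of the hit, e-value)
Match = Tuple[str, str, float]


def multifunctional_fraction(matches: Iterable[Match],
                             log10_threshold: float) -> float:
    """Fraction of matched domains hitting more than one function.

    Only matches with e-value <= 10**log10_threshold are considered.
    """
    cutoff = 10.0 ** log10_threshold
    functions_by_domain: Dict[str, Set[str]] = defaultdict(set)
    for domain, ec3, evalue in matches:
        if evalue <= cutoff:
            functions_by_domain[domain].add(ec3)
    if not functions_by_domain:
        return 0.0
    multifunctional = sum(1 for f in functions_by_domain.values() if len(f) > 1)
    return multifunctional / len(functions_by_domain)


# Invented matches for three representative domains.
example_matches = [
    ("d1", "3.2.1", 2e-3), ("d1", "4.2.1", 5e-12), ("d1", "3.2.1", 3e-21),
    ("d2", "1.1.1", 4e-9), ("d2", "1.1.1", 6e-5),
    ("d3", "2.7.1", 7e-3), ("d3", "3.1.3", 2e-4),
]

for exponent in (-2, -5, -15):
    print(exponent, multifunctional_fraction(example_matches, exponent))
```

In this toy example the fraction falls as the threshold is tightened, mirroring the approximately monotonic decrease described above.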

Our graph can be loosely compared with the classic graph of Chothia and Lesk showing the relation 
of similarity in structure to that in sequence (Chothia & Lesk, 1986). It roughly shows the chance of 
functional similarity (or more precisely the chance of functional difference) with a given level of 
sequence similarity between an enzyme and a protein of unknown function. For example, with an 
e-value of 10^-10, there is only an ~5% chance that an unknown protein homologous to a certain 
enzyme has in fact a different function. Moreover, our graph is in excellent agreement with the 
findings of Russell et al. who also found that the proportion of homologues with different functions 
is around 10% (Russell et al., 1998). This shows that there is a low chance that a single-domain 
protein, highly homologous to a known enzyme, has a different function.
 

DISCUSSION AND CONCLUSIONS

Overview

We have investigated the relationship between the structure and function of proteins by comparing 
functionally characterized enzymes in Swissprot with structurally characterized domains in SCOP. 
It is a timely subject, as the number of three-dimensional protein structures is increasing rapidly and 
the recent completion of several microbial genomes highlights the need for functional 
characterization of the gene products and identification of enzymes participating in metabolic 
pathways (Koonin et al., 1998). 

We tried to be as objective and as unbiased as possible, taking only enzymes with a single assigned 
function and only single-domain matches. We ignored Swissprot proteins with dubious or unknown 
function, or with incomplete sequence. Given these criteria, several tendencies are clear. The 
alpha/beta folds tend to be enzymes. The all-alpha folds tend to be non-enzymes and the all-beta 
and alpha+beta folds tend to have a more even distribution between enzymes and non-enzymes.

Our analysis of proteins from yeast and E. coli has shown that the functional distribution of folds 
does not differ greatly from the whole of Swissprot. E. coli, however, appears to have somewhat 
more alpha/beta enzymes and fewer non-enzymes.

Functional Assignment Complexities

We identified four specific complexities in our functional assignment worth mentioning:
 
(1) There is not always a one-to-one relationship between gene, protein and reaction (Riley, 1998). 
An enzyme can have two functions or two polypeptides from two different genes can oligomerize to 
perform a single function. It might be that some of the fold-function combinations in Figure 2 
occur together in multi-domain proteins (which otherwise were not the subject of this survey). An 
exhaustive screening revealed that only four pairs of folds in Figure 2 were present concurrently in 
multi-domain proteins. Each of these reduced by one the number of independent fold-function 
combinations.  (The four pairs were as follows, with one representative Swissprot protein in each 
category, EC numbers in brackets, and then SCOP fold numbers: PTAA_ECOLI [2.7.1] has 4.049 
and 2.055 folds, TRP_COPCI [4.2.1] has 3.057 and 4.005 folds, URE1_HELFE [3.5.1] has 4.005 
and 2.056 folds, while XYNA_RUMFL [3.2.1] has 2.018 and 3.001 folds.)

(2) The functions associated with similar structures often turn out to be analogous, even if they 
show significant difference in their EC numbers. For example, Acetyl-CoA carboxylase and 
Methylmalonyl-CoA carboxyltransferase enzymes are both actually part of enzyme complexes in 
which they perform the same function, acting as enzyme carriers. This similarity is not reflected in 
their EC classification numbers (6.4.1.2 and 2.1.3.1, respectively). 

(3) More generally, there are clearly some drawbacks to the EC system. The EC system is a 
classification of reactions, not underlying biochemical mechanisms. An enzyme classification 
system based explicitly on reaction mechanism (e.g. "involves pyridoxal phosphate" or "involves 
Ser as a nucleophile") might also prove interesting to compare with protein structure. Alternatively, 
one based on pathways might be worthwhile since, as pointed out by Martin et al. (1998), “it may 
be that more significant relationships occur within pathways, where the substrate is successively 
transferred from enzyme to enzyme along the pathway, requiring similar binding sites at each 
stage”.

(4) In all of Swissprot the majority of the 101 folds with only non-enzymatic functions probably 
have several functions, but we were not able to consider them separately here, lacking a general 
protein function classification system for non-enzymes. Such a system is not easy to derive. For 
instance, if we took only the first three words of all the description lines in Swissprot, we would 
end up with about 10000 different protein functions (besides enzymes). An approximate solution to 
this problem is offered by a recent work that has classified 81% of Swissprot into one of three broad 
categories in an automated fashion (Tamames et al., 1997). However, one way we did tackle this 
problem was by focussing on the yeast genome for which there are a number of overall functional 
classification systems.  This work showed that the preferred association of folds with certain 
functions occurs for non-enzymes as well as enzymes.  Furthermore, the results for the highly 
conserved COGs would be expected to be exactly the same in other genomes.

Biases

Our results are undoubtedly affected to some degree by the biases inherent in the databanks, e.g. 
towards mammalian, medically relevant proteins and towards proteins that crystallize easily. Such 
biases probably result in the higher representation of enzymes in the structural databases (in the 
PDB and therefore in SCOP). This might be the cause of the higher occurrence of alpha/beta 
proteins in our tables and the higher density of matches in this class.

One interesting question related to biases is whether looking only at individual genomes instead of 
the whole database gives different results. Our results for yeast suggest that this is not necessarily 
the case. 

Comparison with Martin et al. (1998)

Martin et al. (1998) performed a similar analysis to the one here. One of the conclusions of their 
careful study was that there was no relationship between the top-level CATH classification and the 
top-level EC class.  This seems to be at odds with our results.  However, we have found the 
conclusions to be consistent. There are a number of reasons for this:

(1)	Martin et al. tabulate statistics on only the proteins in the PDB.  They found a clear alpha/beta 
preference for proteins in the oxidoreductase, transferase, and hydrolase categories (EC 1-3), 
but for the lyase, isomerase, and ligase categories (EC 4-6) they observe different tendencies.  
However, they did not have sufficient counts to establish statistical significance for this latter 
finding. (This is basically what we observe in Figure 4B.)  Because in our analysis we use all of 
Swissprot and we tabulate our statistics a little differently (in terms of combinations), we get 
more “counts” than Martin et al. Thus, we are able to argue that the different distributions of 
fold-function combinations observed for lyases, isomerases, and ligases are significant.  This is 
borne out by the chi-squared statistics at the end of Table 2. 

(2)	Martin et al.'s “no-relationship” conclusion applies only to comparisons between the different 
enzyme classes.  However, we find our largest differences when comparing non-enzymes to 
enzymes and also comparing between the various types of non-enzymes.  

(3) The CATH classification that Martin et al. use has only three classes in its topmost level.  In 
contrast, SCOP has six top classes (Table 1).  While this larger number of categories does tend 
to degrade our statistics somewhat, it also highlights some differences that cannot be observed 
in terms of the CATH classes alone - e.g. we find clear differences between alpha+beta and 
alpha/beta proteins and also between small proteins and all others.

Apparently High Occurrence of Convergent Evolution

Note that the table in Figure 2 is not square: it has more folds than functions. This shape leads to a 
number of interesting conclusions. The 331 fold-function combinations we observe for 229 folds 
and 92 functions imply that there are 1.4 functions per fold and 3.6 folds per function (331/229 and 331/92). However, 
these numbers are somewhat skewed by the large number of folds (101) associated only with the 
single non-enzymatic function. If we exclude these, we get 128 “enzyme-related” folds, which are, 
in turn, associated with 230 (=331-101) different fold-function combinations. This implies that for 
the enzyme-related folds there are on average 1.8 functions per fold and 2.5 folds per function 
(230/128 and 230/92). The larger number of folds per function than functions per fold seems to 
suggest that nature tends to reinvent an enzymatic function (i.e. convergent evolution) more often 
than modify an already existing one (i.e. functional divergence). 
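
The averages quoted above follow directly from the counts given in the text. A minimal Python sketch of the arithmetic is shown here; the variable names are ours and are introduced only for illustration.

```python
# Counts taken from the text; variable names are illustrative only.
combinations = 331           # observed fold-function combinations
folds = 229                  # folds with at least one match
functions = 92               # functions (rows of the fold-function table)
nonenzyme_only_folds = 101   # folds seen only with the single non-enzymatic function

# Averages over the whole table.
functions_per_fold = combinations / folds
folds_per_function = combinations / functions

# Averages restricted to the "enzyme-related" folds.
enzyme_folds = folds - nonenzyme_only_folds
enzyme_combinations = combinations - nonenzyme_only_folds
enzyme_functions_per_fold = enzyme_combinations / enzyme_folds
enzyme_folds_per_function = enzyme_combinations / functions

print(round(functions_per_fold, 1), round(folds_per_function, 1))
print(round(enzyme_functions_per_fold, 1), round(enzyme_folds_per_function, 1))
```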

How can we explain this? First, 1.8 is a lower bound on the number of functions per fold, as the 
non-enzymatic functions were bundled into one group here. Second, there are several examples of 
functional divergence for a fold within one 3-component enzyme category that are not reflected in 
our tables.  For instance, the 1.1.1 category has 248 different enzymes, which all share the same 
fold. Third, the results in this paper were derived from databases comprising data from several 
organisms. It is quite possible that within one organism, functional divergence is more prevalent 
than convergent evolution.

Superfolds and Superfunctions

Are functions more diverse for the more common folds? To some degree this brings up a "chicken-
and-the-egg" issue.  Do folds have more functions because they occur more often or is it the other 
way around? The commonness of a fold is often quantified by the number of non-homologous 
sequence families accommodated by the fold, and folds accommodating many families of diverse 
sequences have been dubbed “superfolds” (Orengo et al., 1993). We find that there seems to be a 
loose connection between the number of diverse sequence families associated with a particular fold 
(in SCOP) and the functional diversity of that fold. For instance, the top superfold is the TIM-
barrel; it also has the most functions associated with it (15 different enzymatic functions as shown 
in Figure 4). On the other hand, there are exceptions: the alpha/beta hydrolases and the Rossmann 
fold are both associated with 22 sequence families in SCOP, but while the former has eight 
different enzymatic functions, the latter has only three. 

Finally, while there is a high incidence of particular functions with many folds (“superfunctions”), 
as well as folds with many functions, the distribution of superfunctions appears to be more uniform 
and less concentrated on a few exceptionally versatile individuals than is the case for folds. That is, 
comparing Figures 3 and 4 one can see that the top 9 most versatile functions are associated with 5 
to 7 folds while the top 9 most versatile folds carry out from 6 to as many as 16 functions. This last 
value is for the TIM-barrel and underscores the uniqueness of this fold as a generic scaffold (see 
Figure 1 for an illustration of this fold). 

Why Folds are associated with Functions: Chemistry vs. History

Why is a certain fold chosen to carry out a particular function?  It is, of course, not possible to 
answer this question definitively at present.  However, there are two broad themes that emerge from 
our analysis.  The first is favorable chemistry.  Perhaps the TIM-barrel design simply provides a 
"more efficient" scaffold for enzyme reactions so that is why it is so prevalent.  Another factor is 
history.  Perhaps the association between a particular fold and its function reflects a particular 
"accident" that took place at the beginning of cellular evolution.  However, once this choice was 
made it was impossible to undo even if other folds would be more chemically suitable.  This could 
be the situation for the ribosomal proteins (and is borne out by the results of Figure 4D).

MATERIALS AND METHODS 

Sequence Matching to Swissprot

All the protein sequences in Swissprot 35 were compared with all the protein domain sequences in 
SCOP 1.35 by standard database search programs (WU-BLAST) (Altschul et al., 1990). The 
following five criteria were used in the searches:

(1) At least three of the four components of the EC number are assigned in the DE line of the 
Swissprot entries.
(2) Fragments in Swissprot were excluded (this affected about 10% of the entries).
(3) For WU-BLAST searches an E-value threshold of 0.0001 was used, unless stated otherwise.
(4) Only ‘monoenzymes,’ i.e. proteins with only one enzymatic function, were considered. This 
excluded less than 0.5% of the Swissprot enzymes.
(5) Only ‘single-domain’ matches with Swissprot proteins were taken into consideration. This 
means those proteins that had a match with a SCOP domain covering most of the Swissprot 
protein. Specifically, we required that less than 100 amino acids be left uncovered in the 
Swissprot entry by a match. We are aware that this is only an approximation, as there are 
domains with fewer than 100 amino acids; however, this threshold is considerably less than the 
average length of a SCOP domain (163 residues) and seems reasonable for an automated 
approach.
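
Criteria (1), (2), (4) and (5) can be sketched as a simple filter on each enzyme match. The Python below is only an illustration under stated assumptions, not the code used in the survey: the function names are hypothetical, the EC number is assumed to appear in the description line in the form "(EC n.n.n.n)" with "-" for unassigned components, and the match coordinates are assumed to be 1-based and inclusive. Criterion (3), the E-value threshold, is applied by the search program itself.

```python
import re

MAX_UNCOVERED = 100  # residues allowed outside the matched SCOP domain


def ec_numbers(de_line):
    """Return all EC numbers found in a description line (assumed format)."""
    return re.findall(r"EC (\d+(?:\.(?:\d+|-)){3})", de_line)


def assigned_components(ec):
    """Count the leading EC components that are numbers rather than '-'."""
    count = 0
    for part in ec.split("."):
        if not part.isdigit():
            break
        count += 1
    return count


def passes_enzyme_filters(de_line, is_fragment, protein_length,
                          match_start, match_end):
    """Apply criteria (1), (2), (4) and (5) to one enzyme match."""
    ecs = ec_numbers(de_line)
    if len(ecs) != 1:                      # (4) mono-enzymes only
        return False
    if assigned_components(ecs[0]) < 3:    # (1) at least three EC components
        return False
    if is_fragment:                        # (2) fragments are excluded
        return False
    covered = match_end - match_start + 1
    # (5) fewer than 100 residues may be left uncovered by the match
    return protein_length - covered < MAX_UNCOVERED
```

For example, a 400-residue entry matched over residues 50 to 360 leaves 89 residues uncovered and is kept, whereas a 500-residue entry matched over residues 1 to 350 is rejected.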

All the searches were repeated using FASTA with an E-value threshold of 0.01 (Pearson, 1998; 
Pearson & Lipman, 1988). The results obtained by the two different comparison programs were in 
agreement with each other. That is, the FASTA searches did not result in any new combinations of 
folds and enzymatic functions (i.e. no new dot in Figure 2), and therefore are not shown. 


Sequence matching to the Yeast genome

To get as great a coverage of the yeast genome as possible, we did the sequence comparison for 
Figure 4 alone using an altered protocol.  We first ran the PDB against the yeast genome using FASTA 
and kept all matches with an E-value better than 0.01 (Pearson, 1998; Pearson & Lipman, 1988).  
Then, to increase our number of matches further, we used the PSI-BLAST program (Altschul et al., 
1997). This program is somewhat more complex to run than FASTA: it involves embedding the 
yeast genome in NRDB and running PDB query sequences against it iteratively, adding 
the matches found at each round to a growing profile.  We used PSI-BLAST parameters adapted 
from Teichmann et al. (1998): an E-value threshold of 0.0005 for including matches in the profile, and 
iteration up to 30 times or to convergence.  We did not parse the output after every iteration; instead, we 
accepted matches at the final iteration that had E-values better than 0.0001. The number of 
iterations to convergence varies depending on the PDB domain being run. Runs that take many 
iterations, such as those for the immunoglobulin superfamily, take quite a long time (up to half an hour on 
a DEC 500 MHz workstation) and create large output files. In total, PSI-BLAST finds many more 
matches than either FASTA or WU-BLAST.  However, it has problems with certain small and 
compositionally biased proteins.  We used FASTA for these and also tried to remove compositional 
bias by running the SEG program with standard parameters (Wootton & Federhen, 1996).
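
As a rough illustration of how the two sets of yeast matches might be combined, the Python sketch below accepts a match if it passes either program's threshold. Taking the union is our reading of "to increase our number of matches further"; the function name and the tuple layout of the hits are hypothetical, and no parsing of real FASTA or PSI-BLAST output is attempted.

```python
FASTA_MAX_E = 0.01        # FASTA acceptance threshold quoted in the text
PSIBLAST_MAX_E = 0.0001   # final-iteration PSI-BLAST threshold quoted in the text


def accepted_pairs(fasta_hits, psiblast_hits):
    """Return the union of (orf, domain) pairs passing either threshold.

    Each input is an iterable of (orf, domain, e_value) tuples.
    """
    accepted = set()
    for orf, domain, e_value in fasta_hits:
        if e_value < FASTA_MAX_E:
            accepted.add((orf, domain))
    for orf, domain, e_value in psiblast_hits:
        if e_value < PSIBLAST_MAX_E:
            accepted.add((orf, domain))
    return accepted
```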

How the Structural Classifications were Used: SCOP and CATH

SCOP hierarchically clusters all the domains in the PDB database, assigning a 5-component number 
to each domain (Murzin et al., 1995). The first component in the SCOP numbers denotes the 
structural class to which the domain in question belongs. The second component of the SCOP 
numbers designates the 'fold' type of the domain. There are altogether 361 different fold types in 
SCOP 1.35. The 6 SCOP classes used in this survey are listed in Table 1B.
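
The way the class and fold are read off a SCOP number can be sketched in Python as follows. This is an illustration only: the function name is hypothetical, the dotted, zero-padded layout is assumed from fold numbers such as 4.049 quoted elsewhere in this paper, and the abbreviations are those of Table 1B.

```python
# Class abbreviations as listed in Table 1B.
CLASS_ABBREVIATION = {
    "1": "A", "2": "B", "3": "A/B", "4": "A+B",
    "5": "MULTI", "6": "TM", "7": "SML",
}


def scop_class_and_fold(scop_number):
    """Split a dotted SCOP number into (class, fold) identifiers.

    The first component is the class; the first two components together
    identify the fold.
    """
    parts = scop_number.split(".")
    if len(parts) < 2:
        raise ValueError(f"need at least class and fold: {scop_number!r}")
    return parts[0], ".".join(parts[:2])


scop_class, fold = scop_class_and_fold("3.001.001.001.001")
print(CLASS_ABBREVIATION[scop_class], fold)
```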

In this study, a 95% non-redundant subset of SCOP was used, i.e. all pairs of domains had less than 
95% sequence identity. This set is denoted pdb95d and is available from the SCOP website 
(scop.mrc-lmb.cam.ac.uk). We used version 1.35, which had 2314 protein domains. (The yeast 
analysis used a more recent version of SCOP, 1.38, which had 3206 domains.)

The CATH classification classifies structures in a fashion analogous to SCOP (Orengo et al., 1997). 
However, the exact structure of the classification is not the same, with an additional architecture 
level inserted between the top-level class and the fold-level. In our use of the classification, we 
created a limited mapping table that associated each SCOP domain in pdb95d with its 
corresponding classification in CATH 1.4. This was not always possible to do unambiguously. As a 
result, we left out the ambiguous matches from the statistics. 

How the Functional Classifications were Used: ENZYME, COGS, and MIPS

The EC numbers of enzymes are composed of four components (Barrett, 1997): (i) The first 
component shows to which of the six main divisions the enzyme belongs; (ii) the second figure 
indicates the subclass (referring to the donor in oxidoreductases or the group transferred in 
transferases, or the affected bond in hydrolases, lyases or ligases); (iii) the third figure indicates the 
sub-subclass (e.g. indicating the type of acceptor in oxidoreductases) and (iv) the fourth figure gives 
the serial number of the enzyme in its sub-subclass. The six main divisions are listed in Table 1A.
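
Because the survey treats each 3-component EC number as one function, a 4-component number has to be truncated before counting. A minimal Python sketch, with hypothetical function names and the division names of Table 1A:

```python
# The six main EC divisions, as listed in Table 1A.
EC_DIVISIONS = {
    1: "Oxidoreductases", 2: "Transferases", 3: "Hydrolases",
    4: "Lyases", 5: "Isomerases", 6: "Ligases",
}


def three_component_ec(ec_number):
    """Truncate a 4-component EC number to its first three components."""
    parts = ec_number.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts[:3]):
        raise ValueError(f"need three assigned EC components: {ec_number!r}")
    return ".".join(parts[:3])


def division_name(ec_number):
    """Name of the main division given by the first EC component."""
    return EC_DIVISIONS[int(ec_number.split(".")[0])]


print(three_component_ec("6.4.1.2"), division_name("6.4.1.2"))
```

With this truncation, the two carboxylase-related enzymes mentioned in the Discussion (6.4.1.2 and 2.1.3.1) fall into different functional categories, 6.4.1 and 2.1.3.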

In the analysis of all of Swissprot, when we counted the number of non-enzymatic matches, all the 
proteins called ‘HYPOTHETICAL’ and all the proteins having an ‘-ase’ word ending but lacking 
an EC number in their description were excluded, because of their functional ambiguity. For 
relating the sequence matches of the yeast genome to the EC system, we used essentially the same 
criteria as we did for all of Swissprot (see above): single-domain, mono-enzyme matches with at 
least a 3-component EC number.

The COGs and especially the MIPS classifications are a bit more complex than the EC system in 
that they include non-enzymes as well as enzymes (Tatusov et al., 1997; Koonin et al., 1998; 
Mewes et al., 1997). They often associate multiple functions or roles with a given yeast ORF.  This 
happens for more than a third of the yeast ORFs with MIPS.  In this case, if we could clearly show a 
PDB match was associated with a single functional domain we made only that pairing.  Otherwise 
we associated all the functions assigned to a given PDB match to its respective fold.

Availability of Results over the Internet

A number of detailed tables relevant to this paper will be made available over the Internet at 
http://bioinfo.mbb.yale.edu/genome/foldfunc -- in particular, a “clickable” version of Figure 2 and 
large data files giving all the fold assignments and fold-function combinations for Swissprot and 
yeast.

Acknowledgements

We thank the Donaghue Foundation and the ONR for financial support (grant N000149710725). 
We thank Ted Johnson for help with the minimal version of the SCOP database. 


REFERENCES
Altschul, S., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). Basic local alignment 
search tool. J. Mol. Biol. 215, 403-410. 
Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. 
(1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search 
programs. Nucleic Acids Res 25, 3389-402. 
Attwood, T. K., Beck, M. E., Flower, D. R., Scordis, P. & Selley, J. N. (1998). The PRINTS protein 
fingerprint database in its fifth year. Nucleic Acids Res 26, 304-8. 
Bairoch, A. (1996). The ENZYME data bank in 1995. Nucleic Acids Res 24, 221-2. 
Bairoch, A. & Apweiler, R. (1998). The SWISS-PROT protein sequence data bank and its 
supplement TrEMBL in 1998. Nucleic Acids Res 26, 38-42. 
Bairoch, A., Bucher, P. & Hofmann, K. (1997). The PROSITE database, its status in 1997. Nucleic 
Acids Res 25, 217-21. 
Barrett, A. J. (1997). Nomenclature Committee of the International Union of Biochemistry and 
Molecular Biology (NC-IUBMB). Enzyme Nomenclature. Recommendations 1992. 
Supplement 4: corrections and additions (1997). Eur J Biochem 250, 1-6. 
Bork, P. & Eisenberg, D. (1998). Deriving biological knowledge from genomic sequences. Current 
Opinion in Structural Biology 8, 331-332. 
Bork, P. & Koonin, E. V. (1998). Predicting functions from protein sequences--where are the 
bottlenecks? Nat Genet 18, 313-8. 
Bork, P., Ouzounis, C. & Sander, C. (1994). From Genome Sequences to Protein Function. Curr. 
Opin. Struct. Biol. 4, 393-403. 
Bork, P., Sander, C. & Valencia, A. (1993). Convergent evolution of similar enzymatic function on 
different protein folds: the hexokinase, ribokinase, and galactokinase families of sugar kinases. 
Protein Sci 2, 31-40. 
Chen, L., DeVries, A. L. & Cheng, C. H. (1997). Convergent evolution of antifreeze glycoproteins 
in Antarctic notothenioid fish and Arctic cod. Proc Natl Acad Sci U S A 94, 3817-22. 
Chothia, C. & Lesk, A. M. (1986). The relation between the divergence of sequence and structure in 
proteins. EMBO J. 5, 823-826. 
Cooper, D. L., Isola, N. R., Stevenson, K. & Baptist, E. W. (1993). Members of the ALDH gene 
family are lens and corneal crystallins. Adv Exp Med Biol 328, 169-79. 
Coque, J. J., Liras, P. & Martin, J. F. (1993). Genes for a beta-lactamase, a penicillin-binding 
protein and a transmembrane protein are clustered with the cephamycin biosynthetic genes in 
Nocardia lactamdurans. EMBO J 12, 631-9. 
Corpet, F., Gouzy, J. & Kahn, D. (1998). The ProDom database of protein domain families. Nucleic 
Acids Res 26, 323-6. 
des Jardins, M., Karp, P. D., Krummenacker, M., Lee, T. J. & Ouzounis, C. A. (1997). Prediction of 
enzyme classification from protein sequence without the use of sequence similarity. Ismb 5, 92-9. 
Doolittle, R. F. (1994). Convergent evolution: the need to be explicit. Trends Biochem Sci 19, 15-8. 
Fabian, P., Murvai, J., Hatsagi, Z., Vlahovicek, K., Hegyi, H. & Pongor, S. (1997). The SBASE 
protein domain library, release 5.0: a collection of annotated protein sequence segments. 
Nucleic Acids Res 25, 240-3. 
Frishman, D. & Mewes, H.-W. (1997). Protein structural classes in five complete genomes. Nature 
Struct. Biol. 4, 626-628. 
Galperin, M. Y., Walker, D. R. & Koonin, E. V. (1998). Analogous enzymes: independent 
inventions in enzyme evolution. Genome Res 8, 779-90. 
Gerstein, M. (1997). A Structural Census of Genomes: Comparing Eukaryotic, Bacterial and 
Archaeal Genomes in terms of Protein Structure. J. Mol. Biol. 274, 562-576. 
Gerstein, M. (1998a). How Representative are the Known Structures of the Proteins in a Complete 
Genome? A Comprehensive Structural Census. Folding & Design 3, 497-512. 
Gerstein, M. (1998b). Patterns of Protein-Fold Usage in Eight Microbial Genomes: A 
Comprehensive Structural Census. Proteins 33, 518-534. 
Gerstein, M. & Hegyi, H. (1998). Comparing Microbial Genomes in terms of Protein Structure: 
Surveys of a Finite Parts List. FEMS Microbiology Reviews 22, 277-304. 
Gerstein, M. & Levitt, M. (1997). A Structural Census of the Current Population of Protein 
Sequences. Proc. Natl. Acad. Sci. USA 94, 11911-11916. 
Hellinga, H. W. (1997). Rational protein design: combining theory and experiment. Proc Natl Acad 
Sci U S A 94, 10015-7. 
Hellinga, H. W. (1998). Computational protein engineering. Nat Struct Biol 5, 525-7. 
Henikoff, S., Pietrokovski, S. & Henikoff, J. G. (1998). Superior performance in protein homology 
detection with the Blocks Database servers. Nucleic Acids Res 26, 309-12. 
Hodges, P. E., Payne, W. E. & Garrels, J. I. (1998). The Yeast Protein Database (YPD): a curated 
proteome database for Saccharomyces cerevisiae. Nucleic Acids Res 26, 68-72. 
Holm, L. & Sander, C. (1998). Touring protein fold space with Dali/FSSP. Nucleic Acids Res 26, 
316-9. 
Ibba, M., Bono, J. L., Rosa, P. A. & Soll, D. (1997a). Archaeal-type lysyl-tRNA synthetase in the 
Lyme disease spirochete Borrelia burgdorferi. Proc Natl Acad Sci U S A 94, 14383-8. 
Ibba, M., Morgan, S., Curnow, A. W., Pridmore, D. R., Vothknecht, U. C., Gardner, W., Lin, W., 
Woese, C. R. & Soll, D. (1997b). A euryarchaeal lysyl-tRNA synthetase: resemblance to class I 
synthetases. Science 278, 1119-22. 
Karp, P. (1998). What we do not know about sequence analysis and sequence databases. 
Bioinformatics 14, 753-754. 
Karp, P. D., Riley, M., Paley, S. M., Pellegrini-Toole, A. & Krummenacker, M. (1998). EcoCyc: 
Encyclopedia of Escherichia coli genes and metabolism. Nucleic Acids Res 26, 50-3. 
Kisker, C., Schindelin, H., Alber, B. E., Ferry, J. G. & Rees, D. C. (1996). A left-hand beta-helix 
revealed by the crystal structure of a carbonic anhydrase from the archaeon Methanosarcina 
thermophila. Embo J 15, 2323-30. 
Koonin, E. V. & Galperin, M. Y. (1997). Prokaryotic genomes: the emerging paradigm of genome-
based microbiology. Curr Opin Genet Dev 7, 757-63. 
Koonin, E. V. & Tatusov, R. L. (1994). Computer analysis of bacterial haloacid dehalogenases 
defines a large superfamily of hydrolases with diverse specificity. Application of an iterative 
approach to database search. J Mol Biol 244, 125-32. 
Koonin, E. V., Tatusov, R. L. & Galperin, M. Y. (1998). Beyond complete genomes: from sequence 
to structure and function. Curr Opin Struct Biol 8, 355-63. 
Kraulis, P. J. (1991). MOLSCRIPT - A program to produce both detailed and schematic plots of 
protein structures. J. Appl. Cryst. 24, 946-950. 
Martin, A. C., Orengo, C. A., Hutchinson, E. G., Jones, S., Karmirantzou, M., Laskowski, R. A., 
Mitchell, J. B., Taroni, C. & Thornton, J. M. (1998). Protein folds and functions. Structure 6, 
875-84. 
Marvin, J. S., Corcoran, E. E., Hattangadi, N. A., Zhang, J. V., Gere, S. A. & Hellinga, H. W. 
(1997). The rational design of allosteric interactions in a monomeric protein and its applications 
to the construction of biosensors. Proc Natl Acad Sci U S A 94, 4366-71. 
Mewes, H. W., Albermann, K., Bahr, M., Frishman, D., Gleissner, A., Hani, J., Heumann, K., 
Kleine, K., Maierl, A., Oliver, S. G., Pfeiffer, F. & Zollner, A. (1997). Overview of the yeast 
genome. Nature 387, 7-65. 
Morgan, J. G., Sukiennicki, T., Pereira, H. A., Spitznagel, J. K., Guerra, M. E. & Larrick, J. W. 
(1991). Cloning of the cDNA for the serine protease homolog CAP37/azurocidin, a 
microbicidal and chemotactic protein from human granulocytes. J Immunol 147, 3210-4. 
Murzin, A., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). SCOP: A Structural Classification of 
Proteins for the Investigation of Sequences and Structures. J. Mol. Biol. 247, 536-540. 
Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H. & Kanehisa, M. (1999). KEGG: Kyoto 
Encyclopedia of Genes and Genomes. Nucleic Acids Res 27, 29-34. 
Orengo, C. A., Flores, T. P., Taylor, W. R. & Thornton, J. M. (1993). Identifying and Classifying 
Protein Fold Families. Prot. Eng. 6, 485-500. 
Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B. & Thornton, J. M. (1997). 
CATH--a hierarchic classification of protein domain structures. Structure 5, 1093-108. 
Pearson, W. R. (1996). Effective Protein Sequence Comparison. Meth. Enz. 266, 227-259. 
Pearson, W. R. (1998). Empirical statistical estimates for sequence similarity searches. J Mol Biol 
276, 71-84. 
Pearson, W. R. & Lipman, D. J. (1988). Improved Tools for Biological Sequence Analysis. Proc. 
Natl. Acad. Sci. USA 85, 2444-2448. 
Qasba, P. K. & Kumar, S. (1997). Molecular divergence of lysozymes and alpha-lactalbumin. Crit 
Rev Biochem Mol Biol 32, 255-306. 
Riley, M. (1997). Genes and proteins of Escherichia coli K-12 (GenProtEC). Nucleic Acids Res 25, 
51-2. 
Russell, R. B. (1998). Detection of protein three-dimensional side-chain patterns: new examples of 
convergent evolution. J Mol Biol 279, 1211-27. 
Russell, R. B., Sasieni, P. D. & Sternberg, M. J. E. (1998). Supersites Within Superfolds. Binding 
Site Similarity in the Absence of Homology. J Mol Biol 282, 903-918. 
Seery, L. T., Nestor, P. V. & FitzGerald, G. A. (1998). Molecular evolution of the aldo-keto 
reductase gene superfamily. J Mol Evol 46, 139-46. 
Selkov, E., Galimova, M., Goryanin, I., Gretchkin, Y., Ivanova, N., Komarov, Y., Maltsev, N., 
Mikhailova, N., Nenashev, V., Overbeek, R., Panyushkina, E., Pronevitch, L. & Selkov, E., Jr. 
(1997). The metabolic pathway collection: an update. Nucleic Acids Res 25, 37-8. 
Sonnhammer, E., Eddy, S. & Durbin, R. (1997). Pfam: a Comprehensive Database of Protein 
Domain Families Based on Seed Alignments. Proteins 28, 405-20. 
Tamames, J., Casari, G., Ouzounis, C. & Valencia, A. (1997). Conserved clusters of functionally 
related genes in two bacterial genomes. J Mol Evol 44, 66-73. 
Tatusov, R. L., Koonin, E. V. & Lipman, D. J. (1997). A genomic perspective on protein families. 
Science 278, 631-7. 
Teichmann, S., Park, J. & Chothia, C. (1998). Structural assignments to the proteins of 
Mycoplasma genitalium show that they have been formed by extensive gene duplications and 
domain rearrangements. Proc. Natl. Acad. Sci. 95, 14658-63. 
Wootton, J. C. & Federhen, S. (1996). Analysis of compositionally biased regions in sequence 
databases. Methods Enzymol 266, 554-71.

Tables
Table 1, Broad Structural and Functional Categories

A. Functional categories in Swissprot 35

EC Category   Category Name     Abbreviation   Num. of Functions in Category
0.0.0         Non-enzymes       NONENZ           1
1.*.*         Oxidoreductases   OX              86
2.*.*         Transferases      TRAN            28
3.*.*         Hydrolases        HYD             53
4.*.*         Lyases            LY              15
5.*.*         Isomerases        ISO             16
6.*.*         Ligases           LIG              9

                                Total:         208

List of the functional (enzymatic) categories in Swissprot and the abbreviations used throughout the 
paper. The values denote the number of 3-component EC-numbers in each category.

B. Structural classes in SCOP 1.35

Fold Class   Class Name        Abbreviation   Num. of Folds in Class
1            All-alpha         A               81
2            All-beta          B               57
3            Alpha and beta    A/B             70
4            Alpha plus beta   A+B             91
5            Multidomain       MULTI           19
6            Transmembrane     TM               9
7            Small proteins    SML             43

                               Total:         361

List of the structural classes in SCOP studied in this paper and the abbreviations used for the 
classes. Values denote the number of folds in each class in SCOP 1.35. Class 6 is not used in the 
analysis here.  


Table 2, Statistics over 42 structure-function classes

This table shows various totals from Figure 2 distributed among the 42 structure-function classes -- 
i.e. the seven functional categories in Table 1A multiplied by the six structural categories in Table 
1B.  Part A shows how many potential fold-function combinations there are in Figure 2 amongst 
each of the 42 classes. Part B shows how many of these 21068 possible combinations are actually 
observed. Part C shows the total number of different folds (i.e. selected columns in Figure 2) in each 
class. Part D shows the total number of different functions (i.e. selected rows in Figure 2) in each 
class. Part E shows the total number of matching Swissprot proteins in the 42 classes. Note that to 
observe a fold-function combination one only needs the existence of a single match between a 
Swissprot protein and a SCOP domain. However, there can be many more. That is why the totals in 
this table sum up to so much larger an amount than 331. 

Here is an example of how to read parts A to E of the table, focussing on the all-alpha, 
oxidoreductase region. Part A shows that there are 1104 cells, filled or unfilled, in this region, 
corresponding to possible combinations. Part B shows that 13 of these 1104 cells are filled, 
corresponding to observed all-alpha, oxidoreductase combinations. Part C shows that there are 7 
folds, corresponding to columns with filled cells in this region. Part D shows that there are 8 
functions, corresponding to rows with filled cells in this region. Finally, in Part E we find that there 
are 150 Swissprot entries that have matches with a SCOP domain. They correspond to the 13 
observed combinations in Part B.
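
Each entry of Part A is simply a product: the number of folds in a structural class times the number of functions in a functional category. The Python sketch below illustrates this. The fold counts are the NONENZ row of Part A; the per-category function counts are not stated explicitly in the text and are inferred here by dividing each row of Part A by the NONENZ row, so they should be treated as an assumption.

```python
# Folds per structural class (NONENZ row of Part A, where there is 1 function).
fold_counts = {"A": 46, "B": 36, "A/B": 48, "A+B": 56, "MULTI": 15, "SML": 28}

# Functions per category; inferred from Part A, not quoted from the text.
function_counts = {"NONENZ": 1, "OX": 24, "TRAN": 13, "HYD": 29,
                   "LY": 9, "ISO": 10, "LIG": 6}

# Possible fold-function combinations in each of the 42 classes.
possible = {
    (func, struct): n_func * n_fold
    for func, n_func in function_counts.items()
    for struct, n_fold in fold_counts.items()
}

total_possible = sum(possible.values())
print(possible[("OX", "A")], total_possible)
```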

Parts F and G give information on the statistical significance of the differences observed between 
the 42 structure-function classes. Part G gives the significance with which the observed distribution of 
fold-function combinations in a given functional class differs from the average (i.e. it tests the null 
hypothesis that the distribution of fold-function combinations is the same in each functional class). This is very 
similar to the derivation in Martin et al. (1998). A chi-squared statistic is computed for each of the 7 
functional classes in the conventional way: $\chi^2(f) = \sum_s (O_{sf} - E_{sf})^2 / E_{sf}$, where for a given functional 
class f and structural class s, $O_{sf}$ is the observed number of fold-function combinations and $E_{sf}$ is the 
expected number. $E_{sf}$ is simply computed by scaling the "sum" column and row in Part B of the 
table: $E_{sf} = T_s T_f / T$, where $T_s$ is the total number of combinations in a given structural class s (sum 
row), $T_f$ is the total number of combinations in a given functional class f (sum column), and $T$ is the 
total observed number of combinations, 331. Part F gives the statistical significance with which the 
observed distribution of fold-function combinations in a given structural class differs from the 
average. To compute this, one simply sums over functions instead of structures: 
$\chi^2(s) = \sum_f (O_{sf} - E_{sf})^2 / E_{sf}$. After each chi-squared statistic, a rough probability or P-value is given. This is 
the chance that the observed distribution could be obtained randomly. 
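
The chi-squared statistics can be recomputed from the observed counts in Part B. The Python sketch below is an illustration rather than the original calculation: it assumes that the empty cells of Part B are zero, the variable names are ours, and it stops at the statistic without converting it to a P-value. Its output can be compared with the values listed in Parts F and G.

```python
STRUCT = ["A", "B", "A/B", "A+B", "MULTI", "SML"]
FUNC = ["NONENZ", "OX", "TRAN", "HYD", "LY", "ISO", "LIG"]

# Observed combinations (Part B); empty cells are assumed to be 0.
OBSERVED = {
    "NONENZ": [34, 30, 14, 28, 4, 26],
    "OX":     [13, 5, 17, 3, 4, 5],
    "TRAN":   [3, 3, 16, 8, 5, 0],
    "HYD":    [4, 11, 30, 18, 4, 0],
    "LY":     [2, 3, 13, 5, 0, 0],
    "ISO":    [1, 2, 7, 4, 2, 0],
    "LIG":    [0, 1, 2, 3, 1, 0],
}

# Marginal totals: T_f (sum column), T_s (sum row) and T (grand total).
row_total = {f: sum(OBSERVED[f]) for f in FUNC}
col_total = {s: sum(OBSERVED[f][i] for f in FUNC) for i, s in enumerate(STRUCT)}
grand_total = sum(row_total.values())


def term(func, i):
    """One (O - E)^2 / E term, with E = T_s * T_f / T."""
    expected = col_total[STRUCT[i]] * row_total[func] / grand_total
    return (OBSERVED[func][i] - expected) ** 2 / expected


# Sum over structural classes for each functional class, and vice versa.
chi2_by_function = {f: sum(term(f, i) for i in range(len(STRUCT))) for f in FUNC}
chi2_by_structure = {s: sum(term(f, i) for f in FUNC) for i, s in enumerate(STRUCT)}

for name, value in {**chi2_by_structure, **chi2_by_function}.items():
    print(f"{name:7s} {value:5.1f}")
```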


A. Number of possible combinations between folds and functions in each of 42 classes 
(number of cells in Figure 2)

              A      B    A/B    A+B  MULTI    SML    sum
NONENZ       46     36     48     56     15     28    229
OX         1104    864   1152   1344    360    672   5496
TRAN        598    468    624    728    195    364   2977
HYD        1334   1044   1392   1624    435    812   6641
LY          414    324    432    504    135    252   2061
ISO         460    360    480    560    150    280   2290
LIG         276    216    288    336     90    168   1374
sum        4232   3312   4416   5152   1380   2576  21068

B. Number of observed combinations between folds and functions in each of 42 classes 
(number of filled cells in Figure 2)

              A      B    A/B    A+B  MULTI    SML    sum
NONENZ       34     30     14     28      4     26    136
OX           13      5     17      3      4      5     47
TRAN          3      3     16      8      5      -     35
HYD           4     11     30     18      4      -     67
LY            2      3     13      5      -      -     23
ISO           1      2      7      4      2      -     16
LIG           -      1      2      3      1      -      7
sum          57     55     99     69     20     31    331

C. Number of folds in each of the 42 classes (columns with a filled cell in Figure 2)


              A      B    A/B    A+B  MULTI    SML    sum
NONENZ       34     30     14     28      4     26    136
OX            7      5      9      3      3      3     30
TRAN          3      2     15      6      5      -     31
HYD           4      8     19     18      3      -     52
LY            2      3      8      5      -      -     18
ISO           1      2      7      4      2      -     16
LIG           -      1      1      3      1      -      6
sum          51     51     73     67     18     29    289

D. Number of functions in each of the 42 classes (rows with a filled cell in Figure 2)


              A      B    A/B    A+B  MULTI    SML    sum
NONENZ        1      1      1      1      1      1      6
OX            8      5      9      3      3      5     33
TRAN          2      3     13      8      4      -     30
HYD           4      7     19     14      4      -     48
LY            2      2      7      3      -      -     14
ISO           1      2      5      4      1      -     13
LIG           -      1      2      2      1      -      6
sum          18     21     56     35     14      6    150

E. Total number of matching Swissprot sequences in each of the 42 fold-function classes


              A      B    A/B    A+B  MULTI    SML    sum
NONENZ     1940   1159    560    638    106    892   5295
OX          150    202    388     50     68     18    876
TRAN         65     14    363    116    174      -    732
HYD         116    394    295    452     92      -   1349
LY           40     47    168    104      -      -    359
ISO           2     54    122     22      2      -    202
LIG           -      5     26     69     24      -    124
sum        2313   1875   1922   1451    466    910   8937


F. How much does each of the fold classes deviate from the average distribution of 
functions?


            χ²    P
A         17.5    <0.01
B          5.2    <0.6
A/B       32.5    <0.00002
A+B        7.7    <0.3
MULTI      9.9    <0.2
SML       27.8    <0.0002

G. How much does each of the function classes deviate from the average distribution of 
folds?


            χ²    P
NONENZ    40.7    <0.0000002
OX         9.9    <0.08
TRAN      13.1    <0.03
HYD       17.3    <0.005
LY        10.2    <0.08
ISO        5.0    <0.5
LIG        4.3    <0.6
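
The statistics in parts F and G are numerically consistent with a Pearson chi-squared computed on
the counts in part B, with expected counts taken from the row and column sums. This is an inference
from the numbers, not a procedure stated in the table. The sketch below follows that reading, with
blank cells taken as 0 and a standard closed-form upper-tail probability for integer degrees of
freedom (6 for a fold class, 5 for a function class). All function names are illustrative.

```python
import math

CLASSES = ["A", "B", "A/B", "A+B", "MULTI", "SML"]
# Observed fold-function combinations from part B (blank cells taken as 0).
OBSERVED = {
    "NONENZ": [34, 30, 14, 28, 4, 26],
    "OX":     [13, 5, 17, 3, 4, 5],
    "TRAN":   [3, 3, 16, 8, 5, 0],
    "HYD":    [4, 11, 30, 18, 4, 0],
    "LY":     [2, 3, 13, 5, 0, 0],
    "ISO":    [1, 2, 7, 4, 2, 0],
    "LIG":    [0, 1, 2, 3, 1, 0],
}
ROW_SUM = {name: sum(row) for name, row in OBSERVED.items()}
COL_SUM = [sum(row[j] for row in OBSERVED.values()) for j in range(len(CLASSES))]
TOTAL = sum(COL_SUM)


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df."""
    half = x / 2.0
    if df % 2 == 0:
        term = total = 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return total


def chi2_fold_class(fold_class):
    """Deviation of one fold class (column) from the average function distribution."""
    j = CLASSES.index(fold_class)
    stat = 0.0
    for name, row in OBSERVED.items():
        expected = COL_SUM[j] * ROW_SUM[name] / TOTAL
        stat += (row[j] - expected) ** 2 / expected
    return stat, chi2_sf(stat, len(OBSERVED) - 1)


def chi2_function_class(function_class):
    """Deviation of one function class (row) from the average fold distribution."""
    stat = 0.0
    for j, observed in enumerate(OBSERVED[function_class]):
        expected = ROW_SUM[function_class] * COL_SUM[j] / TOTAL
        stat += (observed - expected) ** 2 / expected
    return stat, chi2_sf(stat, len(CLASSES) - 1)


print(chi2_fold_class("A"), chi2_function_class("NONENZ"))
```

On the part B counts this gives roughly 17.5 for the A fold class and 40.7 for NONENZ, in line with
the tabulated values.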




Table 3, Specific Convergences

Explicit enzymatic functions associated with different folds. Of the 13 different enzyme functions 
listed, eight are hydrolases, five of which belong to the 3.2.1 EC category. One of them, 
beta-glucanase, is associated with three different folds. Note that most of the enzymes in the table 
are associated with folds from different classes. Even when the folds are from the same class, as in 
the case of protein-tyrosine phosphatases, they are clearly different. Fold numbers are from SCOP 
1.35. Domain identifiers follow the SCOP syntax d1pdbcN, where “1pdb” is a PDB id, “c” is a 
chain identifier, and “N” indicates whether this is the first, second, or only domain in the chain. 
Thus, d1ggta1 is the first domain in the A chain of 1GGT.

EC #       Enzymatic function            Fold #1    Dom #1   Swissprot 1 Fold #2    Dom #2   Swissprot 2
1.11.1.10  CHLOROPEROXIDASE              3.048.001  d1broa_  PRXC_PSEPY  1.068.001  d1vnc__  PRXC_CURIN
1.15.1.1   SUPEROXIDE DISMUTASE          2.001.007  d1srda_  SOD1_ORYSA  4.023.001  d1mnga2  SODM_BACCA
3.1.3.48   PROTEIN-TYROSINE PHOSPHATASE  3.028.001  d1phr__  PTPA_STRCO  3.029.001  d2hnp__  PYP3_SCHPO
3.1.26.4   RIBONUCLEASE H                3.038.003  d2rn2__  RNH_ECOLI   3.039.001  d1tfr__  RNH_BPT4
3.2.1.4    ENDOGLUCANASE                 1.061.001  d1cem__  GUN_BACSP   3.001.001  d1ecea_  GUN_BACPO
3.2.1.8    XYLANASE                      2.018.001  d1yna__  XYN_TRIHA   3.001.001  d2exo__  XYNB_THENE
3.2.1.14   ENDOCHITINASE                 3.001.001  d1hvq__  CHIA_TOBAC  4.002.001  d2baa__  CHIX_PEA
3.2.1.73   BETA-GLUCANASE*               3.001.001  d1ghr__  GUB_NICPL   2.018.001  d1gbg__  GUB_BACSU
3.2.1.91   EXOGLUCANASE                  2.018.001  d1cela_  GUX1_TRIVI  3.002.001  d1cb2a_  GUX3_AGABI
3.5.2.6    BETA-LACTAMASE                5.003.001  d1btl__  BLP4_PSEAE  4.083.001  d1bmc__  BLAB_BACCE
4.2.1.1    CARBONIC ANHYDRASE            2.053.001  d1thja_  CAH_METTE   2.047.001  d2cba__  CAHZ_BRARE
5.2.1.8    CIS-TRANS ISOMERASE           4.018.001  d1fkd__  MIP_TRYCR   2.041.001  d2cpl__  CYPR_DROME
5.4.99.5   CHORISMATE MUTASE             1.079.001  d1csma_  CHMU_YEAST  4.037.001  d2chsa_  CHMU_BACSU
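
The domain identifier syntax described in the caption to Table 3 can be illustrated with a small
parser. This is a minimal sketch: the names ScopDomain and parse_scop_domain are hypothetical, and
the reading of an underscore as "no chain label" or "only domain in the chain" is an assumption
based on the identifiers in the table (e.g. d1vnc__).

```python
from typing import NamedTuple, Optional


class ScopDomain(NamedTuple):
    pdb_id: str            # four-character PDB id, upper-cased
    chain: Optional[str]   # None when the identifier has "_" (assumed: no chain label)
    domain: Optional[int]  # None when "_" (assumed: only domain in the chain)


def parse_scop_domain(identifier: str) -> ScopDomain:
    """Split an identifier of the form d1pdbcN into its three parts."""
    if len(identifier) != 7 or identifier[0] != "d":
        raise ValueError(f"not a 7-character SCOP domain identifier: {identifier!r}")
    pdb_id = identifier[1:5].upper()
    chain = None if identifier[5] == "_" else identifier[5].upper()
    domain = None if identifier[6] == "_" else int(identifier[6])
    return ScopDomain(pdb_id, chain, domain)


print(parse_scop_domain("d1ggta1"))
print(parse_scop_domain("d1vnc__"))
```

For d1ggta1 this yields PDB id 1GGT, chain A, domain 1, matching the example in the caption.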


Table 4, Specific Divergences 
List of SCOP domains that are each homologous to several Swissprot proteins with significantly 
different functions. Part A. Domains homologous to proteins with different (in the last three 
components of EC numbers) enzymatic functions. In most cases, the enzymatic functions remain 
analogous, as reflected in the names of the enzymes. Part B. Domains homologous to proteins with 
both enzymatic and non-enzymatic functions. (See Table 3 for the SCOP domain syntax.)

A. Two different enzymatic functions

SCOP domain | Fold number | Swissprot 1 | EC num 1 | Function 1 | Swissprot 2 | EC num 2 | Function 2
d2abk__ | 1.001.054.001.001.001 | END3_ECOLI | 4.2.99.18 | ENDONUCLEASE III | GTMR_METTF | 3.2.2.- | POSSIBLE G-T MISMATCHES REPAIR ENZYME
d1bdo__ | 1.002.055.001.001.001 | BCCP_ECOLI | 6.4.1.2 | BIOTIN CARBOXYL CARRIER PROTEIN OF ACETYL-COA CARBOXYLASE | BCCP_PROFR | 2.1.3.1 | BIOTIN CARBOXYL CARRIER PROTEIN OF METHYLMALONYL-COA CARBOXYL-TRANSFERASE
d1dhpa_ | 1.003.001.003.001.004 | NPL_ECOLI | 4.1.3.3 | N-ACETYLNEURAMINATE LYASE SUBUNIT | DAPA_BACSU | 4.2.1.52 | DIHYDRODIPICOLINATE SYNTHASE
d1hdca_ | 1.003.018.001.002.005 | ENTA_ECOLI | 1.3.1.28 | 2,3 DIHYDRO-2,3 DIHYDROXYBENZOATE DEHYDROGENASE | ADHI_DROMO | 1.1.1.1 | ALCOHOL DEHYDROGENASE 1
d1nipa_ | 1.003.024.001.005.003 | BCHL_RHOCA | 1.3.1.33 | PROTOCHLOROPHILLIDE REDUCTASE 33 KD SUBUNIT | NIFH_THIFE | 1.18.6.1 | NITROGENASE IRON PROTEIN
d1gara_ | 1.003.043.001.001.001 | PUR3_YEAST | 2.1.2.2 | PHOSPHORIBOSYLGLYCINAMIDE FORMYLTRANSFERASE | PURU_CORSP | 3.5.1.10 | FORMYLTETRAHYDROFOLATE DEFORMYLASE
d2dkb__ | 1.003.045.001.003.001 | OAT_RAT | 2.6.1.13 | ORNITHINE AMINOTRANSFERASE PRECURSOR | GSAB_BACSU | 5.4.3.8 | GLUTAMATE-1-SEMIALDEHYDE 2,1-AMINOMUTASE 2
d1ede__ | 1.003.048.001.003.001 | DMPD_PSEPU | 3.1.1.- | 2-HYDROXYMUCONIC SEMIALDEHYDE HYDROLASE | HALO_XANAU | 3.8.1.5 | HALOALKANE DEHALOGENASE
d1fua__ | 1.003.053.001.001.001 | ARAD_ECOLI | 5.1.3.4 | L-RIBULOSE-5-PHOSPHATE 4-EPIMERASE | FUCA_ECOLI | 4.1.2.17 | L-FUCULOSE PHOSPHATE ALDOLASE
d1lmn__ | 1.004.002.001.002.010 | LCA_RAT | 2.4.1.22 | ALPHA-LACTALBUMIN PRECURSOR | LYC1_PIG | 3.2.1.17 | LYSOZYME C-1
d1frva_ | 1.005.015.001.001.001 | FRHG_METVO | 1.12.99.1 | COENZYME F420 HYDROGENASE GAMMA SUBUNIT | MBHS_AZOCH | 1.18.99.1 | UPTAKE HYDROGENASE SMALL SUBUNIT PRECURSOR
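
One way to see how far apart the two functions in each row of part A are is to find the first EC
component at which the two numbers differ. The helper below is a hypothetical illustration; how EC
numbers were actually compared for this table is not stated here. A "-" component is treated as
unspecified, which is an assumption of this sketch.

```python
def ec_divergence_level(ec1, ec2):
    """Return the 1-based position of the first differing EC component.

    Returns None if no difference is found before the numbers end or before
    an unspecified ("-") component is reached.
    """
    parts1 = ec1.strip().split(".")
    parts2 = ec2.strip().split(".")
    for level, (a, b) in enumerate(zip(parts1, parts2), start=1):
        if "-" in (a, b):
            return None  # unspecified component: cannot compare further
        if a != b:
            return level
    return None


# EC number pairs taken from part A of Table 4.
pairs = [
    ("4.2.99.18", "3.2.2.-"),   # d2abk__
    ("1.3.1.28", "1.1.1.1"),    # d1hdca_
    ("1.12.99.1", "1.18.99.1"), # d1frva_
]
for ec1, ec2 in pairs:
    print(ec1, ec2, ec_divergence_level(ec1, ec2))
```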


B. Enzyme and Non-Enzyme

SCOP domain | Fold number | Swissprot 1 | Enzymatic function | EC number | Swissprot 2 | Non-enzymatic function
d1gsq_1 | 1.001.034.001.001.007 | GTS2_MANSE | GLUTATHIONE S-TRANSFERASE 2 | 2.5.1.18 | SC11_OMMSL | S-CRYSTALLIN SL11 (MAJOR LENS POLYPEPTIDE)
d1lcl__ | 1.002.018.001.003.003 | LPPL_HUMAN | EOSINOPHIL LYSOPHOSPHOLIPASE | 3.1.1.5 | LEG7_RAT | GALECTIN-7
d1brbe_ | 1.002.029.001.002.003 | CFAD_RAT | ENDOGENOUS VASCULAR ELASTASE | 3.4.21.46 | CAP7_HUMAN | AZUROCIDIN (ANTIMICROBIAL, HEPARIN-BINDING PROTEIN)
d1mup__ .. | 1.002.039.001.001.007 | PGHD_HUMAN | PROSTAGLANDIN-D SYNTHASE | 5.3.99.2 | LACC_CANFA | BETA-LACTOGLOBULIN III
..d1mup__ | 1.002.039.001.001.007 | | | | QSP_CHICK | QUIESCENCE-SPECIFIC PROTEIN
d2hhma_ .. | 1.005.007.001.002.001 | MYOP_XENLA | INOSITOL MONOPHOSPHATASE | 3.1.3.25 | SUHB_ECOLI | EXTRAGENIC SUPPRESSOR PROTEIN SUHB
..d2hhma_ | 1.005.007.001.002.001 | STRO_STRGR | DTDP-GLUCOSE SYNTHASE | 2.7.7.24 | |
d1isua_ | 1.007.029.001.001.001 | IRO_THIFE | IRON OXIDASE PRECURSOR (FE(II) OXIDASE) | 1.16.3.- | HPIT_RHOTE | HIGH POTENTIAL IRON-SULFUR PROTEIN (HIPIP)


Figures
Figure 1, Specific Example of Convergent and Divergent Evolution
TOP shows an example of convergent evolution: the structures of two carbonic anhydrases with the 
same enzymatic function (EC number 4.2.1.1) but different folds. Drawn with Molscript 
(Kraulis, 1991) from 1THJ (left-handed beta helix) and 1DMX (flat beta sheet). BOTTOM shows 
an example of possible divergent evolution, the TIM barrel. This fold functions as a generic 
scaffold catalyzing 15 different enzymatic functions. A schematic figure of the TIM barrel fold is 
shown, with numbers in boxes indicating the different locations of the active site in four proteins 
that have this fold. These four proteins (xylose isomerase, aldose reductase, enolase, and adenosine 
deaminase) carry out very different enzymatic functions, in four of the main EC classes (1.*.*, 
3.*.*, 4.*.*, and 5.*.*). They have active sites at very different locations in the barrel, yet they 
all share the same fold. 

Insert Figure 1

Figure 2, Overview

Overview of all the single-domain matches between proteins in Swissprot 35 and domains in SCOP 
1.35. Sequences were compared with BLAST using the match criteria described in the methods. 
The matches are clustered into 92 functions (based on 3-component EC numbers), which are 
arranged in rows, and 229 folds (based on SCOP fold numbers), which are arranged in 
columns. The first row indicates the matches with non-enzymes. There are thus 21068 (= 92 x 229) 
possible combinations shown in the figure. Only 331 of these are actually observed; they are 
indicated by filled-in black squares. 
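
The clustering of matches into grid cells can be sketched as follows. The helper names are
hypothetical, and two choices are assumptions of this sketch: a fold is identified by the first two
components of a SCOP number written as in Table 3 (e.g. 3.001.001 gives 3.001), and non-enzymes are
given the artificial EC number 0.0.0 used in Figure 5. The sample matches are a few rows taken from
Table 3, not the full data set.

```python
def function_key(ec_number):
    """First three EC components; non-enzymes (None) get the artificial 0.0.0."""
    if ec_number is None:
        return "0.0.0"
    return ".".join(ec_number.split(".")[:3])


def fold_key(scop_number):
    """ASSUMPTION: the first two components of the SCOP number identify the fold."""
    return ".".join(scop_number.split(".")[:2])


def filled_cells(matches):
    """Collapse (EC number, SCOP number) matches into distinct (function, fold) cells."""
    return {(function_key(ec), fold_key(scop)) for ec, scop in matches}


# A few matches from Table 3 (EC number, SCOP fold number).
matches = [
    ("3.2.1.4", "1.061.001"), ("3.2.1.4", "3.001.001"),
    ("3.2.1.8", "2.018.001"), ("3.2.1.8", "3.001.001"),
    ("4.2.1.1", "2.053.001"), ("4.2.1.1", "2.047.001"),
]
cells = filled_cells(matches)
n_functions = len({function for function, _ in cells})
n_folds = len({fold for _, fold in cells})
occupancy = len(cells) / (n_functions * n_folds)
print(len(cells), n_functions, n_folds, occupancy)
```

For the full figure, the same ratio is 331 / 21068, i.e. about 1.6% of the possible cells are
filled.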

Insert Figure 2


Figure 3, Chart with Breakdown among Structure-Function Classes in 2 
Genomes 
Charts and tables showing the number of folds in each fold class associated with only enzymatic 
(ENZ), only non-enzymatic (nonENZ), and both enzymatic and non-enzymatic functions (Both). 
The results are shown for all of Swissprot (part A), for just the yeast genome (part B), and for just 
the E. coli genome (part C). The results for individual domains in a minimum set of SCOP domains 
also support these tendencies (data not shown). The numbers in part B are not based on the 
PSI-BLAST protocol used for Figure 4. Rather, they are found just as “subsets” of the overall 
Swissprot results, to make them readily comparable with the rest of the paper. Because of this, the 
numbers in this figure will not match exactly those in Figure 4; the difference arises from the 
greater number of fold-function combinations found by PSI-BLAST as compared to WU-BLAST.
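
The ENZ / nonENZ / Both breakdown described above can be sketched as a simple tally over
fold-function combinations. This is an illustrative sketch, not the procedure used for the figure:
the function name is hypothetical, non-enzymes are assumed to carry the artificial function 0.0.0,
the fold class is assumed to be the first component of the fold number, and the sample data are
made up.

```python
from collections import Counter


def fold_breakdown(cells):
    """Count folds per (fold class, label), with label in {ENZ, nonENZ, Both}.

    cells: iterable of (function, fold) pairs, where function is "0.0.0"
    for non-enzymes and fold is a number such as "3.001".
    """
    kinds = {}
    for function, fold in cells:
        kind = "nonENZ" if function == "0.0.0" else "ENZ"
        kinds.setdefault(fold, set()).add(kind)
    counts = Counter()
    for fold, seen in kinds.items():
        label = "Both" if len(seen) == 2 else next(iter(seen))
        fold_class = fold.split(".")[0]  # ASSUMPTION: first component is the class
        counts[(fold_class, label)] += 1
    return counts


# Hypothetical combinations, for illustration only.
example = [
    ("0.0.0", "1.001"), ("1.1.1", "1.001"),
    ("3.2.1", "3.001"), ("3.5.2", "3.001"),
    ("0.0.0", "2.001"),
]
print(fold_breakdown(example))
```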

Insert Figure 3 (parts A. All of Swissprot, B. Yeast, C. E. coli)

Figure 4, Structure-function Classes in the Yeast Genome Analyzed 
Through a Variety of Classification Schemes
This figure shows the distribution of fold-function combinations in the yeast genome as analyzed by 
a variety of different structural and functional classifications. Each of the subfigures is a cross 
tabulation of one structural classification scheme (on the column heads) versus a functional 
classification (row heads). Part A shows SCOP versus ENZYME; Part B, CATH vs. ENZYME; 
Part C, SCOP vs. COGs; Part D, SCOP vs. Most Conserved COGs; Part E, SCOP vs. MIPS 
Functional Catalogue. Each of the grid boxes gives the number of fold-function combinations 
within a structure-function class. This number is expressed as a percentage of the total number of 
combinations in the diagram to make the graphs readily comparable. The total number of 
combinations in each of the subfigures is 141 (A), 77 (B), 1207 (C), 120 (D), and 66 (E). Some 
notes on the subfigures: Part A is directly comparable with the cross tabulation in Table 2B for all 
of Swissprot. In Parts D and E, we employ the COGs scheme in exactly the same fashion as we did the 
ENZYME classification. We form combinations between individual yeast COGs and SCOP folds 
(e.g. COG 0186 with fold 2.26) and then we place these combinations into larger structure-function 
classes. The COGs overall functional classes are denoted by a single letter and are in turn 
grouped into three broader areas (so, for instance, the 0186-2.26 pair would go into the 
structure-function class all-beta, J). We proceed similarly for the MIPS yeast functional catalogue. 
This gives each function a two- or three-component number similar to an EC number (e.g. 07.20.3 or 
06.2). We use the first two numbers to create combinations with SCOP folds and then use the top 
number to create the functional classes shown in the diagram. For Part E we use just the 110 COGs 
that are present in all 8 genomes in the current COGs analysis (E. coli, H. influenzae, H. pylori, M. 
genitalium, M. pneumoniae, Synechocystis, M. jannaschii, yeast).
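
The MIPS treatment described above (first two numbers to form combinations, top number to form
classes, counts expressed as percentages of all combinations) can be sketched like this. The
function names are hypothetical, the structural class is assumed to be the first component of the
fold number, and the pairings in the example are made up; only the identifiers 07.20.3, 06.2 and
2.26 come from the caption.

```python
from collections import Counter


def mips_pair_key(mips_number):
    """First two components, used to form combinations: '07.20.3' -> '07.20'."""
    return ".".join(mips_number.split(".")[:2])


def class_percentages(pairs):
    """Percentage of distinct combinations falling in each structure-function class.

    pairs: iterable of (MIPS number, SCOP fold such as '2.26').
    Returns {(fold class, top-level MIPS number): percent of all combinations}.
    """
    combos = {(mips_pair_key(mips), fold) for mips, fold in pairs}
    if not combos:
        return {}
    counts = Counter(
        (fold.split(".")[0], key.split(".")[0]) for key, fold in combos
    )
    return {cell: 100.0 * n / len(combos) for cell, n in counts.items()}


# Hypothetical pairings, for illustration only.
pairs = [("07.20.3", "2.26"), ("07.20.1", "2.26"), ("06.2", "3.1"), ("07.20.3", "3.1")]
print(class_percentages(pairs))
```

In this toy input the two 07.20.x entries on fold 2.26 collapse into a single combination, so three
combinations remain and each class holds one third of them.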



Insert Figure 4 (parts A to E, with enlargements of each part)

Figure 5, The Most Versatile Folds 
The functions associated with the 16 most versatile folds are shown. Values in the table denote the 
number of matches between a particular fold type in pdb95d (designated by its fold number in 
SCOP 1.35) and an enzyme category (represented by the first three components of the respective 
EC numbers). Here and in the following tables the same parameters were used for matching as in 
Figure 2. The numbers in the top row indicate the number of functions a particular fold is 
associated with. The identifiers above the fold numbers are either PDB or SCOP identifiers of 
representative structures (the latter only if the PDB entry contains more than one domain or chain). 
(See the caption to Table 3 for the syntax of SCOP identifiers.) The first row in the table with the 
artificial 0.0.0 EC number shows the number of matches with non-enzymatic functions. Of the 
two all-alpha folds in the table, Cytochrome P450 (1.063) is exclusively enzymatic, associated with 
five different enzyme functions, all related to Cytochrome P450. Only one alpha+beta fold, 
Ferredoxin (4.031), is present in the table; it matches predominantly non-enzymatic 
ferredoxins, but also enzymes in four different enzyme classes. In the multi-domain class, 
Beta-Lactamase/D-ala carboxypeptidase (5.003) has the most matches with penicillinase (EC 
number 3.5.2) and only one match with a non-enzyme, which also binds penicillin but has no 
enzymatic activity (Coque et al., 1993). The class of small domains is represented by only one 
fold, membrane-bound rubredoxin-like (7.035), which has matches only with enzymes. It is possible 
that some proteins classified as “non-enzymes” are in fact enzymes that lack the corresponding 
EC number. In this case, our analysis may be useful in pointing to which non-enzymes 
may actually be enzymes.

Insert Figure 5

Figure 6, The Most Versatile Functions 

Values in the table denote the number of matches between a particular enzyme category (designated 
by the first 3 components of its EC number) and a SCOP 1.35 fold (designated by its fold 
number). This figure follows the same conventions as Figure 4. The rows are arranged in 
decreasing order of the number of different folds with which they are associated 
(numbers shown in the first column). A hash (“#”) in any cell indicates that its value is greater than 
10.
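
The row ordering and the hash convention can be sketched as follows. The function names and the
match counts in the example are hypothetical; only the rule "a value greater than 10 is shown as #"
and the sort by number of associated folds come from the caption.

```python
def render_cell(count):
    """Show counts above 10 as '#', zero as an empty cell, otherwise the number."""
    if count == 0:
        return ""
    return "#" if count > 10 else str(count)


def versatile_functions(match_counts):
    """Order functions by the number of distinct folds they match, descending.

    match_counts: {function: {fold: number of matches}}.
    Returns a list of (number of folds, function, {fold: rendered cell}).
    """
    rows = []
    for function, per_fold in match_counts.items():
        folds = {fold: n for fold, n in per_fold.items() if n > 0}
        cells = {fold: render_cell(n) for fold, n in folds.items()}
        rows.append((len(folds), function, cells))
    rows.sort(key=lambda row: (-row[0], row[1]))
    return rows


# Hypothetical match counts, for illustration only.
counts = {
    "1.1.1": {"3.018": 25},
    "3.2.1": {"3.001": 12, "2.018": 4, "1.061": 1},
    "4.2.1": {"2.053": 1, "2.047": 10},
}
for n_folds, function, cells in versatile_functions(counts):
    print(n_folds, function, cells)
```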

Insert Revised Figure 6


Figure 7, Multi-functionality versus e-value threshold 
The graph shows how the percentage of multifunctional enzymatic domains varies as a 
function of the e-value threshold. A multi-functional domain occurs when a particular domain in 
SCOP matches domains in Swissprot with different enzymatic functions. For these calculations, we 
had to use a more minimal version of SCOP than the pdb95d dataset referred to in the methods to 
prevent double matches, i.e. two SCOP domains matching a single Swissprot domain. The 
construction of this minimal SCOP was described previously (Gerstein, 1998). Basically, all the 
domains in SCOP were clustered via a multi-linkage approach into 990 representative domains, 
such that no two domains matched each other with a FastA e-value better than 0.01.
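
The quantity plotted in Figure 7 can be sketched as below. This is a minimal illustration under
stated assumptions, not the calculation used for the figure: the function name and the sample
matches are hypothetical, functions are compared as whole EC strings, and the denominator is taken
to be the domains with at least one enzymatic match at or below the threshold.

```python
def percent_multifunctional(matches, threshold):
    """Percentage of matched domains that hit more than one enzymatic function.

    matches: iterable of (SCOP domain, EC number, e-value).
    Only matches with e-value <= threshold are counted.
    """
    functions = {}
    for domain, ec_number, e_value in matches:
        if e_value <= threshold:
            functions.setdefault(domain, set()).add(ec_number)
    if not functions:
        return 0.0
    multi = sum(1 for ecs in functions.values() if len(ecs) > 1)
    return 100.0 * multi / len(functions)


# Hypothetical matches, for illustration only.
matches = [
    ("d1", "1.1.1.1", 1e-50), ("d1", "1.3.1.28", 1e-8),
    ("d2", "3.2.1.4", 1e-30), ("d2", "3.2.1.4", 1e-12),
]
for threshold in (1e-100, 1e-20, 1e-5):
    print(threshold, percent_multifunctional(matches, threshold))
```

In this toy input, loosening the threshold from 1e-20 to 1e-5 admits the weaker match of d1 to a
second function, so the percentage rises from 0 to 50.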