Current Opinion in Structural Biology Vol. 9, No. 3, June 1999 Advances in structural genomics [Review article] Sarah A Teichmann, Cyrus Chothia, Mark Gerstein Current Opinion in Structural Biology 1999, 9:390-399.
The purpose of structural genomics can be defined as the assignment of three-dimensional structures to the protein products of genomes (proteomes) and the investigation of the biological implications of these assignments. If the structure assigned to a new protein is homologous to one already known, it provides an indication of its probable function and evolutionary relationships. If structures can be assigned to all or to a significant fraction of the products of a whole genome, it will provide a much better understanding of the evolution and physiology of an organism.
The assignment of structures to proteomes can be carried out on two levels: experimental and computational. The experimental level involves the directed, large-scale determination of protein structures using NMR spectroscopy or X-ray crystallography [1] [2] [3]. The computational level involves the assignment of structures to proteins using calculations that mostly involve demonstrating homology to proteins of known structure. Here, we review the recent advances made at the computational level. The first part of the review deals with methods that have been used to assign structures to genome products. The second part reviews the biological implications of this work. Throughout, we pay particular attention to work related to the genome of Mycoplasma genitalium (MG), the second genome to be sequenced [4]. With only 479 proteins, this genome has emerged as the initial focus and benchmark for computational investigations in structural genomics.
Three classes of computational methods are used to assign structures to genome sequences: the detection of distant homologies (this usually involves pairwise or multiple sequence comparisons); fold recognition (which tries to determine whether the sequence of a new protein fits a fold that is close to that of a known structure); and predictions based on statistical rules derived from structures. (These are used to predict secondary structure, transmembrane [TM] helices and coiled-coils.)
Not long after the first few complete bacterial genome sequences were published, their sequences were matched to the sequences of proteins of known structure (PDB sequences) using pairwise sequence comparison methods, such as FASTA [5], Smith-Waterman [6] and BLAST [7]. Sometimes these comparisons took the whole sequence of the known structure as the 'query', but the sequences of individual structural domains are more useful, as these correspond to the functional and evolutionary units of proteins. The domains that have been used in the assignment of structures to genome sequences are those described in the SCOP [8] and CATH [9] databases.
Pairwise matches between genome sequences and known structures form part of genome analysis systems such as GeneQuiz [10], PEDANT [11] [12] and GeneCensus [13] [14] [15]. Early calculations matched between 8 and 12% of the proteins from different genomes to known structures [10] [11] [12] [13]. The rapid increase in the number of known structures has, for more recent calculations, increased these proportions to between 11 and 20% [15]. For MG, these matched sequences comprise about 16% of the residues in the proteome.
Pairwise sequence comparisons detect only about half of the evolutionary relationships between proteins with 20-30% sequence identity, however, and, for related proteins with less than 20% identity, the proportion is much smaller [16] [17]. There are many protein families whose members diverge to the point at which they have sequence identities well below 30% and, consequently, many homologous relationships between known structures and genome sequences cannot be detected by pairwise comparison.
In order to try to overcome the limitations of pairwise comparisons, search procedures based on the shared characteristics of sets of related sequences have been developed. One of the most widely used of these procedures is the PSI-BLAST (Position-Specific Iterated Basic Local Alignment and Search Tool) program [18]. This program does an initial, gapped BLAST search to collect close homologues of the query in a sequence database and then builds a profile of the query sequence and its close homologues. The profile is then matched to the database and more homologues are collected. These new homologues are added to the profile and another search is carried out. This process can be iterated as many times as the user specifies or until no more homologues are found.
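The iteration described above can be sketched as a simple loop. This is only a toy illustration of the collect, rebuild and search-again cycle, not the PSI-BLAST implementation: the function `search` is a hypothetical stand-in for one profile-against-database pass, and the toy database, sequence names and `max_iterations` default are assumptions made for the example.

```python
from typing import Callable, Dict, Set


def iterative_search(
    query: str,
    search: Callable[[Set[str]], Set[str]],
    max_iterations: int = 20,
) -> Set[str]:
    """Collect homologues by repeated searching until no new ones appear.

    `search` is a hypothetical stand-in for one profile-vs-database pass:
    it receives the current profile members and returns the database
    sequences that the profile matches.
    """
    members = {query}
    for _ in range(max_iterations):
        hits = search(members)
        new_hits = hits - members
        if not new_hits:      # converged: no more homologues are found
            break
        members |= new_hits   # add the new homologues to the profile
    return members


# Toy database (assumed data): each id lists the ids it detectably resembles.
TOY_NEIGHBOURS: Dict[str, Set[str]] = {
    "q": {"a"},
    "a": {"q", "b"},
    "b": {"a", "c"},
    "c": {"b"},
    "x": {"y"},
    "y": {"x"},
}


def toy_search(members: Set[str]) -> Set[str]:
    """Return everything detectably similar to any current profile member."""
    hits: Set[str] = set()
    for member in members:
        hits |= TOY_NEIGHBOURS.get(member, set())
    return hits


found = iterative_search("q", toy_search)
one_pass = iterative_search("q", toy_search, max_iterations=1)
```

In this toy case, a single pass from "q" reaches only "a", whereas iterating also collects the more distant "b" and "c"; the unrelated pair "x" and "y" is never reached.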
A quantitative assessment of PSI-BLAST [17] showed that, for evolutionary relationships among proteins whose sequence identities are less than 30%, it can detect three times as many relationships as pairwise comparisons. Of course, PSI-BLAST is significantly more computationally demanding and complicated to use than pairwise comparison methods. On a current work station (500 MHz DEC alpha), building up a PSI-BLAST profile can take between 1 and 30 min and can consume a considerable amount of disk space.
This past year, PSI-BLAST has been used by three groups to assign structures to genome sequences. All these attempts included detailed assignments for the MG genome. We summarise the work below, quoting, where appropriate, updated results from web sites, rather than the original data from the papers. There were differences in both the parameters that were used in the three calculations and the manner in which they were carried out [19] [20] [21]. The performance of PSI-BLAST is affected both by such differences [17] and by the particular database from which the homologues are collected. Together, these factors account for most of the variations in the number of matches to MG sequences that were made by the different calculations.
Huynen et al. [19] were the first to use PSI-BLAST to match PDB sequences to MG proteins. Both sets of sequences were preprocessed to remove regions of low complexity (LC), TM helices, coiled-coils and cysteine-rich proteins, as these readily give false matches. They found that 184 regions in 172 MG sequences (37%) matched PDB sequences, with different regions of 12 of the MG sequences matched by two different PDB sequences. Overall, these matches cover 23% of all the MG residues.
Teichmann et al. [20] also used PSI-BLAST to match PDB domain sequences to MG proteins. They did this comparison in a two-way fashion, first searching using PDB domain sequences as queries and MG sequences as targets embedded in a large, nonredundant sequence database (NRDB90 [22]) and then using MG sequences as queries and PDB sequences as targets embedded in NRDB90. In the most recent version of this calculation, PSI-BLAST matched 314 PDB domains to all or part of 223 MG proteins (47%) (http://www.mrc-lmb.cam.ac.uk/genomes/MG/). Sixty-four of the matched MG proteins had different regions matched by between two and five PDB domain sequences. Overall, the matched regions cover 33% of all the MG residues.
Wolf et al. [21] used PSI-BLAST to assign structures to the genomes of MG, ten other prokaryotes, Saccharomyces cerevisiae and C. elegans. These calculations matched PDB domain sequences to all or part of 181 (39%) of the MG sequences, to 19-34% of sequences in the other prokaryotes, to 24% of sequences in S. cerevisiae and to 21% of sequences in C. elegans. On average, 11% of the matched proteome sequences had between two and five PDB domain sequences matching different regions.
The BASIC procedure, developed by Rychlewski and co-workers [23], provides a further refinement to the PSI-BLAST approach. Homologues are collected by PSI-BLAST for query and target sequences and profiles created for both sets of sequences. Relationships are then detected by profile-profile matching. Rychlewski et al. [24] used this procedure to match 1151 representative PDB sequences to the MG proteome. Using this method, 139 (29%) MG proteins were matched. (These are updated results subsequent to publication.)
Sanchez and Sali [25] used a pairwise comparison to find matches between sequences of the yeast S. cerevisiae [26] and 1151 representative PDB sequences. All or part of 2256 (36%) S. cerevisiae sequences matched a PDB sequence. Of these sequences, 1071 had a good enough match for a detailed three-dimensional model to be built. For the other matched sequences, the divergence of structure, which occurs commonly for more distantly related proteins, only allows the construction of outline models.
Threading procedures cover a variety of techniques that try to determine whether the sequence of a protein with an unknown structure is compatible with that of a known structure [27] [28] [29]. The first detailed assignment of structures to the MG proteome used one such technique, the fold prediction method of Fischer and Eisenberg [30]. With this method, the compatibility of the query sequence with each of the folds in a library of known structures is determined by both its predicted secondary structure and its sequence characteristics, as given by a matrix of residue similarities. The query sequence can be used by itself or with homologues. The most recent use of this method matched PDB sequences to 160 MG sequences, of which 75 could also be found using pairwise comparisons.
Grandori [31] used the threading program ProFIT [28] to match PDB sequences to M. pneumoniae sequences that are shorter than 200 residues. Matches were found for 12 genome sequences that could not be found using pairwise comparisons.
Secondary structure predictions were carried out on five genomes [12] using PREDATOR [32] and on eight genomes [13] [15] using GOR [33]. One of the more interesting results to emerge from these calculations was that the genomes have a similar overall composition in terms of secondary structure, although they have very different amino acid compositions. This was unexpected in light of the well-known and markedly different secondary-structure propensities of individual amino acids.
Several groups have carried out calculations to determine the occurrence of membrane proteins in genome sequences [13] [14] [34] [35] [36] [37] [38] [39] [40]. The overall number of membrane proteins found depends somewhat on the prediction method and threshold used. Nevertheless, there seems to be broad agreement that all or part of 20-30% of the proteins in microbial genomes are membrane proteins. Membrane protein structures can be classified by how many TM helices they have. In all the surveys, the occurrence of membrane proteins with a given number of TM helices falls off rapidly as the number of helices increases; thus, only a small fraction of membrane proteins have large numbers of TM helices.
As described in the previous section, a number of groups have used pairwise sequence comparison, PSI-BLAST, profiles or threading to match sequences of proteins with known structure to sequences from the genome of M. genitalium [10] [12] [14] [15] [19] [20] [21] [24] [30] [41] (see Table 1). Most of these groups have made their published results available on the web and their sites often carry 'updated results' that have been obtained subsequent to the original publications (see Fig. 1 for details).
Table 1. A comparison of different calculations of the number of MG proteins that are homologous, all or in part, to a PDB sequence*
*This table shows a comparison of the updated assignments of structures to MG sequences made by Teichmann et al. (T) [20], Wolf et al. (W) [21], Huynen et al. (H) [19], Fischer and Eisenberg (F) [30] and Rychlewski et al. (R) [24]. For comparison, a representative result of FASTA assignments is listed as well (Gerstein [G] [13]). The comparisons are based just on common ORFs (open reading frames). All the comparisons are based on the original TIGR (The Institute for Genomic Research) MG ORF file [4], which contained 468 genes, rather than the most current one, which contains 479 genes. Note that the W matches are based on some alternative gene definitions and so have ORF matches that do not correspond to either of the TIGR ORF files. We provide a more detailed comparison table on the Web (via http://bioinfo.mbb.yale.edu/genome/MG or http://www.mrc-lmb.cam.ac.uk/genomes/MG). In addition, many of the matches are collected together in the PRESAGE database [41].
Figure 1 A pie chart showing the current status of the structural annotation of the MG genome (as of January 1999). The different parts of the pie chart are described in detail in Table 2. For each of the representative PSI-BLAST calculations, we used the results described as 'two-way PSI-BLAST'. These are updated versions of those results described previously [20]. All of the calculations for the pie chart were based on the original TIGR MG ORF file [4], which contained 468 genes, rather than the most current one, which contains 479 genes. This was to enable us to merge other annotations, which were often based on the earlier ORF file. The current status of the level of annotation of MG is available from http://bioinfo.mbb.yale.edu/genome/MG and http://www.mrc-lmb.cam.ac.uk/genomes/MG. These web sites report data for both the current 479 gene ORF file and the original 468 gene file.
Table 2. Description of methods used to determine annotation.
Overall, the results produced up to early 1999 by the different groups show a high degree of consistency, as shown in Table 1. When matches to an MG protein are made by more than one group, as is commonly the case, they are matched to the same PDB sequence or to a homologue in the large majority of cases. On a more detailed level, of the 352 SCOP domains assigned to MG proteins, 88 are assigned by one group only and only 12 (3.5%) are assigned different superfamilies by the different groups. Small differences in the lengths of the matched regions are also common. (The details of these assignments can be found at the web sites cited elsewhere in the text.) Examining the union of the matches made by many of the different groups, we find that more than half (242) of the MG sequences are matched, all or in part, by a PDB sequence and these matches cover more than 40% of all the MG residues (see Fig. 1).
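The residue coverage of a union of matches can be computed by merging overlapping matched regions before counting residues, so that a stretch matched by several groups is counted once. The sketch below illustrates this under stated assumptions: the ORF names, lengths and coordinates are invented toy data, and regions are taken to be 1-based, inclusive (start, end) pairs. It is not the procedure used by any of the groups cited.

```python
from typing import Dict, List, Tuple

Region = Tuple[int, int]  # assumed convention: 1-based, inclusive (start, end)


def merge_regions(regions: List[Region]) -> List[Region]:
    """Merge overlapping or directly adjacent regions on one sequence."""
    merged: List[Region] = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or abuts the previous region: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def residue_coverage(
    lengths: Dict[str, int], matches: Dict[str, List[Region]]
) -> float:
    """Fraction of all residues covered by the union of matched regions."""
    total = sum(lengths.values())
    covered = 0
    for regions in matches.values():
        for start, end in merge_regions(regions):
            covered += end - start + 1
    return covered / total


# Toy proteome (hypothetical ORFs): matches pooled from several calculations.
toy_lengths = {"orf1": 100, "orf2": 200, "orf3": 100}
toy_matches = {
    "orf1": [(1, 50), (40, 80)],   # two groups, overlapping regions
    "orf2": [(101, 200)],          # one group
}
coverage = residue_coverage(toy_lengths, toy_matches)
matched_fraction = len(toy_matches) / len(toy_lengths)
```

Here the two overlapping matches on "orf1" merge into residues 1 to 80, so 180 of the 400 toy residues are covered, while two of the three sequences count as matched. The distinction mirrors the two figures quoted in the text: the proportion of sequences matched and the proportion of residues covered.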
In addition to the regions matched to PDB sequences, about 79 MG sequences have the characteristics of integral membrane proteins and about 65 have long, nonglobular regions. This results in a total of about two-thirds of the MG sequences having some structural annotation (Fig. 1) (some of the assignments are to different regions of the same protein).
The complete structural characterisation of the MG sequences will not be achieved in the near future if structures continue to be solved in an untargeted fashion. This is shown by Fig. 2, a graph of how the structures homologous to MG sequences have been determined over the past 25 years, that is, the development of MG structural annotation over time. Experimental structural genomics projects, many of which have started recently, will target regions of the proteome that have neither a matching PDB sequence nor a different type of structural annotation (LC, TM and so on). Thus, they will increase the gradient of the graph in Fig. 2 such that genomes should soon be almost completely structurally annotated. To be optimally efficient about the target choice for experimental structural genomics projects, the uncharacterised regions must be clustered at the sequence level. A list of the uncharacterised regions of MG and their sequence clusters will be made available on the web (http://www.mrc-lmb.cam.ac.uk/genomes/MG or http://bioinfo.mbb.yale.edu/genome/MG).
Figure 2 A time-line showing how MG structural annotation is changing each year. The panel shows how the fraction of residues in the MG genome that have been characterised increases each year with the addition of new structures to the PDB, imagining that the complete sequences of MG and the current sequence-matching techniques (e.g. PSI-BLAST) were known a quarter of a century ago. In particular, the time-line shows how the black 'PDB match' section changes over time. This time-line is based on exactly the same 'sequence masking' methodology discussed previously [15] [53]. In contrast to the previous analysis [53], however, we come to a somewhat more optimistic conclusion that a large fraction of a genome can be structurally annotated. There are two reasons for this difference. Firstly, we focus on one small genome, rather than on an average of all the known genomes and, secondly, we use the union of all the known structural matches made from advanced methods (e.g. PSI-BLAST), rather than the matches generated by one rather conservative method (FASTA).
Many of the results for sequences of MG that are discussed in the following sections of the review are based on the updated, two-way PSI-BLAST assignments of 314 domains to 223 MG proteins [20].
Small proteins and most medium-size proteins contain a single domain. Large proteins comprise two or more domains, of which the large majority are known to undergo independent duplications and recombinations [8] [42] [43]. The average size of the domains in proteins of known structure is about 175 residues.
The distributions of the lengths of sequences in bacterial and archaeal genomes have been found to follow very similar extreme-value distributions [15]. The most common length, 190 residues, is roughly the length of a single PDB domain and the average length is approximately 280 residues in archaea and 330 residues in eubacteria. The distributions of sequence lengths in S. cerevisiae and C. elegans are similar to those in prokaryotes, but there is a greater preponderance of long sequences. This results in larger average lengths (465 for yeast and 425 for the worm). These results imply that a significant fraction of the proteins produced by genomes contain two or more domains.
Based on the various types of structural annotation shown in Fig. 1 and Table 2, it is possible to roughly estimate the number of soluble protein domains in MG. The two-way PSI-BLAST calculation shows that 223 (47%) MG sequences are matched, all or in part, by a PDB sequence. Of these, 83 MG sequences were completely matched by a single, known structural domain and 39 by between two and five domains [20]. Another 101 sequences were matched to between one and four domains and had unmatched regions that are long enough for at least one additional domain to be present. These figures show that, for the MG sequences matched by PDB sequences, close to one-third of the sequences contain one domain and two-thirds have two or more domains.
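The proportions quoted above follow directly from the three counts given in the text. The short check below only restates that arithmetic; the variable names are ours, and it assumes, as the text does, that every partly matched sequence with a long unmatched region holds at least one further domain.

```python
# Counts quoted in the text for the two-way PSI-BLAST calculation [20].
single_domain = 83     # completely matched by a single known domain
multi_matched = 39     # completely matched by between two and five domains
partly_matched = 101   # matched by one to four domains, with room for more

matched_total = single_domain + multi_matched + partly_matched

# Assumption from the text: partly matched sequences contain two or more domains.
one_domain_fraction = single_domain / matched_total
multi_domain_fraction = (multi_matched + partly_matched) / matched_total

print(f"{matched_total} matched sequences")
print(f"one domain: {one_domain_fraction:.0%}, "
      f"two or more: {multi_domain_fraction:.0%}")
```

The three categories sum to the 223 matched sequences, and the split comes out at roughly 37% and 63%, which is the "close to one-third" and "two-thirds" of the text.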
So, according to this calculation, 314 PDB domains match all or part of a total of 223 MG proteins. These matches cover 33% of the MG proteome. Excluding the well-characterised TM, LC and linker regions, as well as the 314 PDB domains, we are left with regions that, presumably, code for soluble proteins with globular structures, but that do not have a known fold. As indicated in Fig. 1, these 'uncharacterised regions' currently comprise about 40% of the MG genome (by residue). They are formed from about 270 whole or partial sequences. Of these, about three-quarters contain less than 200 amino acids and most of these probably have a single domain. Of the remainder, most probably have multiple domains. After putting the results of the known PDB matches and the uncharacterised regions together, one comes up with a very rough estimate of 700 soluble, globular domains in MG, of which about 200 form single-domain proteins.
If the total number of families to which most proteins belong is small [44], high levels of domain duplication would be expected in the genome sequences. Pairwise comparison of genome sequences and the clustering of matched sequences into families indicated that the proportion of sequences that have arisen by gene duplication is between a quarter (in small bacterial genomes) and half (in large bacterial genomes) [45] [46] [47] [48]. The sequences in individual bacterial genomes, however, have relatively few pairs with residue identities that are greater than 30% (see, for example, [20]). This means that duplication rates based on pairwise sequence comparison must be underestimates. For distantly related proteins, the detection of evolutionary relationships requires structural, functional and sequence information, which is only available for proteins whose structures have been solved.
In the two-way PSI-BLAST assignment of structures to MG proteins, 223 were matched, all or in part, by 314 PDB domain sequences. Inspection of the superfamily assignments given to the PDB domains in SCOP shows that they belong to 124 different superfamilies. Eighty-two MG sequence regions are unique representatives of their superfamily and 232 sequences belong to one of 42 superfamilies, with each having between 2 and 60 homologues. Thus, the proportion of these MG sequences that has arisen by gene duplication is (314 - 82 - 42)/314, that is, 60%. This proportion is more than twice that found from pairwise sequence comparisons [20].
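The duplication estimate can be read as follows: each superfamily is taken to contribute one original domain, and every further member is counted as the product of a duplication. Subtracting the 82 singletons and one founder for each of the 42 multi-member superfamilies from the 314 domains leaves the duplicated ones. The function below is a minimal sketch of that arithmetic; its name and parameter names are ours, not the authors'.

```python
def duplication_fraction(
    total_domains: int,
    singleton_domains: int,
    multi_member_superfamilies: int,
) -> float:
    """Fraction of domains attributed to duplication.

    One founding domain is subtracted per superfamily: each singleton is
    its own founder, and each multi-member superfamily has one founder.
    """
    duplicated = total_domains - singleton_domains - multi_member_superfamilies
    return duplicated / total_domains


# Figures quoted in the text for MG.
total, singletons, multi_superfamilies = 314, 82, 42
members_in_multi = 232

# Consistency checks on the quoted counts.
assert singletons + members_in_multi == total        # 82 + 232 = 314 domains
assert singletons + multi_superfamilies == 124       # 124 superfamilies

mg_fraction = duplication_fraction(total, singletons, multi_superfamilies)
print(f"{mg_fraction:.3f}")
```

For MG this gives 190/314, or about 0.605, which the text reports as 60%.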
Using PSI-BLAST, the sequences of proteins of known structure can be matched to 30 and 27% of the protein sequences in S. cerevisiae and C. elegans, respectively ([21]; SA Teichmann, C Chothia, unpublished data). These matched regions form, respectively, 18% and 15% of the amino acids in the two genomes. Carrying out calculations similar to those described above shows that the proportion of domains produced by gene duplications in matched regions is 88% for S. cerevisiae and 95% for C. elegans.
Proteins have evolutionary and structural relationships. Proteins with evolutionary relationships are descended from a common ancestor. For more closely related proteins, this can be detected from sequence similarities, which allow proteins to be clustered into families. For distantly related proteins, the detection of evolutionary relationships requires structural, functional and sequence information. This information is used collectively in the SCOP database to cluster proteins of known structure into superfamilies.
Proteins can also have structural similarities that arise not from common descent, but as a result of physics and chemistry favouring certain secondary-structure packing and chain topologies (see [49] for a recent review). Proteins that have the same major secondary structures with both the same arrangement and the same topology are clustered into folds that are described in the SCOP and CATH databases. It is important to note that two proteins having the same fold does not, by itself, indicate their descent from a common ancestor.
Bacterial genome sequences have been clustered into families using pairwise comparisons [13] [43] [46] [47] [48]. These calculations showed that the sizes of the families have an exponential character: many families with one or a few sequences and a few families with many sequences. Subsequently, using the SCOP classification, a number of groups have determined the superfamily and fold membership of the genome sequences that match known structures [13] [14] [20]. Wolf et al. [21] have described the most common folds in 13 genomes. Lists of the six largest superfamilies found in various genomes are given in Table 3 (SA Teichmann, C Chothia, unpublished data). The distributions of the superfamilies (Table 3) show systematic differences when small parasitic bacteria are compared with free-living bacteria and when both are compared with eukaryotes, a fact previously noted with regard to fold distributions [14] [21] [50].
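Given a clustering of sequences into families, the family-size distribution and a ranking of the largest families can be tabulated in a few lines. The sketch below uses invented sequence and family labels and is only meant to show what is being counted; it does not reproduce any of the cited clustering procedures.

```python
from collections import Counter
from typing import Dict, List, Tuple


def family_size_distribution(family_of: Dict[str, str]) -> Dict[int, int]:
    """Map each family size to the number of families of that size."""
    sizes = Counter(family_of.values())    # family id -> number of members
    return dict(Counter(sizes.values()))   # family size -> number of families


def largest_families(family_of: Dict[str, str], n: int) -> List[Tuple[str, int]]:
    """Return the n largest families as (family id, member count) pairs."""
    return Counter(family_of.values()).most_common(n)


# Toy assignment (hypothetical labels): sequence id -> family id.
toy_assignment = {
    "s1": "famA", "s2": "famA", "s3": "famA", "s4": "famA",
    "s5": "famB",
    "s6": "famC",
    "s7": "famD", "s8": "famD",
}
distribution = family_size_distribution(toy_assignment)
top_two = largest_families(toy_assignment, 2)
```

In the toy data, two families have a single member, one has two and one has four, a small-scale version of the pattern of many small families and few large ones. Ranking by member count corresponds to measuring family size by the number of homologues within a genome.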
Table 3.
In Table 3, the size of a superfamily is measured by the number of different homologues within the genome. Folds and superfamilies can also be ranked by their level of mRNA expression [14] or even by the direct measurement of protein levels in the cell. These will give different rankings, in particular, elevating ribosomal folds, which are highly expressed, but not highly duplicated.
It should be noted that the current information on the superfamilies and folds in genome sequences is limited to the 15-35% of the genome that can be matched to sequences of known structure and that, in genomes, there are undetected homologues of known structure, as well as common folds that are not related to the structures known at present. Consequently, one should take the current numbers (Table 3) as only rough and somewhat biased approximations. Nevertheless, it is remarkable that the common folds identified in the early calculations [13] [14] are largely similar to those identified using the newer PSI-BLAST methods [21].
The work described here has shown that pairwise comparisons, PSI-BLAST, profiles and threading techniques can assign structures to all or part of between one-quarter and one-half of the sequences in different genomes and that these matches cover between 15 and 40% of all residues in the genome. We can expect these proportions to increase rapidly as a result of improvements in computational techniques and experimental structural genomics projects. The most powerful sequence matching technique, which uses hidden Markov models [17] [51] [52], has not been used for large-scale matching so far. On the experimental side, we see from Fig. 1 that most of the structures that match approximately 40% of the MG genome were determined over the past 12 years. The rapid increase in the number of both structure determinations and, particularly, programs for experimental structural genomics should mean that the time required to determine the globular structures that occur in the remaining approximately 35% of the genome should be much shorter.
Although the current results only cover parts of genomes, they are of great interest. The matched regions are often the product of gene duplications of domain sequences and their recombination. A few families have many members and play a major role. There is no reason, at present, to believe that results of the same kind will not be found for globular domains in the regions that, up to now, have not been assigned structures. The current results support the hypothesis that the domains that form the most proteins come from a small number of superfamilies. Also, the observation that many of the proteins involved in the most basic functions of simple cells are the product of duplications and recombinations implies that these processes initially occurred in cells that were much simpler than any now known [46].
Jones [57] recently published structural assignments to 218 MG ORFs with a high reliability, which is 46% of the proteins and 30% of the amino acids. This calculation found 17 assignments to MG ORFs not found by any of the groups in Table 1.
We thank Eugene Koonin for providing information prior to publication and for comments on the manuscript. We thank Steven Brenner and Leszek Rychlewski for data. SAT thanks the Boehringer Ingelheim Fonds for support and MG thanks the Donaghue Foundation.