A Small Reservoir of Disabled ORFs in the Yeast Genome and its Implications for the Dynamics of Proteome Evolution
pp. 409-419 (doi:10.1006/jmbi.2001.5343)
Paul Harrison1, Anuj Kumar2, Ning Lan1, Nathaniel Echols1, Michael Snyder1, 2, Mark Gerstein1
1Department of Molecular Biophysics & Biochemistry, and
2Department of Molecular Cellular & Developmental Biology, Yale University, 266 Whitney Ave., P.O. Box 208114, New Haven, CT, 06520-8114, USA
(Received 26 June 2001; received in revised form 26 November 2001; accepted 26 November 2001; published electronically February 26, 2002)
We surveyed the sequenced Saccharomyces cerevisiae genome (strain S288C) comprehensively for open reading frames (ORFs) that could encode full-length proteins but contain obvious mid-sequence disablements (frameshifts or premature stop codons). These pseudogenic features are termed disabled ORFs (dORFs). Using homology to annotated yeast ORFs and non-yeast proteins plus a simple region extension procedure, we have found 183 dORFs. Combined with the 38 existing annotations for potential dORFs, we have a total pool of up to 221 dORFs, corresponding to less than ~3% of the proteome. Additionally, we found 20 pairs of annotated ORFs for yeast that could be merged into a single ORF (termed a mORF) by read-through of the intervening stop codon, and may comprise a complete ORF in other yeast strains. Focussing on a core pool of 98 dORFs with a verifying protein homology, we find that most dORFs are substantially decayed, with ~90% having two or more disablements, and ~60% having four or more. dORFs are much more yeast-proteome specific than live yeast genes (having about half the chance that they are related to a non-yeast protein). They show a dramatically increased density at the telomeres of chromosomes, relative to genes. A microarray study shows that some dORFs are expressed even though they carry multiple disablements, and thus may be more resistant to nonsense-mediated decay. Many of the dORFs may be involved in responding to environmental stresses, as the largest functional groups include growth inhibition, flocculation, and the SRP/TIP1 family. Our results have important implications for proteome evolution. The characteristics of the dORF population suggest the sorts of genes that are likely to fall in and out of usage (and vary in copy number) in a strain-specific way and highlight the role of subtelomeric regions in engendering this diversity. Our results also have important implications for the effects of the [PSI+] prion. 
The dORFs disabled by only a single stop and the mORFs (together totalling 35) provide an estimate for the extent of the sequence population that can be resurrected readily through the demonstrated ability of the [PSI+] prion to cause nonsense-codon read-through. Also, the dORFs and mORFs that we find have properties (e.g. growth inhibition, flocculation, vanadate resistance, stress response) that are potentially related to the ability of [PSI+] to engender substantial phenotypic variation in yeast strains under different environmental conditions. (See genecensus.org/pseudogene for further information.) Copyright 2002 Elsevier Science Ltd.
Key Words: translation termination; bioinformatics; genome annotation; pseudogene; yeast strains
A disabled open reading frame (dORF) is defined as an ORF that is disabled by premature stop codons or frameshifts. Primarily, such dORFs are likely to be pseudogenes. Pseudogenes are "dead" copies of genes whose disablements imply that they do not form a full-length, functional protein chain. Two forms of pseudogenes generally occur: "processed" pseudogenes, where an mRNA transcript is reverse transcribed and re-integrated into the genome;1 and "non-processed" pseudogenes, which arise from duplication of a gene in the genomic DNA and subsequent disablement.2 Pseudogene populations have been described for human chromosomes 21 and 22, for the worm and for the prokaryotes Mycobacterium leprae, Yersinia pestis and Rickettsia prowazekii.3-9 In the prokaryotes and in yeast, because of the shorter generation time such pseudogenes are likely to be "strain-specific", with proteins falling in and out of use because of environmental pressures peculiar to a particular strain. In yeast, there are no processed pseudogenes,10 but there are a few documented pseudogenes that have presumably arisen from duplication (see MIPS and SGD databases11, 12).
Apart from pseudogenes, dORFs with a single disablement may be examples of sequencing errors. Finally, dORFs with a single frameshift may arise as examples of +1 or -1 programmed ribosomal frameshifting. There is at present one verified example of either of these in the yeast genome.13, 14
Determination of the extent and characteristics of the pool of dORFs in the sequenced yeast genome is important for furthering our understanding of yeast proteome evolution. Furthermore, it may shed light on effects of the [PSI+] prion on stop-codon read-through and the engendering of phenotypic diversity in yeast.15
Finding dORFs in the sequenced yeast genome
Since the full extent of the dORF complement in yeast is not known at present, here we have defined the yeast dORF pool using a simple homology-based procedure. As described in detail in Figure 1(a), the yeast genome was scanned for significant protein homologies that contain at least one disablement and that do not rely on alignment to a previously annotated ORF in the genomic DNA. That is, if the dORF entails an annotated ORF, the disabled extension to the ORF arises from a significant span of homology. The most appropriate dORF was then formed around each suitable disabled protein homology fragment (Figure 1(a)).
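As a rough illustration of what a premature-stop disablement looks like at the sequence level, the sketch below scans one reading frame for in-frame stop codons that occur before the final codon. It is not the homology-based procedure of Figure 1(a): frameshifts can only be detected against an alignment and are not handled here, and the function name `premature_stops` and the toy sequence are hypothetical.

```python
from typing import List

# The three stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def premature_stops(dna: str, frame: int = 0) -> List[int]:
    """Return 0-based codon indices of in-frame stop codons that occur
    before the last complete codon of the given reading frame."""
    seq = dna.upper()[frame:]
    usable = len(seq) - len(seq) % 3  # drop any trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    # The final codon is allowed to be the natural terminator, so skip it.
    return [idx for idx, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]


# Toy sequence (hypothetical): ATG GCT TAA GGT CCA TGA
print(premature_stops("ATGGCTTAAGGTCCATGA"))  # [2]
```

A sequence with one entry in the returned list would correspond to a candidate single-stop dORF; in the study itself such candidates were additionally required to be supported by protein homology.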
With our homology-based procedure, we find 183 dORFs. We also collated existing annotations of a further 38 dORFs and pseudogenic fragments from Genolevures hemi-ascomycete sequencing16 and from MIPS12 (17 from MIPS, 21 from Genolevures; Figure 1 and Table 1). This gives a grand total of up to 221 dORFs from all sources (Figure 1(c)). Of the 183 homology dORFs that we find, 98 (54%) have verifying homology to either a known yeast protein or a non-yeast protein (Figure 2(b)). Known yeast proteins are those that have classes 1 through 3 in the MIPS ORF classification.12 We focus on this core pool of 98 dORFs here as a verified set that was derived uniformly by a single procedure, setting aside those dORFs that are homologous only to yeast hypothetical proteins and those based only on existing annotations. Core-pool dORFs with three or fewer disablements are given in Table 1, along with existing dORF annotations from the MIPS/Genolevures databases that could be discerned to have three or fewer disablements.
Additionally, we searched for pairs of existing annotated ORFs that are adjacent along the chromosome, and could be merged by stop codon read-through for the 5' ORF of the pair, forming a single complete ORF (Figure 1(b)). We found 20 pairs of such merged ORFs, or mORFs (Table 2). One could consider this an additional method for finding dORFs with a single stop codon, but only those that arise from existing annotations, and that would form a whole ORF in a different yeast strain.
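The idea of merging two adjacent ORFs by read-through can be sketched as a simple frame check. The code below is a minimal illustration and not the procedure of Figure 1(b): it assumes that both ORFs lie on the same strand, that coordinates are 0-based indices into the coding-strand sequence, and that the merged ORF must end at the downstream ORF's stop codon in the frame of the upstream ORF. The function name and the toy sequences are hypothetical.

```python
# The three stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def merges_by_single_readthrough(seq: str, up_start: int, up_stop: int, down_stop: int) -> bool:
    """Check whether reading through one stop codon yields a single open frame.

    seq       -- coding-strand sequence covering both ORFs
    up_start  -- index of the first base of the upstream ORF
    up_stop   -- index of the first base of the upstream ORF's stop codon
    down_stop -- index of the first base of the downstream ORF's stop codon
    """
    seq = seq.upper()
    # Both stop codons must sit in the reading frame set by the upstream start.
    if (up_stop - up_start) % 3 != 0 or (down_stop - up_start) % 3 != 0:
        return False
    if seq[up_stop:up_stop + 3] not in STOP_CODONS:
        return False
    if seq[down_stop:down_stop + 3] not in STOP_CODONS:
        return False
    # No further in-frame stop may lie between the two terminators.
    return all(seq[i:i + 3] not in STOP_CODONS for i in range(up_stop + 3, down_stop, 3))


# Toy example (hypothetical): ATG AAA TAG | CCC | ATG GGG TAA
print(merges_by_single_readthrough("ATGAAATAGCCCATGGGGTAA", 0, 6, 18))  # True
```

Under these assumptions, a pair fails the check when the downstream stop is out of frame or when a second stop codon interrupts the linking sequence.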
Properties of yeast dORFs
We examined the core pool of dORFs as follows: (1) their distribution of disablements; (2) their homology trends; (3) their prevalent families; and (4) their chromosomal distribution.
Most dORFs are substantially decayed. The distribution of the number of disablements is shown for the core pool of dORFs (in Figure 2(a)); 61% (60/98) have four or more disablements. In this set, 14 dORFs have one disablement, and eight of these a single premature stop codon (Table 1). An additional seven dORFs that are homologous only to hypothetical yeast proteins have a single disablement (one with a premature stop).
The existence of dORFs with single stop codons could be of relevance to the effects of the [PSI+] prion. Therefore, we checked the dORFs that we found by re-sequencing them (described in the legend to Figure 1(a)). We were able to amplify PCR products for six dORFs that were in non-repetitive regions, and verified the premature stop codons for each of them.
For some insight into strain-specific variation, we looked in more detail at the homology relationships of the 98 core-pool dORFs. Over half (54%) of these dORFs are specific to the Saccharomyces cerevisiae species, having no homology to non-yeast proteins (Figure 2(b)).
Four-fifths of the known yeast proteins (MIPS ORF classes 1 to 3; ref. 12) are homologous to a non-yeast protein. In comparison, only about two-fifths (41%) of the dORFs that are homologous to a known yeast protein are homologous also to a non-yeast protein (Figure 2(b)). These homology trends change only slightly (±2%) upon inclusion of the dORFs and pseudogenic fragments from the MIPS and Genolevures databases.
Furthermore, from the grand total of 221 dORFs, there are only a small number of dORFs (11) that correspond to "live" ORFs with no living relatives. One example is a very decayed reading frame of the KSH killer toxin corresponding to the single live KSH copy in the proteome (this protein also has no orthologs).
Families of dORFs with three or more members are listed in Figure 1(c). The family related to the growth inhibitor GIN11 (ref. 17) stands out as the largest (16 members). The large population of growth-inhibitor dORFs may indicate that these vary in copy number for different yeast strains. The next largest family is the flocculins. These proteins have a variety of roles related to cell-cell adhesion, and are involved in mating, invasive growth and pseudohyphal formation in response to environmental stresses.18 Pseudogenes for these have been discussed.19 Most important of these is FLO8, which has a single stop-codon mutation in the laboratory strain S288C that prevents flocculation and filamentous growth (Table 1).20 There are also five dORFs for the DEAD-box helicases (an abundant ORF family in yeast; Figure 1(c)) and three for the SRP/TIP1 family, whose members are involved in environmental stress response.
Highly increased density of dORFs at telomeres
We observe a highly increased density of dORFs at the telomeres of the chromosomes (Figure 2(c)). Out of our core pool of 98 verified dORFs, 43 (44%) are subtelomeric, i.e. in the first and last 20 kb of the chromosomes. These include all of the dORFs for the two largest families, the flocculins and growth inhibitors noted in the previous section. If the 38 additional MIPS and Genolevures annotations are included, the proportion of dORFs in these telomeric intervals drops slightly (to 36%). The proportion is even higher for the dORFs that are homologous only to hypothetical proteins: 64 of the 85 non-verified dORFs that we find lie in the first and last 20 kb of the chromosomes. Also, a quarter (5/20) of the mORFs are in the first and last 20 kb of the chromosomes. In comparison, the proportion of total gene annotations in these 20 kb telomeric intervals is very small (~4%) (Figure 2(c)). These data indicate clearly the existence of a dynamically evolving subtelomeric subproteome in yeast.
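The 20 kb criterion used above can be written as a small coordinate test. This is a minimal sketch under stated assumptions: coordinates are 1-based and inclusive, a feature counts as subtelomeric if it overlaps either terminal window (the text does not say whether overlap or full containment was required), and the chromosome length and feature coordinates in the example are hypothetical. The counts 43, 98 and 4% are taken from the text.

```python
SUBTELOMERIC_WINDOW = 20_000  # first and last 20 kb of each chromosome


def is_subtelomeric(start: int, end: int, chrom_length: int,
                    window: int = SUBTELOMERIC_WINDOW) -> bool:
    """True if a feature (1-based, inclusive coordinates) overlaps the
    first or last `window` bases of its chromosome."""
    return start <= window or end > chrom_length - window


# Hypothetical features on a hypothetical 230 kb chromosome.
print(is_subtelomeric(1_000, 2_500, 230_000))      # True
print(is_subtelomeric(100_000, 101_000, 230_000))  # False

# Fractions quoted in the text: 43 of 98 core-pool dORFs versus ~4% of genes.
dorf_fraction = 43 / 98
gene_fraction = 0.04
print(round(dorf_fraction * 100))            # 44
print(round(dorf_fraction / gene_fraction))  # 11
```

On these figures the subtelomeric fraction of core-pool dORFs is roughly eleven times that of annotated genes.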
Expression of dORFs
We tested a small random sample of 11 dORFs for expression (Figure 2(d)). Four of these showed appreciable expression, even though one has two disablements, and the other three have five or more disablements. Two of these four dORFs are subtelomeric (within 20 kb of the chromosome ends), and homologous to putative hypothetical ORFs, representing dORF families of nine or more members. The other two are single dORFs with moderate sequence similarity to two annotated ORFs, both with five or more disablements; it is intriguing that we can still detect expression of these dORFs, an observation suggesting that these sequences, at minimum, possess functional promoters, and are still detectable despite nonsense-mediated decay (NMD).21
Implications for proteome evolution
A dynamically evolving subtelomeric subproteome and its role in strain-specific variation
The total pool of dORFs and pseudogenic fragments corresponds to only a very small percentage of the total annotated proteome (~3%). However, the distribution of these dORFs, both in terms of homology and chromosomal position, provides an important perspective on yeast proteome evolution.
In the present study, we have found that dORFs are half as likely to be related to a non-yeast protein (~40% of dORFs) as is the average known yeast protein (80% of annotated ORFs). This comparison implies that there has been no major change in the recent evolutionary dynamics of the yeast proteome. That is, it appears that disablement attacks evolutionarily young ORFs preferentially as opposed to ancient ORFs that are conserved between species. Also, there is a dramatically increased density of dORFs near the telomeres; as noted above, the two largest families of dORFs (flocculins and growth inhibitors) are subtelomeric and are related to subtelomeric ORFs. Additionally, a third interesting subtelomeric family that is classed as hypothetical but has a large number of dORFs (six compared to 21 live ORFs) is the DUP family of putative membrane proteins, which has an InterPro motif,22 and whose expression may be pheromone-responsive.23 It is interesting to note that subtelomeric regions can be meiotic recombination "coldspots".24
We have shown that some dORFs can still be expressed despite their disabled state, and may be more refractory to NMD in some way. This implies that such dORFs are still live to some extent, and represent a store of coding information.
Implications for the effects of the [PSI+] prion
[PSI+] is an inheritable phenomenon in yeast that is caused by the propagation of an alternatively folded, amyloid-like form of the Sup35p protein.25, 26 Sup35p is part of the surveillance complex in yeast that controls mRNA NMD and translation termination.27 The occurrence of the [PSI+] prion in a yeast strain thus can decrease translation termination efficiency, leading to stop-codon readthrough (SCRT) and increasing the likelihood that a protein will be formed from a dORF with a premature stop codon. SCRT for the ade gene has been used since the mid-1960s as the standard protocol to detect the presence of [PSI+].25, 29 Different yeast strains show widely varied phenotypes for growth and viability in different environments depending on whether [PSI+] is present.15, 27 Thus, arguably, different levels of increased SCRT in yeast strains may be involved in causing this prion-engendered variability. It is possible that ribosomal frameshifting may be under the influence of the surveillance complex and consequently of [PSI+].29 Although the sequenced yeast strain S288C is not a potent carrier of [PSI+], we examine below the size and make-up of our yeast dORF pool, particularly those that involve one stop codon, for the implications of [PSI+]-engendered phenotypic diversity in yeast.
The highest levels of [PSI+]-related SCRT for yeast strains that we can find in the literature are ~30%,27, 29 with base-line levels in [psi-] cells of up to 5%.27, 29 This implies that, assuming SCRT events are independent, ORFs with two or more stop codons are unlikely to produce substantial levels of encoded protein, even with [PSI+].
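The independence argument can be made concrete with a short calculation. The sketch below assumes, as the text does, that each stop codon is read through independently with the same probability, so the fraction of translation events that reach the natural terminator is the read-through rate raised to the number of premature stops. The function name is hypothetical; the rates of 5% and 30% are the literature values quoted above.

```python
def full_length_fraction(readthrough_rate: float, n_stops: int) -> float:
    """Fraction of translation events that read through all n_stops
    premature stop codons, assuming independent read-through events."""
    return readthrough_rate ** n_stops


# Base-line [psi-] level (up to ~5%) and the highest reported [PSI+] level (~30%).
for rate in (0.05, 0.30):
    for n_stops in (1, 2, 3):
        fraction = full_length_fraction(rate, n_stops)
        print(f"rate={rate:.2f} stops={n_stops} full-length fraction={fraction:.4f}")
```

Even at a 30% read-through rate, two premature stops leave only about 9% of translation events producing the full-length product, and three leave under 3%, which is why the estimate that follows is restricted to sequences with a single stop codon.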
Consequently, we can use our data to estimate the size of the pool of sequence entities in a yeast strain that could be affected by SCRT caused by [PSI+]. We find that there is only a rather small cohort of 35 protein sequences that could be acted on readily by [PSI+] in this way. This comprises the set of all dORFs with a single premature stop codon, plus the mORFs that we detected (see the inset in Figure 1(c) for an explanation of this data set). This set of 35 entities corresponds to less than 1% of the whole yeast proteome. Its small size suggests that minor extensions to existing annotated ORFs that are not detectable by homology may play a role in engendering phenotypic diversity in yeast.15, 27 On average, a yeast ORF would be extended by 17 (±24) amino acid residues by SCRT; this may be long enough to add an additional secondary-structure element to a domain, or a transmembrane helix.
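The length of such an extension is simply the distance from the annotated stop codon to the next in-frame stop. The sketch below shows one way this could be counted; it is an illustration, not the calculation behind the 17 (±24) figure. It assumes that the residue inserted at the read-through stop itself is not counted, and the function name and toy sequence are hypothetical.

```python
from typing import Optional

# The three stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def readthrough_extension(seq: str, stop_index: int) -> Optional[int]:
    """Count the sense codons between the stop codon at stop_index and the
    next in-frame stop codon. Returns None if no further stop is found."""
    seq = seq.upper()
    if seq[stop_index:stop_index + 3] not in STOP_CODONS:
        raise ValueError("stop_index does not point at a stop codon")
    extension = 0
    for i in range(stop_index + 3, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return extension
        extension += 1
    return None


# Toy example (hypothetical): ATG AAA TAG CCC GGG TAA
print(readthrough_extension("ATGAAATAGCCCGGGTAA", 6))  # 2
```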
The dORFs with a single stop codon (in Table 1), and the prevalent dORF families (Figure 1(c)) show characteristics that may be relevant to phenotypes arising from SCRT. As the presence of [PSI+] produces widely different growth phenotypes for different yeast strains, the number and state of decay of dORFs of the growth inhibitors (related to Gin11p) may have a bearing on [PSI+] strain-specific growth-rates.15 The dORFs related to SRP stress-response proteins may have a role in cold-shock response. Of the single stop-codon dORFs that we observe, an extra viable copy of the fermentation enzyme aryl-alcohol reductase or of the drug-resistance pump SGE1 (Table 1) may prove beneficial for growth on different media. Finally, variation in flocculence (clumping from cell-cell adhesion) was observed in the recent study by True & Lindquist15 on phenotypic diversity engendered by [PSI+]. Here, flocculins (which cause such cell-cell adhesion19) comprise a large dORF family (Figure 1(c)), including three singly disabled dORFs. Variability in the number of distinct flocculins may help maintain a degree of strain-specific variation in cell adhesion properties. Flocculins are involved also in environmental stress response.18
We have detected mRNA transcripts corresponding to four dORFs possessing varying degrees of coding disability (Figure 2(d)). From this observation, we can suggest that the dORFs are real sequence entities and that disablements in the coding sequence do not necessarily prevent detectable expression of the corresponding mRNA. The detected expression may imply that some dORFs are more refractory to NMD in some way, or may be interesting candidates for more detailed and comprehensive study of SCRT and the potential effects of [PSI+].
There are some interesting examples of mORFs that may have relevance for [PSI+] phenotypic diversity effects (Table 2). Note, however, that a large proportion of the ORFs involved (16/40) are hypothetical and that these mORFs may be complete ORFs in other yeast strains. For example:
YBR226C-YBR227C: read-through from a hypothetical protein (predicted to be mitochondrial30) can continue into a mitochondrial chaperone; modification of the activity of this protein may affect mitochondrial protein homeostasis.
YHR057C-YHR058C: a peptidyl-prolyl isomerase can be N-terminally tagged onto a transcriptional regulation protein. These are clearly disparate functions; disruption of the latter ORF is lethal to yeast cells, so this fusion may decrease yeast-cell viability.
YER039C-YER039C-A: HVG1, which has strong similarity to the vanadate-resistance protein (GOG5), can be read through into a short hypothetical protein (YER039C-A, 72 amino acid residues). This last pairing is particularly notable, since one yeast strain (with SCRT levels of ~26%) showed a decreased growth-rate in the presence of vanadate when carrying [PSI+].15 Also, HVG1 is the only paralog of GOG5 in the sequenced yeast strain S288C.
The mORFs we detected have linking nucleotide sequences of varying length (from one to 262 nucleotides, with a mean of 31). They can be regarded as dORFs that arise solely from two existing ORF annotations; we assume that such mORFs could be complete ORFs in another yeast strain.
The dORF annotation data and sequences are available at the website http://genecensus.org/pseudogene (or http://bioinfo.mbb.yale.edu/genome/pseudogene).
We thank Tricia Serio and Zhaolei Zhang for comments on the manuscript. A.K. is supported by a postdoctoral fellowship from the American Cancer Society. M.G. acknowledges support from the NIH protein structure initiative (P50 grant GM62413-01).
Andersson S. G., Zomorodipour A., Andersson J. O., Sicheritz-Ponten T., Alsmark U. C. M., Podowski R. M. et al. (1998). The genome sequence of Rickettsia prowazekii and the origin of mitochondria. Nature, 396, 133-140
Parkhill J., Wren B. W., Thomson N. R., Titball R. W., Holden M. T., Prentice M. B. et al. (2001). Genome sequence of Yersinia pestis, the causative agent of plague. Nature, 413, 467-470
Hattori M., Fujiyama A., Taylor T. D., Watanabe H., Yada T., Park H. S. et al. (2000). The DNA sequence of human chromosome 21. The chromosome 21 mapping and sequencing consortium. Nature, 405, 311-319
Harrison P. M., Echols N. & Gerstein M. (2001). Digging for dead genes: an analysis of the characteristics and distribution of the pseudogene population in the C. elegans genome. Nucl. Acids Res. 29, 818-830
Harrison P. M., Hegyi H., Balasubramaniam S., Luscombe N., Bertone P., Echols N. et al. (2002). Molecular fossils in the human genome: Identification and analysis of the pseudogenes on chromosomes 21 and 22. Genome Res. 12, 273-281
Cole S. T., Eiglmeier K., Parkhill J., James K. D., Thomson N. R., Wheeler P. R. et al. (2001). Massive gene decay in the leprosy bacillus. Nature, 409, 1007-1011
Blandin G., Durrens P., Tekaia F., Aigle M., Bolotin-Fukuhara M., Bon E. et al. (2000). Genomic exploration of the hemiascomycetous yeasts: 4. The genome of Saccharomyces cerevisiae revisited. FEBS Letters, 487, 31-36
Kawahata M., Amari S., Nishizawa Y. & Akada R. (1999). A positive selection for plasmid loss in S. cerevisiae using galactose-inducible growth-inhibitory sequences. Yeast, 15, 1-10
Apweiler R., Attwood T. K., Bairoch A., Bateman A., Birney E., Biswas M. et al. (2000). InterPro-an integrated documentation resource for protein families, domains and functional sites. Bioinformatics, 16, 1145-1150
Gerton J. L., DeRisi J., Shroff R., Lichten M., Brown P. O. & Petes T. D. (2000). Global mapping of meiotic recombination hotspots and coldspots in the yeast Saccharomyces cerevisiae. Proc. Natl Acad. Sci. USA, 97, 11383-11390
Eaglestone S. S., Cox B. S. & Tuite M. F. (1999). Translation termination efficiency can be regulated in S. cerevisiae by environmental stress through a prion-mediated mechanism. EMBO J. 18, 1974-1981
Bidou L., Stahl G., Hatin I., Namy O., Rousset J. P. & Farabaugh P. J. (2000). Nonsense-mediated decay mutants do not affect programmed -1 frameshifting. RNA, 6, 952-961
Drawid A. & Gerstein M. (2001). A Bayesian system integrating expression data with sequence patterns for localizing proteins: comprehensive application to the yeast genome. J. Mol. Biol. 301, 1059-1075
Altschul S. F., Madden T. L., Schaffer A. A., Zhang J., Zhang Z., Miller W. & Lipman D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res. 25, 3389-3402
Adams M. D., Celniker S. E., Holt R. A., Evans C. A., Gocayne J. D., Amanatides P. G., Scherer S. E. et al. (2000). The genome sequence of Drosophila melanogaster. Science, 287, 2185-2195
Gerstein M. (1997). A structural census of genomes: comparing bacterial, eukaryotic, and archaeal genomes in terms of protein structure. J. Mol. Biol. 274, 562-576
Hegyi H. & Gerstein M. (1999). The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. J. Mol. Biol. 288, 147-164