jmbi.2001.5343

A Small Reservoir of Disabled ORFs in the Yeast Genome and its Implications for the Dynamics of Proteome Evolution

pp. 409-419 (doi:10.1006/jmbi.2001.5343)

Paul Harrison¹, Anuj Kumar², Ning Lan¹, Nathaniel Echols¹, Michael Snyder¹,², Mark Gerstein¹

¹Department of Molecular Biophysics & Biochemistry, and
²Department of Molecular Cellular & Developmental Biology, Yale University, 266 Whitney Ave., P.O. Box 208114, New Haven, CT, 06520-8114, USA

(Received 26 June 2001; received in revised form 26 November 2001; accepted 26 November 2001; published electronically February 26, 2002)

Abstract

We surveyed the sequenced Saccharomyces cerevisiae genome (strain S288C) comprehensively for open reading frames (ORFs) that could encode full-length proteins but contain obvious mid-sequence disablements (frameshifts or premature stop codons). These pseudogenic features are termed disabled ORFs (dORFs). Using homology to annotated yeast ORFs and non-yeast proteins plus a simple region extension procedure, we have found 183 dORFs. Combined with the 38 existing annotations for potential dORFs, we have a total pool of up to 221 dORFs, corresponding to less than ~3% of the proteome. Additionally, we found 20 pairs of annotated ORFs for yeast that could be merged into a single ORF (termed a mORF) by read-through of the intervening stop codon, and may comprise a complete ORF in other yeast strains. Focussing on a core pool of 98 dORFs with a verifying protein homology, we find that most dORFs are substantially decayed, with ~90% having two or more disablements, and ~60% having four or more. dORFs are much more yeast-proteome specific than live yeast genes (having about half the chance that they are related to a non-yeast protein). They show a dramatically increased density at the telomeres of chromosomes, relative to genes. A microarray study shows that some dORFs are expressed even though they carry multiple disablements, and thus may be more resistant to nonsense-mediated decay. Many of the dORFs may be involved in responding to environmental stresses, as the largest functional groups include growth inhibition, flocculation, and the SRP/TIP1 family. Our results have important implications for proteome evolution. The characteristics of the dORF population suggest the sorts of genes that are likely to fall in and out of usage (and vary in copy number) in a strain-specific way and highlight the role of subtelomeric regions in engendering this diversity. Our results also have important implications for the effects of the [PSI+] prion. 
The dORFs disabled by only a single stop and the mORFs (together totalling 35) provide an estimate for the extent of the sequence population that can be resurrected readily through the demonstrated ability of the [PSI+] prion to cause nonsense-codon read-through. Also, the dORFs and mORFs that we find have properties (e.g. growth inhibition, flocculation, vanadate resistance, stress response) that are potentially related to the ability of [PSI+] to engender substantial phenotypic variation in yeast strains under different environmental conditions. (See genecensus.org/pseudogene for further information.) Copyright 2002 Elsevier Science Ltd.

Key Words: translation termination; bioinformatics; genome annotation; pseudogene; yeast strains

Article

A disabled open reading frame (dORF) is defined as an ORF that is disabled by premature stop codons or frameshifts. Primarily, such dORFs are likely to be pseudogenes. Pseudogenes are "dead" copies of genes whose disablements imply that they do not form a full-length, functional protein chain. Two forms of pseudogenes generally occur: "processed" pseudogenes, where an mRNA transcript is reverse transcribed and re-integrated into the genome;¹ and "non-processed" pseudogenes, which arise from duplication of a gene in the genomic DNA and subsequent disablement.² Pseudogene populations have been described for human chromosomes 21 and 22, for the worm and for the prokaryotes Mycobacterium leprae, Yersinia pestis and Rickettsia prowazekii.³⁻⁹ In the prokaryotes and in yeast, because of the shorter generation time, such pseudogenes are likely to be "strain-specific", with proteins falling in and out of use because of environmental pressures peculiar to a particular strain. In yeast, there are no processed pseudogenes,¹⁰ but there are a few documented pseudogenes that have presumably arisen from duplication (see MIPS and SGD databases¹¹,¹²).

Apart from pseudogenes, dORFs with a single disablement may be examples of sequencing errors. Finally, dORFs with a single frameshift may arise as examples of +1 or -1 programmed ribosomal frameshifting. There is at present one verified example of either of these in the yeast genome.¹³,¹⁴

Determination of the extent and characteristics of the pool of dORFs in the sequenced yeast genome is important for furthering our understanding of yeast proteome evolution. Furthermore, it may shed light on effects of the [PSI+] prion on stop-codon read-through and the engendering of phenotypic diversity in yeast.¹⁵

Finding dORFs in the sequenced yeast genome

Since the full extent of the dORF complement in yeast is not known at present, here we have defined the yeast dORF pool using a simple homology-based procedure. As described in detail in Figure 1(a), the yeast genome was scanned for significant protein homologies that contain at least one disablement and that do not rely on alignment to a previously annotated ORF in the genomic DNA. That is, if the dORF entails an annotated ORF, the disabled extension to the ORF arises from a significant span of homology. The most appropriate dORF was then formed around each suitable disabled protein homology fragment (Figure 1(a)).
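
The overlap-reduction step of this procedure (described in the legend to Figure 1(a): homology segments are selected in decreasing order of significance, and any others that overlap them are flagged for deletion) can be sketched in Python as follows. This is an illustrative sketch only, not the code used in the study. The `Hit` record, the function names and the example coordinates are hypothetical, all hits are assumed to lie on the same chromosome, and a hit that has itself been flagged is assumed not to block other hits.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One protein-homology match on a chromosome (hypothetical record layout)."""
    start: int      # 0-based start on the chromosome, inclusive
    end: int        # end on the chromosome, exclusive
    evalue: float   # alignment e-value; smaller means more significant


def overlaps(a: Hit, b: Hit) -> bool:
    """True when the two half-open intervals share at least one base."""
    return a.start < b.end and b.start < a.end


def reduce_for_overlap(hits: list[Hit], max_evalue: float = 0.01) -> list[Hit]:
    """Keep significant hits greedily, most significant first.

    A hit is dropped when it overlaps a hit that has already been kept.
    The 0.01 default is the e-value threshold quoted in the legend.
    """
    significant = [h for h in hits if h.evalue <= max_evalue]
    kept: list[Hit] = []
    for hit in sorted(significant, key=lambda h: h.evalue):
        if not any(overlaps(hit, k) for k in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)


# Invented coordinates, for illustration only.
example_hits = [
    Hit(100, 400, 1e-30),
    Hit(350, 600, 1e-5),    # overlaps the first hit and is less significant
    Hit(700, 900, 1e-3),
    Hit(1000, 1200, 0.5),   # fails the e-value threshold
]
reduced = reduce_for_overlap(example_hits)
```

In this toy example only the first and third hits survive; the surviving segments would then be re-examined for disablements as described in the legend.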

Figure 1 dORF and mORF detection. (a) dORFs from disabled protein homology. Initially, the complete sequenced genome of yeast³¹ was searched in six-frame translation against the SWISSPROT protein sequence database³² and yeast proteome sequence data from SGD (http://genome-www.stanford.edu/Saccharomyces, downloaded May 2000) and MIPS (http://mips.gsf.de, downloaded May 2000), using the alignment program TFASTX/Y.³³ Low complexity was masked using SEG.³⁴ All protein matches that overlapped genomic features such as transposable elements and tRNA genes were deleted. All significant protein matches (e-value ≤ 0.01) were reduced for overlap by selecting homology segments in decreasing order of significance and flagging any others that overlap them for deletion. Matched stretches of genomic DNA that contained any disablements (either frameshifts or stop codons) were then further examined by comparing the matching protein with a larger segment of the genomic DNA that had been extended at either end by the size of the matching protein sequence (in the equivalent number of nucleotides). This was performed with the FASTX/Y program. These enlarged homology fragments (denoted by the grey box) were then extended into the most appropriate ORFs, by searching for the nearest downstream stop codon (filled dot, TGA given as an example), and the farthest upstream start codon (open dot, labelled ATG at position A), or failing that, the nearest upstream start codon, after the nearest upstream stop (shown at position B). All such generated ORFs were then inspected manually, and reduced for overlap with each other where a larger predicted dORF comprises a similar shorter one. After this initial search for dORFs, we performed a second more comprehensive search for homology using PSI-BLAST.³⁵ We extracted all possible ORFs of size ≥ 30 codons from the yeast genome (i.e. all stretches of genomic DNA beginning with a start codon and ending with a stop codon) and searched them (in translation) against SWISSPROT³² plus the combined annotated proteomes of Caenorhabditis elegans,³⁶ Arabidopsis thaliana,³⁷ Drosophila melanogaster,³⁸ S. cerevisiae itself and 18 prokaryotes. All significant protein matches (using default threshold values) were again selected and processed as above for the original searches to find additional dORFs, again using FASTX/Y in the re-alignment stage. Those found only with PSI-BLAST are labelled in Table 1. To gather existing annotation on potential dORFs, we examined the MIPS database¹² for any annotated pseudogenes, or ORFs reported to have stop codons or frameshifts. Also, from the Genolevures hemi-ascomycete sequencing project,¹⁶ there are 17 examples of ORF extensions that may be potential dORFs (five singly disabled) that were not found by our disabled homology-searching procedure. Generally, these ORF extensions could be sequencing errors or (strain-specific) pseudogenes. All dORFs were checked against yeast chromosome sequence updates at http://genome-www.stanford.edu/Saccharomyces, resulting in the deletion of one dORF from the list. All yeast ORF classifications are taken from the MIPS database as of May 2000; known ORFs are those with classes 1 through 3. Sequencing to estimate stop codon errors: putative disablements were verified experimentally within all six non-repetitive and previously unidentified dORFs possessing a single premature stop codon. For purposes of this analysis, genomic DNA was extracted from a derivative of S. cerevisiae strain S288C. A region of this DNA encompassing each predicted premature stop codon was amplified using the polymerase chain reaction (PCR); PCR-amplified products were subsequently sequenced on both strands by standard methods (i.e. cycle-sequencing using big-dye terminators). By this approach, the presence of each premature termination codon was confirmed unambiguously. (b) Mergeable pairs of ORFs (mORFs). All adjacent pairs of annotated ORFs in the yeast genome (denoted by white boxes) were assessed for whether the 5' partner of the pair could merge into the 3' partner if the stop codon of the former is changed to a sense codon. If the two ORFs can form a larger ORF, ignoring the intervening stop codon, then the complete disabled reading frame is termed a mORF. (c) Classification of dORFs and mORFs. In the top panel, the tree shows the breakdown of the grand total of 221 dORFs into the 183 homology dORFs that we detected by our procedure, and the 38 additional annotations for dORFs and pseudogenic fragments culled from the MIPS and Genolevures databases.¹²,¹⁶ The 183 homology dORFs separate into 98 dORFs with a verifying homology to a non-yeast protein or to a known yeast protein, and 85 dORFs that are homologous only to hypothetical ORFs. The inset panel at the bottom right describes the breakdown of the entities with single disablements (both dORFs and mORFs). Here, the dORFs for each of the four main groupings are shown with boxes of the same colour as in the top panel. The totals are split (frameshifts plus stops). The dORFs with single stops combined with the 20 mORFs give a total of 35 entities with single stop-codons. The inset panel at the bottom left shows the families in the dORF pool that have three or more members, with their corresponding numbers of ORFs. dORF families were derived using a modification of the algorithm described earlier.³⁹⁻⁴¹ dORFs and ORFs were deemed related if they have an alignment score of 1×10⁻⁴ or less for BLASTP.³⁵
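
As a rough illustration of the ORF-extraction step described in this legend (all stretches of genomic DNA beginning with a start codon and ending with a stop codon, with a minimum size of 30 codons), the following Python sketch scans both strands in three frames each. It is not the code used in the study. The function names, the choice to report only the longest ORF for each stop codon, and the convention of counting sense codons with the stop codon excluded are all assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq: str, min_codons: int = 30) -> list[tuple[int, int]]:
    """Return (start, end) for ATG...stop stretches in the three frames of seq.

    Coordinates are 0-based and half-open on the given strand, and `end`
    includes the stop codon. Assumption: for each stop codon, only the
    longest ORF (first ATG after the previous in-frame stop) is reported.
    """
    found = []
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None and codon == "ATG":
                orf_start = i
            elif codon in STOP_CODONS:
                if orf_start is not None:
                    n_codons = (i - orf_start) // 3   # sense codons, stop excluded
                    if n_codons >= min_codons:
                        found.append((orf_start, i + 3))
                orf_start = None
    return found


def six_frame_orfs(seq: str, min_codons: int = 30) -> list[tuple[str, int, int]]:
    """ORFs on both strands, reported as (strand, start, end) in forward coordinates."""
    seq = seq.upper()
    n = len(seq)
    plus = [("+", s, e) for s, e in orfs_in_strand(seq, min_codons)]
    minus = [("-", n - e, n - s)
             for s, e in orfs_in_strand(reverse_complement(seq), min_codons)]
    return plus + minus


# Toy sequence: 2 nt of padding, ATG, 30 alanine codons, a stop codon, 1 nt of padding.
toy = "CC" + "ATG" + "GCT" * 30 + "TAA" + "G"
toy_orfs = six_frame_orfs(toy)
```

The translated ORFs would then be searched against the protein databases; that step depends on external alignment programs and is not sketched here.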

Table 1 dORFs with three or fewer disablements

With our homology-based procedure, we find 183 dORFs. We also collated existing annotations of a further 38 dORFs and pseudogenic fragments from Genolevures hemi-ascomycete sequencing¹⁶ and from MIPS¹² (17 from MIPS, 21 from Genolevures; Figure 1 and Table 1). This gives a grand total of up to 221 dORFs from all sources (Figure 1(c)). Of the 183 homology dORFs that we find, 98 (54%) have verifying homology to either a known yeast protein or a non-yeast protein (Figure 2(b)). Known yeast proteins are those that have classes 1 through 3 in the MIPS ORF classification.¹² We focus on this core pool of 98 dORFs here as a verified set that was derived uniformly by a single procedure, setting aside those dORFs that are homologous only to yeast hypothetical proteins and those based only on existing annotations. Core-pool dORFs with three or fewer disablements are given in Table 1, along with existing dORF annotations from the MIPS/Genolevures databases that could be discerned to have three or fewer disablements.

Figure 2 Analysis of the dORF reservoir. (a) The distribution of the number of disablements. This is shown for the core pool of 98 verified dORFs. The total for singly disabled dORFs is divided into those with a single frameshift (dark bar) and those with a single premature stop codon (white bar). The total disablements for "15+" includes all those counts greater than 15. Additionally, as can be deduced from Table 1, eight out of the 21 Genolevures-derived dORFs have at least four disablements. Disablements for the MIPS-annotated dORFs are not determined readily, as some of them are ORF truncations and pairs of homology fragments that would not be detected by our procedure. Those for which we could define the number of disablements are listed in Table 1. (b) Homology classification of dORFs. The distribution of the dORFs into those that have a non-yeast proteome homolog but no known yeast protein homolog (denoted N!K in Table 1), those that have a known yeast protein homolog but no non-yeast homolog (denoted K!N), and those that have both (denoted N&K). Inclusion of the homology trends for the extra MIPS and Genolevures annotations changes the representation of these categories only slightly (±2% at most). (c) Highly increased density of dORFs at the telomeres. Distribution of dORFs (top panel) and ORFs (bottom panel) at the telomeres versus the remainder of the yeast chromosomes. The total numbers of dORFs and ORFs are shown in 10 kb intervals from both ends of all 16 yeast chromosomes (totalled together). In the bottom panel, known ORFs are shown with grey bars and hypothetical ORFs are shown with black ones. In the top panel, dORFs are divided into those homologous only to hypothetical yeast proteins (black bars) and the remainder (grey bars). The inset graphs show the total number of dORFs (upper panel) and ORFs (lower panel) within 20 kb from both telomeres and in the remaining span of the chromosomes. (d) Detected expression of dORFs. To investigate expression of dORF sequences, a sampling of 11 predicted dORFs was subjected to dot blot analysis using strand-specific oligonucleotides in an array-based format. For this analysis, poly(A) RNA was extracted from a vegetatively growing diploid S288C derivative; extracted RNA was treated with DNase I and subsequently biotinylated using the BrightStar™ Psoralen-Biotin kit (Ambion, Austin, TX). Biotinylated RNA was used to probe an array of 50-60-mer oligonucleotides spotted onto a nylon membrane-coated glass slide (Schleicher and Schuell, Keene, NH). Oligonucleotide sequences were derived from each putative dORF coding region and were selected to avoid repeated segments. Arrayed oligonucleotides were hybridized against 200 ng of biotinylated poly(A) RNA supplemented with denatured salmon sperm DNA at a final concentration of 100 µg/ml. Hybridizations were carried out in buffer containing formamide at 45 °C. Bound RNA was detected using the BrightStar™ BioDetect™ kit (Ambion, Austin, TX). Spot size and intensity were quantified using software distributed in the NIH Image package version 1.62 (rsb.info.nih.gov/nih-image). Four dORF transcripts detected at levels appreciably distinct from background are shown here (lanes 2, 4, 5 and 7). These are homologs of the yeast ORFs YGR293c, YNL338w, YIL058w and YKL221w, respectively. Lane C (negative control) indicates a lack of observable binding associated with hybridization against a non-coding region of the yeast genome. This dot blot analysis cannot be used to distinguish between transcripts greater than 75% identical. As lane 2 and lane 4 are each representative of larger dORF families, this analysis indicates that at least one dORF from each of these previously unappreciated families is expressed under conditions of vegetative growth.

Additionally, we searched for pairs of existing annotated ORFs that are adjacent along the chromosome, and could be merged by stop codon read-through for the 5' ORF of the pair, forming a single complete ORF (Figure 1(b)). We found 20 pairs of such merged ORFs, or mORFs (Table 2). One could consider this an additional method for finding dORFs with a single stop codon, but only those that arise from existing annotations, and that would form a whole ORF in a different yeast strain.
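
The read-through test for a pair of adjacent ORFs can be illustrated with a short Python sketch. It is a simplified reading of the criterion, not the procedure used in the study: both ORFs are assumed to be given on their coding strand, the 3' ORF is assumed to start after the stop codon of the 5' ORF, and the pair is accepted only if the two ORFs share a reading frame and no further stop codon lies between them. The function name and the toy sequences are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def can_merge(seq: str, orf5: tuple[int, int], orf3: tuple[int, int]) -> bool:
    """True if read-through of the 5' ORF's stop codon runs in frame into the 3' ORF.

    seq is the coding strand. Each ORF is (start, end), 0-based and half-open,
    with `end` just past its own stop codon.
    """
    start5, end5 = orf5
    start3, _end3 = orf3
    if start3 < end5:                   # overlapping pairs are not handled here
        return False
    if (start3 - start5) % 3 != 0:      # the two ORFs are in different frames
        return False
    # Every codon between the 5' stop codon and the 3' start must be a sense codon.
    for i in range(end5, start3, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return False
    return True


# Toy example: two short ORFs separated by linkers of different kinds.
orf_a = "ATG" + "GCT" * 5 + "TAA"       # 21 nt, ends with a stop codon
orf_b = "ATG" + "AAA" * 4 + "TGA"       # 18 nt

in_frame = orf_a + "GGTCCA" + orf_b         # 6-nt linker, no stop codon
with_stop = orf_a + "GGTTAGCCA" + orf_b     # 9-nt linker containing TAG in frame
out_of_frame = orf_a + "GGTC" + orf_b       # 4-nt linker shifts the frame

mergeable = can_merge(in_frame, (0, 21), (27, 45))
blocked = can_merge(with_stop, (0, 21), (30, 48))
shifted = can_merge(out_of_frame, (0, 21), (25, 43))
```

Only the first pair passes the test; in the second a stop codon in the linker terminates the merged frame, and in the third the 3' ORF lies in a different frame.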

Table 2 mORFs

Properties of yeast dORFs

We examined the core pool of dORFs as follows: (1) their distribution of disablements; (2) their homology trends; (3) their prevalent families; and (4) their chromosomal distribution.

Disablements

Most dORFs are substantially decayed. The distribution of the number of disablements is shown for the core pool of dORFs (in Figure 2(a)); 61% (60/98) have four or more disablements. In this set, 14 dORFs have one disablement, and eight of these have a single premature stop codon (Table 1). An additional seven dORFs that are homologous only to hypothetical yeast proteins have a single disablement (one with a premature stop).

The existence of dORFs with single stop codons could be of relevance to the effects of the [PSI+] prion. Therefore, we checked the dORFs that we found by re-sequencing them (described in the legend to Figure 1(a)). We were able to amplify PCR products for six dORFs that were in non-repetitive regions, and verified the premature stop codons for each of them.

Homology trends

For some insight into strain-specific variation, we looked in more detail at the homology relationships of the 98 core-pool dORFs. Over half (54%) of these dORFs are specific to the Saccharomyces cerevisiae species, having no homology to non-yeast proteins (Figure 2(b)).

Four-fifths of the known yeast proteins (MIPS ORF classes 1 to 3¹²) are homologous to a non-yeast protein. In comparison, only about two-fifths (41%) of the dORFs that are homologous to a known yeast protein are homologous also to a non-yeast protein (Figure 2(b)). These homology trends change only slightly (±2%) upon inclusion of the dORFs and pseudogenic fragments from the MIPS and Genolevures databases.

Furthermore, from the grand total of 221 dORFs, there are only a small number of dORFs (11) that correspond to "live" ORFs with no living relatives. One example is a very decayed reading frame of the KSH killer toxin corresponding to the single live KSH copy in the proteome (this protein also has no orthologs).

Prevalent families

Families of dORFs with three or more members are listed (Figure 1(c)). The family related to the growth inhibitor GIN11¹⁷ stands out as the largest (16 members). The large population of growth-inhibitor dORFs may indicate that these vary in copy number for different yeast strains. The next largest family is the flocculins. These proteins have a variety of roles related to cell-cell adhesion, and are involved in mating, invasive growth and pseudohyphal formation in response to environmental stresses.¹⁸ Pseudogenes for these have been discussed.¹⁹ Most important of these is FLO8, which has a single stop-codon mutation in the laboratory strain S288C that prevents flocculation and filamentous growth (Table 1).²⁰ There are also five DEAD-box helicase dORFs (which is an abundant ORF family in yeast, Figure 1(c)) and three for the SRP/TIP1 family, which are involved in environmental stress response.

Highly increased density of dORFs at telomeres

We observe a highly increased density of dORFs at the telomeres of the chromosomes (Figure 2(c)). Out of our core pool of 98 verified dORFs, 43 (44%) are subtelomeric, i.e. in the first and last 20 kb of the chromosomes. These include all of the dORFs for the two largest families, the flocculins and growth inhibitors noted in the previous section. If the 38 additional MIPS and Genolevures annotations are included, the proportion of dORFs in these telomeric intervals drops slightly (to 36%). An even larger number of dORFs occur in the subtelomeric regions that are homologous only to hypothetical proteins (64 in the first and last 20 kb of the chromosomes out of the total of 85 non-verified dORFs that we find). Also, a quarter (5/20) of the mORFs are in the first and last 20 kb of the chromosomes. In comparison, the proportion of total gene annotations in these 20 kb telomeric intervals is very small (~4%) (Figure 2(c)). These data indicate clearly the existence of a dynamically evolving subtelomeric subproteome in yeast.
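
The proportions quoted in this section can be reproduced with a few lines of Python. The sketch below is illustrative only: the helper `is_subtelomeric`, the rule that a feature counts as subtelomeric when any part of it lies within 20 kb of a chromosome end, and the example chromosome length are assumptions; the counts and the ~4% figure for gene annotations are taken from the text.

```python
def is_subtelomeric(start: int, end: int, chrom_length: int,
                    window: int = 20_000) -> bool:
    """True if any part of the feature lies in the first or last `window` bases.

    Coordinates are 0-based and half-open. Whether a feature must lie wholly
    or only partly inside the window is an assumption of this sketch.
    """
    return start < window or end > chrom_length - window


# (subtelomeric count, total count) as quoted in the text.
counts = {
    "core-pool dORFs": (43, 98),
    "dORFs homologous only to hypothetical proteins": (64, 85),
    "mORFs": (5, 20),
}
fractions = {name: sub / total for name, (sub, total) in counts.items()}

GENE_FRACTION = 0.04   # ~4% of gene annotations fall in these intervals (from the text)
fold_enrichment = fractions["core-pool dORFs"] / GENE_FRACTION

for name, frac in fractions.items():
    print(f"{name}: {frac:.0%} subtelomeric")
print(f"core-pool enrichment over genes: about {fold_enrichment:.0f}-fold")
```

On these numbers, core-pool dORFs are roughly 11 times more concentrated in the 20 kb telomeric intervals than gene annotations are; since the ~4% figure is approximate, so is this ratio.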

Expression of dORFs

We tested a small random sample of 11 dORFs for expression (Figure 2(d)). Four of these showed appreciable expression, even though one has two disablements, and the other three have five or more disablements. Two of these four dORFs are subtelomeric (within 20 kb from chromosome ends), and homologous to putative hypothetical ORFs, representing dORF families of nine or more members. The other two are single dORFs with moderate sequence similarity to two annotated ORFs, both with five or more disablements; it is intriguing that we can still detect expression of these dORFs, an observation suggesting that these sequences, at minimum, possess functional promoters, and are still detectable despite nonsense-mediated decay (NMD).²¹

Implications for proteome evolution

A dynamically evolving subtelomeric subproteome and its role in strain-specific variation

The total pool of dORFs and pseudogenic fragments corresponds to only a very small percentage of the total annotated proteome (~3%). However, the distribution of these dORFs, both in terms of homology and chromosomal position, details an important perspective on yeast proteome evolution.

In the present study, we have found that dORFs are half as likely to be related to a non-yeast protein (~40% of dORFs) as is the average known yeast protein (80% of annotated ORFs). This comparison implies that there has been no major change in the recent evolutionary dynamics of the yeast proteome. That is, it appears that disablement attacks evolutionarily young ORFs preferentially as opposed to ancient ORFs that are conserved between species. Also, there is a dramatically increased density of dORFs near the telomeres; as noted above, the two largest families of dORFs (flocculins and growth inhibitors) are subtelomeric and are related to subtelomeric ORFs. Additionally, a third interesting subtelomeric family that is classed as hypothetical but has a large number of dORFs (six compared to 21 live ORFs) is the DUP family of putative membrane proteins, which has an InterPro motif,²² and whose expression may be pheromone-responsive.²³ It is interesting to note that subtelomeric regions can be meiotic recombination "coldspots".²⁴

We have shown that some dORFs can still be expressed despite their disabled state, and may be more refractive to NMD in some way. This implies that such dORFs are still live to some extent, and represent a store of coding information.

Implications for the effects of the [PSI+] prion

[PSI+] is an inheritable phenomenon in yeast that is caused by the propagation of an alternatively folded, amyloid-like form of the Sup35p protein.²⁵,²⁶ Sup35p is part of the surveillance complex in yeast that controls mRNA NMD and translation termination.²⁷ The occurrence of the [PSI+] prion in a yeast strain thus can lead to decreased translation termination efficiency as a result of stop-codon readthrough (SCRT), and increase the likelihood that a protein will be formed from a dORF with a premature stop codon. SCRT for the ade gene has been used since the mid-1960s as the standard protocol to detect the presence of [PSI+].²⁵,²⁹ Different yeast strains show widely varied phenotypes for growth and viability in different environments depending on whether [PSI+] is present.¹⁵,²⁷ Thus, arguably, different levels of increased SCRT in yeast strains may be involved in causing this prion-engendered variability. It is possible that ribosomal frameshifting may be under the influence of the surveillance complex and consequently of [PSI+].²⁹ Although the sequenced yeast strain S288C is not a potent carrier of [PSI+], we examine below the size and make-up of our yeast dORF pool, particularly those that involve one stop codon, for the implications of [PSI+]-engendered phenotypic diversity in yeast.

The highest levels of [PSI+]-related SCRT for yeast strains that we can find in the literature are ~30%,²⁷,²⁹ with base-line levels in [psi-] cells of up to 5%.²⁷,²⁹ This implies that, assuming SCRT events are independent, ORFs with two or more stop codons are unlikely to produce substantial levels of encoded protein, even with [PSI+].
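
The independence argument can be made concrete with a short calculation. The following is a minimal Python sketch, assuming that every premature stop codon in an ORF is read through independently and with the same probability; the function name is hypothetical, and the two rates are the ~30% and 5% levels quoted above.

```python
def full_length_fraction(readthrough_rate: float, n_stops: int) -> float:
    """Fraction of translation events that read through all n_stops stop codons.

    Assumes independent read-through events that share a single rate.
    """
    if not 0.0 <= readthrough_rate <= 1.0:
        raise ValueError("readthrough_rate must lie between 0 and 1")
    if n_stops < 0:
        raise ValueError("n_stops must be non-negative")
    return readthrough_rate ** n_stops


# Rates quoted in the text: ~30% (highest reported with [PSI+]), up to 5% in [psi-] cells.
for label, rate in (("[PSI+], ~30%", 0.30), ("[psi-], 5%", 0.05)):
    for n in (1, 2, 3):
        print(f"{label}: {n} stop(s) -> {full_length_fraction(rate, n):.4f}")
```

Under these assumptions, a second premature stop codon reduces the expected yield of full-length product to about 9% of translation events at the highest reported SCRT level, and to 0.25% at the base-line level.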

Consequently, we can use our data to estimate the size of the pool of sequence entities in a yeast strain that could be affected by SCRT caused by [PSI+]. We find that there is only a rather small cohort of 35 protein sequences that could be acted on readily by [PSI+] in this way. This comprises the set of all dORFs with a single premature stop codon, plus the mORFs that we detected (see the inset in Figure 1(c) for an explanation of this data set). This set of 35 entities corresponds to less than 1% of the whole yeast proteome. Its small size suggests that minor extensions to existing annotated ORFs that are not detectable by homology may play a role in engendering phenotypic diversity in yeast.¹⁵,²⁷ On average, a yeast ORF would be extended by 17 (±24) amino acid residues by SCRT; this may be long enough to add an additional secondary structure to a domain or a transmembrane helix.

The dORFs with a single stop codon (in Table 1), and the prevalent dORF families (Figure 1(c)) show characteristics that may be relevant to phenotypes arising from SCRT. As the presence of [PSI+] produces widely different growth phenotypes for different yeast strains, the number and state of decay of dORFs of the growth inhibitors (related to Gin11p) may have a bearing on [PSI+] strain-specific growth-rates.¹⁵ The dORFs related to SRP stress-response proteins may have a role in cold-shock response. Of the single stop-codon dORFs that we observe, an extra viable copy of the fermentation enzyme aryl-alcohol reductase or of the drug-resistance pump SGE1 (Table 1) may prove beneficial for growth on different media. Finally, variation in flocculence (clumping from cell-cell adhesion) was observed in the recent study by True & Lindquist¹⁵ on phenotypic diversity engendered by [PSI+]. Here, flocculins (which cause such cell-cell adhesion¹⁹) comprise a large dORF family (Figure 1(c)), including three singly disabled dORFs. Variability in the number of distinct flocculins may help maintain a degree of strain-specific variation in cell adhesion properties. Flocculins are involved also in environmental stress response.¹⁸

We have detected mRNA transcripts corresponding to four dORFs possessing varying degrees of coding disability (Figure 2(d)). From this observation, we can suggest that the dORFs are real sequence entities and that disablements in coding sequence do not necessarily prohibit corresponding detectable mRNA sequence expression. The detected expression may imply that some dORFs are more refractive to NMD in some way, or may be interesting candidates for more detailed and comprehensive study of SCRT and the potential effects of [PSI+].

There are some interesting examples of mORFs that may have relevance for [PSI+] phenotypic diversity effects (Table 2). Note, however, that a large proportion of the ORFs involved (16/40) are hypothetical and that these mORFs may be complete ORFs in other yeast strains. For example:

YBR226C-YBR227C, a mitochondrial chaperone, can be read through into from a hypothetical protein (predicted to be mitochondrial³⁰); modification of the activity of this protein may affect mitochondrial protein homeostasis.

YHR057C-YHR058C, a peptidyl-prolyl isomerase can be N-terminally tagged onto a transcriptional regulation protein. These are clearly disparate functions; disruption of the latter ORF is lethal to yeast cells, so this fusion may decrease yeast-cell viability.

YER039C-YER039C-A, HVG1, which has strong similarity to vanadate-resistance protein (GOG5), can be read through into a short hypothetical protein (YER039C-A, 72 amino acid residues). This last pairing is particularly notable, since one yeast strain (with SCRT levels of ~26%) showed a decreased growth-rate in the presence of vanadate when carrying [PSI+].¹⁵ Also, HVG1 is the only paralog of GOG5 in the sequenced yeast strain S288C.

The mORFs we detected have linking nucleotide sequences of varying length (from one to 262 nucleotides, with a mean of 31). One could consider them as dORFs, albeit ones that arise only from two existing ORF annotations; we assume that such mORFs could be complete ORFs in another yeast strain.

Website

The dORF annotation data and sequences are available at the website http://genecensus.org/pseudogene (or http://bioinfo.mbb.yale.edu/genome/pseudogene).

Corresponding author

E-mail address of the corresponding author: mark.gerstein@yale.edu

Abbreviations used: ORF, open reading frame; dORF, disabled ORF; mORF, merged ORF; NMD, nonsense-mediated decay; SCRT, stop-codon readthrough

We thank Tricia Serio and Zhaolei Zhang for comments on the manuscript. A.K. is supported by a postdoctoral fellowship from the American Cancer Society. M.G. acknowledges support from the NIH protein structure initiative (P50 grant GM62413-01).

References