Genome Research -- Zhang et al. 12 (10): 1466


Identification and Analysis of Over 2000 Ribosomal Protein Pseudogenes in the Human Genome

Zhaolei Zhang, Paul Harrison, and Mark Gerstein¹

Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA

	ABSTRACT

Mammals have 79 ribosomal proteins (RPs). Using a systematic procedure based on sequence homology, we have comprehensively identified pseudogenes of these proteins in the human genome. Our assignments are available at http://www.pseudogene.org/ or http://bioinfo.mbb.yale.edu/genome/pseudogene. In total, we found 2090 processed pseudogenes and 16 duplications of RP genes. In relation to the matching parent protein, the processed pseudogenes have an average relative sequence length of 97% and an average sequence identity of 76%. A small number (258) of them do not contain obvious disablements (stop codons or frameshifts) and, therefore, could be mistaken for functional genes, and 178 are disrupted by one or more repetitive elements. On average, processed pseudogenes have a longer truncation at the 5' end than the 3' end, consistent with the target-primed-reverse-transcription (TPRT) mechanism. Interestingly, on chromosome 16, an RPL26 processed pseudogene was found in the intron region of a functional RPS2 gene. The large-scale distribution of RP pseudogenes throughout the genome appears to result, chiefly, from random insertions, with the number on each chromosome consequently proportional to its size. In contrast to RP genes, the RP pseudogenes have the highest density in GC-intermediate regions (41%-46%) of the genome, with the density pattern being between that of LINEs and Alus. This can be explained by a negative selection theory, as we observed that GC-rich RP pseudogenes decay faster in GC-poor regions. Also, we observed a correlation between the number of processed pseudogenes and the GC content of the associated functional gene, i.e., relatively GC-poor RPs have more processed pseudogenes. This ranges from 145 pseudogenes for RPL21 down to 3 pseudogenes for RPL14. We were able to date the RP pseudogenes based on their sequence divergence from present-day RP genes, finding an age distribution similar to that for Alus. The distribution is consistent with a decline in retrotransposition activity in the hominid lineage during the last 40 Myr. We discuss the implications for retrotransposon stability and genome dynamics based on these new findings.

	INTRODUCTION

All of the proteins in the cell are synthesized by the ribosomes, large complexes of RNA and protein molecules. A typical mammalian cell has about 4 × 10⁶ ribosomes, and each is composed of four RNA molecules (rRNA) and 79 ribosomal proteins (RPs). In total, ribosomes constitute about 80% of the RNA and 5%-10% of the protein in a cell (Kenmochi et al. 1998). Great progress has been made in recent years in elucidating the structure and mechanism of the ribosome. The peptide sequence of the complete set of mammalian RPs was deduced by Wool and colleagues (1995), and the genes encoding all human RPs have been positioned on the human genetic map (Kenmochi et al. 1998; Uechi et al. 2001; Yoshihama et al. 2002). Moreover, several high-resolution atomic structures are now available for archaeal ribosomes (Ban et al. 2000; Schluenzen et al. 2000; Wimberly et al. 2000; Yusupov et al. 2001).

Although it is well recognized that rRNA catalyzes the basic biochemistry of protein synthesis, ribosomal proteins are important in facilitating rRNA folding, protecting the rRNA from nucleases, and coordinating the multistep process of protein synthesis. Some RPs have substantial extra-ribosomal functions as well (Wool 1996). It is believed that RPs from all three kingdoms of life are related, probably having evolved from the same ancestral set of proteins after the conversion of the ribosome from an RNA complex to a ribonucleoprotein particle (RNP). Among eukaryotes, the number and sequence of cytoplasmic RPs are fairly well conserved. For instance, yeast and rat share all but one RP, and the sequence identity of their RPs ranges from 40% to 88%, with an average of 60%. Among mammals, the amino acid sequences of the RPs are almost identical. For example, for the 72 RPs for which amino acid sequences are available for both human and rat, the average sequence identity is 99%, and 32 of them are perfectly identical (Wool et al. 1995).

In the yeast cell, the 78 RPs are encoded by 137 genes; 59 of the genes are duplicated (Planta and Mager 1998). In all cases, both gene copies are transcribed, although their expression levels often differ considerably (Raue and Planta 1991). The proteins encoded by duplicated genes have identical or virtually identical sequences and are functionally indistinguishable. In contrast, it is widely recognized that in mammals a single gene encodes each RP, although most if not all of the RP genes have a number of processed pseudogenes located elsewhere in the genome. The existence of these pseudogenes has greatly hindered the sequencing and mapping efforts of human RP genes, so a special intron-trapping strategy had to be undertaken to differentiate the real transcribed RP genes from pseudogenes (Kenmochi et al. 1998; Uechi et al. 2001). A number of RP genes have also been implicated in various human diseases, such as RPS19 in Diamond-Blackfan anemia (DBA; Draptchinskaia et al. 1999), RPL6 in Noonan syndrome (Kenmochi et al. 2000), and the RPS4X gene in Turner's syndrome (Zinn et al. 1993).

In general, pseudogenes are disabled copies of functional genes that do not produce a functional, full-length protein (Vanin 1985; Mighell et al. 2000). The disablements can take the form of premature stop codons or frame shifts in the protein-coding sequence (CDS), or less obviously, deleterious mutations in the regulatory regions that control gene transcription or splicing. There are two main types of pseudogenes: duplicated (nonprocessed) and processed. Duplicated pseudogenes arise from genomic DNA duplication or unequal crossing-over. They have the same general structure as functional genes, with sequences corresponding to exons and introns in the usual locations. Processed pseudogenes result from retrotransposition, that is, reverse-transcription of an mRNA transcript followed by integration into genomic DNA, presumably in the germline. Because of their origin, processed pseudogenes are sometimes considered a special type of retrotransposon, just like Alu and long interspersed (LINE) elements, and are sometimes referred to as retro-pseudogenes. They are typically characterized by a complete lack of introns, the presence of small flanking direct repeats, and a polyadenine tract near the 3' end (provided that they have not decayed). Processed pseudogenes in general are not transcribed; however, in very rare cases, transcripts of some pseudogenes have been reported, although the functional relevance of these pseudogene transcripts remains unclear (McCarrey et al. 1996; Fujii et al. 1999; Olsen and Schechter 1999).

It is unclear how many pseudogenes exist in the human genome. Estimates for the number of human genes range from ~22,000 to ~75,000 (Crollius et al. 2000; Ewing and Green 2000; Lander et al. 2001; Venter et al. 2001; Harrison et al. 2002b). From previous reports, it is thought that up to 22% of these gene predictions may be pseudogenic (Lander et al. 2001; Yeh et al. 2001). It is important to characterize the human pseudogene population, as their existence interferes with gene identification and annotation. They are also an important resource for the study of the evolution of protein families, for example, studies on the human olfactory receptor subgenome (Glusman et al. 2001). Harrison et al. (2002a) performed a detailed analysis of pseudogenes on human chromosomes 21 and 22. It was discovered that the protein family that has the largest number of processed pseudogenes is RPs, a total of 43 of which were found on the two smallest human chromosomes. This extrapolated to over 2000 RP pseudogenes in the whole human genome.

We have developed a pipeline of mostly automatic procedures that enables us to discover and characterize pseudogenes quickly and comprehensively. Here we report the identification of over 2400 processed RP pseudogenes and pseudogenic fragments on the latest human genome draft sequence (Lander et al. 2001). Complete sequence and precise chromosomal location have been obtained for each pseudogene. We provide a comprehensive characterization of the human RP pseudogene population and discuss its implications for retrotransposition and genome dynamics.

	RESULTS

Human Genome Has 2090 RP Processed Pseudogenes

We have conducted a comprehensive search for cytosolic RP pseudogenes on the August 2001 freeze of the human genome draft (Lander et al. 2001). Details of the annotation procedure are described in the Methods section, and a flow chart is shown in Figure 8A below. Table 1 shows the distribution of identified RP pseudogenes among 22 autosomes and two sex chromosomes, together with the length of each chromosome and the number of functional RP genes previously mapped onto it (Kenmochi et al. 1998; Uechi et al. 2001; Yoshihama et al. 2002). Some general statistics of the processed pseudogene population are shown in Table 2. A total of 2090 processed RP pseudogenes were identified in the whole human genome. The substantial majority (1912) of these are termed "intact" pseudogenes because they are continuous in sequence with insertions shorter than 60 bp, whereas the remaining 178 are disrupted by long insertions in the middle of their sequence. The majority (146 of 178) of these disruptions are caused by the insertions of one or more retrotransposons, Alu, or less often, LINE elements.

Table 1. Number of RP Pseudogenes on Each Chromosome

Table 2. Overall Statistics of RP Processed Pseudogenes

358 Pseudogenic Fragments

We also found 358 pseudogenic fragments, which are continuous in sequence but whose conceptual translations cover less than 70% of a full-length RP peptide. On average these fragments match 40% of the full-length RPs with an average amino acid sequence identity of 74.2% (see Table 2). There are three possible explanations for these short fragments. (1) They could have originally been individual exons of duplicated RP genes. (2) They could have been intact processed pseudogenes that later became truncated by spontaneous DNA deletion or retrotransposon insertion. (3) They could have been caused by premature termination of the reverse transcription process, which would lead to incomplete incorporation of cDNA into the chromosome. Because reverse transcription starts at the 3' end (poly-A tail), such premature truncation would tend to occur at the 5' end of the cDNA sequence. The first scenario involves duplicated RP genes, and the last two scenarios assume a processed origin for the pseudogenic fragments. We believe the last two are more likely because there is evidence for both hypotheses. For most of these pseudogenic fragments, we could locate a retrotransposon within 300 bp on the chromosome, with the average distance between the fragments and the retrotransposon being 108 bp. This close proximity strongly indicates retrotransposon insertion events in past evolution, which caused the RP pseudogene truncation. Also, the average truncation at the 5' end for these fragments is almost twofold longer than at the 3' end (227 vs. 127 bp), which is consistent with the mechanism of target-primed reverse transcription (Table 2). Based on these arguments, we counted these pseudogenic fragments as processed when we computed pseudogene density (see Table 1 footnote), but in general these fragments were treated separately from the full-length processed pseudogene population. As the total number of these fragments is much smaller than the number of processed pseudogenes (358 vs. 2090), excluding them from the processed pseudogene counts does not affect the conclusions one way or another.

Kenmochi and colleagues sequenced most of the 80 human RP genes and mapped them onto individual cytogenetic bands (Kenmochi et al. 1998; Uechi et al. 2001; Yoshihama et al. 2002). In our present search for processed pseudogenes, 72 of these 80 RP genes were located and their cytogenetic locations were confirmed. In addition, 16 duplicated copies of these RP genes were identified, mostly in the neighboring region of the original RP genes.

Overall Statistics of the Processed Pseudogenes

Because the ribosomal proteins are of various lengths, we measure sequence completeness by defining relative length as the ratio between the length of the translated pseudogene and the length of the corresponding functional ribosomal protein. In general, the RP pseudogenes are well preserved, as they tend to be almost full-length in their coding regions (96.5%), with high sequence identity in terms of both translated amino acid sequence (76.2%) and also underlying nucleotides (86.8%). Figure 1A illustrates the distribution of the relative sequence length of processed pseudogenes. Surprisingly, although we used 70% as a threshold to separate the processed pseudogenes from pseudogenic fragments, the CDSs of the majority of the processed pseudogenes (>90% of the set) are practically full-length. It is known that LINE1 reverse-transcriptase (RT) has a low efficiency that often leads to 5' truncation and thus incomplete insertion of transcripts. It is a little surprising that we have observed such a high percentage of near-complete pseudogenes, but it is probably because RT truncations mostly occurred in the 5' UTR instead of the protein-coding region. Figure 1B shows the distribution of DNA sequence identity between processed pseudogenes and the RP cDNA sequences. Figure 1C shows the distribution of number of disablements (premature stop codons and frame shifts) per pseudogene, with the y-axis plotted in log scale. Of the 1912 "intact" processed pseudogenes (Table 1), 258 (13%) do not contain any disablements; therefore they could potentially be mistaken for functional genes by some automatic gene prediction algorithms. The graph shows an exponential relationship. A similar exponential relationship was observed in a smaller set of human olfactory pseudogenes (~600; Glusman et al. 2001), and was interpreted in such a way as to support an alternative origin for olfactory receptor pseudogenes other than gene duplication or retrotransposition.
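
The two per-pseudogene measures used above, relative length and number of disablements, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline described in the Methods section: the function names are hypothetical, and the convention that a translated sequence marks stop codons with `*` and frameshifts with `/` or `\` is assumed for the example. The 70% threshold is the one quoted in the text.

```python
FRAGMENT_THRESHOLD = 0.70  # from the text: below 70% of the parent length = fragment


def relative_length(pseudogene_aa_len: int, parent_aa_len: int) -> float:
    """Ratio of translated pseudogene length to functional RP length."""
    if parent_aa_len <= 0:
        raise ValueError("parent protein length must be positive")
    return pseudogene_aa_len / parent_aa_len


def classify_by_length(rel_len: float) -> str:
    """Separate processed pseudogenes from pseudogenic fragments."""
    if rel_len >= FRAGMENT_THRESHOLD:
        return "processed pseudogene"
    return "pseudogenic fragment"


def count_disablements(translation: str) -> int:
    """Count premature stops and frameshifts in a translated sequence.

    Assumed encoding (hypothetical): '*' = stop codon, '/' or '\\' = frameshift.
    A single trailing '*' is treated as the normal stop and not counted.
    """
    if translation.endswith("*"):
        translation = translation[:-1]
    return sum(translation.count(symbol) for symbol in ("*", "/", "\\"))


# Usage with made-up values: a 194-residue translation of a 200-residue RP.
rel = relative_length(194, 200)
label = classify_by_length(rel)
n_disabled = count_disablements("MKV*LR/AGT*")
```

A pseudogene for which `count_disablements` returns 0 corresponds to the 258 cases noted above that an automatic gene predictor could mistake for functional genes.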

Figure 1 RP processed pseudogenes statistics. (A) Distribution of relative sequence length among processed pseudogenes. Relative sequence length is the ratio between the length of translated pseudogene and the length of the corresponding functional ribosomal protein. (B) Distribution of the DNA sequence identity between processed pseudogenes and the cDNA sequence of functional RP proteins. (C) Distribution of number of disablements among processed pseudogenes.

We also checked for the existence of a polyadenine tail in our processed pseudogene set. Of the 2090 processed pseudogenes, 952 (45.5%) have no obvious polyadenine tail of at least 30 bp detected (see Methods section), and 176 (8%) have both a poly-A tail and a polyadenylation signal (mostly AATAAA) within 50 bp of the poly-A tail. Thirty-two pseudogenes (1.5%) have a poly-A tail and a polyadenylation signal 50-100 bp upstream; 903 pseudogenes (44.5%) have only a poly-A tail with no detectable polyadenylation signal. We are confident in our assignment of processed pseudogenes; the lack of a poly-A tail for about half of the assigned processed pseudogenes can be explained by decay of the genome sequence through nucleotide substitutions. Harrison et al. (2002a) found polyadenylation for only 52% of the processed pseudogenes on chromosomes 21 and 22, which is similar to the ratio we found here for RP pseudogenes.

Distribution of Pseudogenes Among Chromosomes

Unlike in prokaryotes, where the RP genes are organized into operons, the distribution of RP genes among human chromosomes is dispersed but not random (Feo et al. 1992; Kenmochi et al. 1998; Uechi et al. 2001; Yoshihama et al. 2002). Every human chromosome except chromosomes 7 and 21 contains at least one RP gene. Chromosome 19, one of the smallest chromosomes, contains as many as 13 RP genes (Table 1). Such a high density of RP genes on chromosome 19 can be explained by the high chromosome GC content, which results in unusually high gene density (Mouchiroud et al. 1991; Lander et al. 2001; Venter et al. 2001). The distribution of processed RP pseudogenes in the human genome appears more random and uniform than that of their functional counterparts (Fig. 2). The abundance of processed pseudogenes on each chromosome is proportional to the chromosome length (Fig. 3A), with a correlation coefficient of 0.89 (P<1E-8). Including pseudogenic fragments in the set has no noticeable effect on this result.

Figure 2 The human RP processed pseudogene population. Twenty-four human chromosomes are shown vertically from left to right. Pseudogenes are represented as short blue horizontal bars; long thick red horizontal bars delimit centromere region. Red dots represent chromosome ends.

Figure 3 (A) Correlation between chromosome length and number of processed RP pseudogenes on them. Each black diamond represents a chromosome. The correlation between number of processed pseudogenes on each chromosome and chromosome length is 0.89, P<1E-8. (B) Processed pseudogene density on each chromosome is correlated with the chromosome GC content. The correlation coefficient is 0.51, P<0.01.

We further calculated the RP pseudogene density (number of pseudogenes per Mb) for each chromosome and plotted it against chromosomal GC content (Fig. 3B), which shows a weak positive correlation (correlation coefficient = 0.51, P<0.01). The outlier on the bottom of the graph is the sex chromosome Y, which has the lowest pseudogene density even for its relatively low GC content. Chromosome Y is unusual in many ways, as it also has the lowest density for Alu repeats (Lander et al. 2001); those authors suggested that these phenomena might be related to the high tolerance for DNA insertion and deletion and rapid gene turnover rate on this chromosome. If we weight the chromosome length by its GC content, then its correlation with the number of pseudogenes increases from 0.89 to 0.91 (P<1E-9). It is likely that the chromosomal GC content reflects the relative stability of the chromosome; that is, pseudogenes are more likely to be preserved on the chromosomes that have a slower gene turnover rate.
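
The chromosome-level comparisons above rest on two simple quantities, a per-Mb density and a Pearson correlation coefficient. The sketch below shows how they could be computed. The chromosome values are hypothetical toy numbers (not the Table 1 data), the function names are illustrative, and taking "length weighted by GC content" to mean the product of the two is an assumption, since the text does not spell out the weighting.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def density_per_mb(count, length_bp):
    """Number of pseudogenes per megabase of chromosome sequence."""
    return count / (length_bp / 1_000_000)


# Hypothetical toy values, NOT the Table 1 data.
lengths_mb = [250.0, 180.0, 130.0, 60.0]
pseudogene_counts = [170, 120, 95, 40]
gc_fractions = [0.41, 0.40, 0.42, 0.48]

# Correlation of pseudogene number with plain chromosome length.
r_length = pearson(lengths_mb, pseudogene_counts)

# Assumed weighting: chromosome length multiplied by its GC fraction.
weighted_lengths = [mb * gc for mb, gc in zip(lengths_mb, gc_fractions)]
r_weighted = pearson(weighted_lengths, pseudogene_counts)
```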

Genomic Distribution of Processed Pseudogenes

Using a 100-Kb-long nonoverlapping window, we divided the human genome into more than 30,000 segments and assigned them to five classes according to their average GC content. For each class, we also calculated the gene or pseudogene density by dividing the number of genes or pseudogenes by the amount of DNA in that class (Table 3). It is well established that in the human genome, gene density is strongly correlated with local GC content, with the GC-rich regions being mostly gene-dense (Mouchiroud et al. 1991; Lander et al. 2001; Venter et al. 2001). This is clearly the case for functional RP genes, as the GC-rich classes (>46%) contain the majority of the RP genes and have higher RP gene density. In contrast, the RP pseudogenes are enriched in classes with lower GC content; they have the highest density in the genomic region with intermediate GC content (41%-46%). In fact, the class that has the highest local GC content (>52%) contains the fewest pseudogenes, although it has the highest RP gene density. Similar genomic distributions have been reported for chromosome 22 with a smaller set of 114 pseudogenes (Pavlicek et al. 2001). Our results suggest that this is probably a general rule for all processed pseudogenes in the human genome.
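
The windowing and binning just described can be illustrated with a short Python sketch. It is a simplified stand-in rather than the procedure actually used: the function names are hypothetical, ambiguous bases are simply ignored, and of the four class boundaries only 41%, 46%, and 52% appear in the text; the lowest boundary (37%) is an assumption added to obtain five classes.

```python
from bisect import bisect_right

# Class boundaries as GC fractions. 0.41, 0.46 and 0.52 appear in the text;
# 0.37 is an ASSUMED lower boundary chosen only to make five classes.
GC_BOUNDS = (0.37, 0.41, 0.46, 0.52)


def gc_fraction(seq):
    """GC fraction over unambiguous bases; None if there are none (e.g. all N)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if gc + at else None


def window_gc(seq, window=100_000):
    """Yield the GC fraction of each nonoverlapping window along a sequence."""
    for start in range(0, len(seq), window):
        yield gc_fraction(seq[start:start + window])


def gc_class(gc, bounds=GC_BOUNDS):
    """Index (0 = GC-poorest, 4 = GC-richest) of the class for a GC fraction."""
    return bisect_right(bounds, gc)


def density_by_class(window_classes, feature_classes, window_mb=0.1):
    """Features per Mb in each class: feature count / amount of DNA in the class."""
    densities = {}
    for cls in set(window_classes):
        dna_mb = window_classes.count(cls) * window_mb
        densities[cls] = feature_classes.count(cls) / dna_mb
    return densities


# Usage with a tiny made-up sequence and a 4-bp window for readability.
toy_classes = [gc_class(gc) for gc in window_gc("GGCCATATGCAT", window=4)]
```

Dividing by the amount of DNA in each class, rather than by the number of pseudogenes overall, is what makes densities comparable between classes that cover very different fractions of the genome.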

Table 3. Genomic Distribution of RP Processed Pseudogenes

It has been proposed that the protein machinery encoded by the LINE1 element is involved in the origin of both the Alu repeats and LINE repeats (Feng et al. 1996; Jurka 1997; Weiner 1999) and the processed pseudogenes (Weiner 1999; Esnault et al. 2000). LINEs and Alus are the most frequent retrotransposons found in the human genome, occupying about 15% and 10% of the genome, respectively. LINEs (long interspersed elements) are about 6 kb long and encode two open reading frames (ORFs). Alus are a major class of SINEs (short interspersed elements), approximately 280 bp in length. Despite their common origin, the Alus in the human genome are predominantly found in GC-rich regions, whereas LINEs and processed pseudogenes are more prevalent in relatively GC-poor regions. In this sense, the distribution of Alus is more similar to that of genes than to that of pseudogenes. In Figure 4A, we plotted the RP pseudogene density along with the densities of functional RP genes, Alus, and LINEs. [The data for Alus and LINEs are from the results of Pavlicek et al. (2001).] Both the functional RP genes and the Alus are enriched in the GC-rich regions and depleted in the GC-poor regions. LINEs are predominantly found in genomic regions with the lowest local GC content. The distribution of RP pseudogenes falls between these extremes, as they have the highest density in the regions with intermediate GC content (41%-46%).

Figure 4 (A) Distribution of Alu elements, LINE elements, processed RP pseudogenes, and functional RP genes among genomic regions of different GC content. Because of their different abundance in the genome, these four species are plotted on different scales: number per 10 Kb for Alus and LINEs, number per Mb for RP pseudogenes, and number per 100 Mb for functional RP genes. (B) The drift in GC content for RP processed pseudogenes. (Black diamonds) The GC content of functional RP gene coding sequence (CDS). (Black squares) The GC content of processed pseudogenes. The vertical bars are standard errors.

Negative Selection Theory

The puzzling contrast between the genomic distribution of Alus and LINEs was recently explained by comparing the distribution of repeats of different age groups (Lander et al. 2001; Pavlicek et al. 2001). It has been observed that young Alus, similar to LINEs, were more frequently found in the GC-poor region compared to the more ancient Alu elements. Based on such findings, Pavlicek et al. (2001) proposed a negative selection theory, which hypothesized that the enrichment of Alus in the GC-rich region was the result of their higher stability in the compositionally matching environment. It is believed that when the retrotransposons were first integrated into the nuclear genome, both Alus and LINEs preferred a GC-poor (AT-rich) region because the LINE1 reverse-transcriptase/endonuclease specifically targets the TT|AAA insertion site. Because of the conspicuously higher GC content of Alus (~57%), their existence in GC-poor regions would destabilize the chromosome. Therefore, these Alus would be selected against: either they were lost or, perhaps more likely, their nucleotide composition drifted towards a lower GC level until they decayed into background genomic DNA and became unrecognizable.

We believe that the aforementioned negative selection theory can also explain the pseudogene density distribution illustrated in Figure 4A. The GC content of RP CDS ranges from 42% to 63% with the median at 51%, which is not as high as Alus, but still much higher than the LINE repeats (~42%) and the genome-wide average (~41%). The average GC content for the RP pseudogene sequences is 47%, which is intermediate between those of the functional RP genes and genomic DNA. Therefore, at least for RP pseudogenes, we have observed the drift in their GC content, which supports the negative selection hypothesis. We further divided RP processed pseudogenes into four groups according to the average GC content in the 100-Kb genomic region surrounding each pseudogene. For each group, we calculated the average GC content for both the pseudogene sequences and also the CDS of the functional RP genes they originated from. The results are plotted in Figure 4B, which clearly shows a greater drift for pseudogenes in the GC-poor region than in the GC-rich region; therefore, the pseudogenes in the GC-poor region appear more decayed than those in the GC-rich region. Such drift in nucleotide composition was previously reported for silent mutation sites in mammalian MHC gene sequences (Eyre-Walker 1999) and interspersed repeats in the human genome (Lander et al. 2001). In both studies, significantly more single nucleotide substitutions from G/C to A/T than from A/T to G/C have been observed. Despite the drift in composition, the majority of the processed RP pseudogenes still have GC content higher than their surrounding genomic sequences.

Age Distribution of Processed Pseudogenes

When mRNA transcripts were reverse-transcribed to become pseudogenes, they were immediately released from selection pressure. Therefore the number of mutations they accumulated during evolution could be used to infer their ages. Because mammalian RP sequences have stayed almost unchanged since rodents and primates diverged over 100 million years (Myr) ago (99% sequence identity between rat and human), we can safely use the present-day human RP sequences as a proxy for the ancient RP gene sequences to calculate the divergence of the processed pseudogenes. The percentage of sequence divergence was converted into approximate age in Myr by using a constant substitution rate of 1.5 × 10⁻⁹ per site per year (Li 1997). It is known that the substitution rate varies during evolution (Goodman et al. 1998; Lander et al. 2001); however, we believe that such simplified treatment is sufficient for our purpose.
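
The divergence-to-age conversion is a single division, sketched below as a worked example. It assumes the rate is 1.5 × 10⁻⁹ substitutions per site per year, that only the pseudogene lineage accumulates changes (so there is no factor of two), and that multiple substitutions at the same site are ignored; the function names are illustrative.

```python
SUBSTITUTION_RATE = 1.5e-9  # substitutions per site per year (Li 1997, as cited)


def divergence_to_age_myr(divergence, rate=SUBSTITUTION_RATE):
    """Convert fractional sequence divergence into an approximate age in Myr.

    Only the pseudogene is assumed to change relative to the parent gene,
    so the divergence is divided by the rate once (no factor of two).
    """
    if not 0.0 <= divergence <= 1.0:
        raise ValueError("divergence must be a fraction between 0 and 1")
    return divergence / rate / 1e6


def age_myr_to_divergence(age_myr, rate=SUBSTITUTION_RATE):
    """Inverse conversion: expected fractional divergence after age_myr."""
    return age_myr * 1e6 * rate


# 1% divergence corresponds to roughly 6.7 Myr; 40 Myr to about 6% divergence.
age_per_percent = divergence_to_age_myr(0.01)
divergence_at_40_myr = age_myr_to_divergence(40)
```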

The age distribution of human repetitive sequences has been analyzed (Smit 1999; Lander et al. 2001). Figure 5 shows the distribution of sequence divergences for RP pseudogenes together with LINE1 and Alu repeats; each 1% increment in divergence represents roughly 6.7 Myr. The repeats data are from Arian Smit (pers. comm.). Processed pseudogenes have an age distribution much more similar to that of Alu elements than to that of LINE1 elements, although they were all processed by the same LINE1 machinery. Note that LINE1s are mammalian-specific and Alus are primate-specific. The distribution for RP pseudogenes peaks at an evolutionary age corresponding to 8%-10% sequence divergence, whereas Alus peak at 7% and LINE1 elements peak at both 4% and 21%. Interestingly, RP pseudogenes also have a shoulder at 17%-18%, which could have been the consequence of the surge of LINE1 retrotransposition activity just a few million years before that. The rate of new processed pseudogenes generated in the human genome has slowed down since ~40 Myr ago, which was about the time when the human species diverged from gibbons. This coincides with the decline of new LINE1 elements and Alus in the genome. It has been proposed that the structure and dynamics of hominid populations are responsible for such a decline in retrotransposon activity (Lander et al. 2001).

Figure 5 Distribution of sequence divergence for RP processed pseudogenes in comparison with Alu and LINE1 repeats. Pseudogenes and repeats were grouped into bins according to their sequence divergence from consensus sequences. Each 1% increment in divergence represents roughly 6.7 million years (Myr). The LINE and Alu data are from A. Smit (pers. comm.).

GC-Poor RP Genes Have More Processed Pseudogenes

Table 4 lists the number of processed pseudogenes among 79 RPs, sorted in descending order. The first two columns list the SWISSPROT ID (Bairoch and Apweiler 2000) for the human RPs, and the standard mammalian RP gene nomenclature (Mager et al. 1997). Also listed are the lengths of RP mRNA transcripts, coding sequence (CDS), and the CDS GC content, all retrieved from GenBank. On average, 26 processed pseudogenes are found for each RP gene; however, different RP genes clearly have very different propensities for generating processed pseudogenes. The distribution of numbers of processed pseudogenes among RP genes is strikingly skewed, although presumably for each RP only one functional gene exists (Wool et al. 1995). RPL21 has the most copies of processed pseudogenes at 145, far more than RPL23A, which has the second-most at 85. Meanwhile, 24 RP genes have fewer than ten copies of processed pseudogenes each, and RPL14 has the fewest at three. Regarding the RP genes that have the greatest numbers of processed pseudogenes, we also checked their chromosomal locations to make sure that they were not created from genomic duplication; that is, these processed pseudogenes arose mostly independently.

Table 4. Distributions of Processed Pseudogenes Among RP Genes

We were curious as to whether the differing processed pseudogene abundance among RP genes is correlated with the recent decline in retrotransposition activity. We further divided the processed pseudogenes originating from the same RP gene into three groups according to their ages: <40 Myr, 40-80 Myr, and >80 Myr (Fig. 6A). The age distribution of processed pseudogenes is similar for all 79 RP genes; that is, there were no preferences for a certain group of RP genes in different evolutionary periods. The correlation between the number of young pseudogenes (<40 Myr) and number of mid-age pseudogenes (40-80 Myr) per RP gene is 0.73 (P<1E-13); the correlation between mid-age pseudogenes and old pseudogenes (>80 Myr) is 0.68 (P<1E-11).

Figure 6 (A) Distribution of processed pseudogenes among RP genes. Bars of different shades represent different age groups. (B) Lack of correlation between mRNA transcript length and number of processed pseudogenes. The pseudogenes are grouped into bins according to the length of their mRNA transcripts. Vertical bars are standard errors. (C) Significant inverse correlation between GC content of RP gene coding sequence (CDS) and number of processed pseudogenes for that RP. The RP genes are grouped into four bins according to their CDS GC content.

It is also plausible that the differences in pseudogene abundance merely reflect the different ages of individual RP genes, as presumably genes that have been around longer will have had more chances of being reverse-transcribed to generate pseudogenes. To check this, we grouped RP genes into three groups according to their phylogenetic profile; that is, some RP genes are unique to eukaryotes while others have homologs in eubacterial and archaebacterial kingdoms (Wool et al. 1995). There appears to be no correlation between processed pseudogene abundance and the degree of ubiquity. Within eukaryotes, we also looked at the sequence identity between yeast RPs and human RPs; no correlation was found there either. The pseudogene abundance also has no correlation with the extra-ribosomal function of some of the RP genes (Wool 1996).

Goncalves et al. (2000) analyzed 249 processed pseudogenes, which correspond to 181 functional genes, and concluded that human genes that gave rise to processed pseudogenes in general share four features. They are (1) widely expressed, especially in the germline, (2) highly conserved, (3) short, and (4) GC-poor. The first two criteria are trivial for ribosomal proteins, as RPs are ubiquitous in all cell types, and they are also the most highly conserved among eukaryotes and mammals (Wool et al. 1995). In general, RP genes have short mRNAs and short CDSs as seen in Table 4, although there is no significant correlation between the number of processed pseudogenes and the mRNA length (correlation 0.01, P<0.93) (Fig. 6B) or the CDS length (correlation 0.04, P<0.73). We would like to emphasize the lack of obvious correlation between gene length and pseudogene abundance, as it demonstrates that our pseudogene searching procedure did not systematically miss short pseudogenes; that is, the skewed pseudogene distribution is not an artifact. However, there is a significant inverse correlation between the number of processed pseudogenes and the GC content of RP gene CDS (correlation -0.41, P<0.0002) as shown in Figure 6C; that is, relatively GC-poorer RP genes tend to have more processed pseudogenes than GC-richer ones. The mechanism behind the enrichment for the relatively GC-poor RP genes is not immediately obvious, since the origin of a processed pseudogene involves multiple steps and the selection for GC-poor RP genes could have occurred at any step along the way. More on this topic will be discussed in the Discussion section.

Nonprocessed Pseudogenes and Duplicated RP Genes

We found only 16 duplicated RP genes in the human genome (Table 5), which share identical exon structure with previously characterized RP genes (Kenmochi et al. 1998; Uechi et al. 2001). This is in sharp contrast to the yeast genome, where most RP genes are duplicated and the duplicated genes are also transcribed and functional. Only one duplicated gene in the human genome (RPL13A) has an obvious disablement in the coding region; it is possible that other duplicated RP genes have hard-to-detect disablements in the UTR regions or introns. It is not clear whether these duplicated RP genes are transcribed in the cell, although it is generally assumed that only one gene is functional for each ribosomal protein (Wool et al. 1995; Kenmochi et al. 1998). The majority of the duplicated genes are in the vicinity of the original genes, and therefore could not have been resolved from the original genes in the hybridization experiments. There are notable exceptions: RPL26, RPS27, and RPL3 have duplicated copies on separate chromosomes, and RPS4Y has a duplicated copy on the opposite end of chromosome Y. Interestingly, the duplicated copies of the RPL26, RPS27, and RPL3 genes have much longer introns than the mapped genes, caused by insertion of Alu or LINE repeats (with the exception of RPS27). It is likely that the sequence difference in the intron regions is the reason that they were missed in the hybridization experiments, even though they are far apart from the mapped RP genes. Detailed analysis of these duplicated genes will be described in subsequent reports.

Table 5. Duplicated Human RP Genes

Our homology matching procedure located at least one intron-containing functional gene for all but eight RP genes: RPP2, RPL4, RPL30, RPL35A, RPL38, RPL41, RPS7, and RPS27A. We did, however, find processed pseudogenes for these RP genes in the genome. These genes either consist of short exons or their protein sequences are predominantly low-complexity, making them difficult to find by homology matching.

It was surprising to discover a processed RPL26 pseudogene in the intron region of the functional RPS2 gene on chromosome 16 (band p13.3, Contig AC005363.1.1.75108, Ensembl ID ENSG00000140988). The RPS2 gene has seven exons; the pseudogene resides in the third intron (1015 bp long), between residues 89 and 90 in the RPS2 protein sequence. Interestingly, there is also an Alu element at the 3' end of the pseudogene, about 100 bp away. The pseudogene itself is 357 bp long, corresponding to residues 14 to 141 of RPL26, with amino acid sequence identity of 49% and nucleotide sequence identity of 73% (Fig. 7). It appears to be very ancient: it has already lost its poly-A tail and has a sequence divergence of 0.28, which corresponds to an age of more than 100 Myr. Figure 7 shows the alignment of RPL26 sequences from several eukaryotic organisms together with this pseudogene. At 11 positions, the pseudogene has the same residue as the mammalian sequences but not the invertebrates. Note that rat and human sequences are almost identical except at residue 100, where rat has an arginine and human has a histidine. Interestingly, this RPL26 pseudogene also has an arginine at that position (Fig. 7); this suggests that the pseudogene became part of the intron before the divergence of rodent and hominid species. It has been known that some RP genes contain Alu or LINE elements in the 3' or 5' UTR; to our knowledge this is the first case where a processed pseudogene is found in the intron region of another functional gene. This has implications for the origin and evolution of introns.

Figure 7 Amino acid sequence alignment of RPL26 genes from yeast, worm, fruit fly, rat, and human, and a processed pseudogene (chr16_RL26_5) found in the intron region of the human functional RPS2 gene. The residues highlighted in gray are those present in the pseudogene and also in both the mammalian and invertebrate proteins; the residues outlined in bold are those present in the pseudogene and the mammals but not in invertebrates. In the pseudogene sequence, * represents a stop codon, and an underscored amino acid indicates an adjacent frame shift. Rat and human RPL26 have almost identical sequences except at position 100, where the rat protein and the pseudogene have an Arginine and human protein has a Histidine.

Online Database

The data and results discussed in this report can be accessed online at http://www.pseudogene.org/ or http://bioinfo.mbb.yale.edu/genome/pseudogene/.

	DISCUSSION

Significance of RP Pseudogenes

Characterizing ribosomal protein pseudogenes is valuable in many ways. (1) It will be tremendously useful in the study of functional RP genes. RP genes are implicated in many human genetic diseases such as Diamond-Blackfan anemia (Draptchinskaia et al. 1999), Noonan syndrome (Kenmochi et al. 2000), and Turner's syndrome (Zinn et al. 1993). The precise nucleotide sequence and chromosomal location of RP pseudogenes will certainly help researchers in designing probes specific to functional genes. (2) Pseudogenes can also serve as genomic milestones, as they provide snapshots of RP sequences existing millions of years back in evolution. Such information will be valuable in studying ribosome biogenesis and the phylogenetic relationships between organisms. The discovery of an RPL26 pseudogene in the intron region of a functional RPS2 gene could certainly shed light on the evolution of both RP genes. (3) From the perspective of studying retrotransposition, processed pseudogenes are just a special type of repetitive element like Alus. However, processed pseudogenes are much more diverse in terms of sequence length, GC content, and other features than traditional retrotransposons, which makes them useful in studying the evolution and dynamics of genomes. To our knowledge, our RP pseudogenes are the largest set ever studied.

Comparing With Ensembl Annotations

The Ensembl database (http://www.ensembl.org/) is an automated system for genome-wide gene prediction and annotation, which has direct links to primary HGP data sources (Birney et al. 2001; Hubbard et al. 2002). The annotation process relies on matching genomic DNA sequence and GenScan peptides (Burge and Karlin 1997) with known proteins, mRNAs, and other sequence information. All of the genes were checked to be transcribed before they were included in the database (Daniel Barker, pers. comm.). As of the end of February 2002, there were approximately 47,000 annotated genes in Ensembl, of which 549 were annotated as ribosomal protein genes. Some of these have more detailed annotations associating them with a particular RP, such as "60S RIBOSOMAL PROTEIN L7", and others were described more loosely, such as "60S RIBOSOMAL PROTEIN". After re-aligning these genes with human RP protein sequences and removing some dubious matches, we derived a set of 481 Ensembl RP entries.

Ensembl does not explicitly differentiate between functional genes and pseudogenes, nor does it aim to (D. Barker, pers. comm.). Consequently, most of these 481 Ensembl RP entries turned out to be pseudogenes instead of functional genes, as only 260 (54%) translate to peptides longer than 95% of full-length ribosomal proteins. For instance, the gene ENSG00000150624 on chromosome 2 was annotated as "60S RIBOSOMAL PROTEIN L17", but produced a transcript that was only 51.6% of the full-length RPL17, and had sequence identity of 56.2%. Moreover, only 170 of these genes have introns; most of these Ensembl RP genes (64.6%) are single exons. We checked the overlap between our RP pseudogene sets and these Ensembl RP entries: 474 of 481 (98.5%) Ensembl RP entries have significant overlaps with our pseudogenes, and in most cases our pseudogenes were longer than the Ensembl entries. Five RPL41 single-exon processed pseudogenes from Ensembl were the only ones missed by our procedure. RPL41 is the shortest ribosomal protein, with only 25 amino acids; it also contains 17 near-consecutive arginine and lysine residues. It is likely that short length and low complexity caused BLAST to fail to detect these pseudogenes. Note that Ensembl is a database in flux; that is, the sequence and annotation are continuously updated and improved. Therefore some of the examples and statistics given above will probably be out of date when this report is published. Nonetheless, the overlap in annotation of genes and pseudogenes documented above is important, as it demonstrates the need to systematically include pseudogene identification in genome annotation efforts.

Automatic gene prediction programs alone do not have the ability to differentiate between functional genes and pseudogenes, especially if the pseudogenes do not contain obvious disablements in the coding sequence (CDS). Furthermore, for those pseudogenes that contain disablements, gene prediction programs either discard them or stop at the disablement and predict the pseudogene as a functional gene but with truncated length. We think this is the reason that so many RP pseudogenes were passed into the Ensembl database as functional genes. The number of genes in the human genome has long been a matter of debate, as different methods such as EST analysis and GenScan (Burge and Karlin 1997) gave different estimates (Harrison et al. 2002b). It is probably not appropriate to extrapolate the overestimation for RP genes onto the whole human proteome, as ribosomal proteins are an unusual protein family in many ways. Nevertheless, special care should be taken in interpreting outputs from automatic gene prediction programs.

Pseudogene Abundance per RP Cannot Be Explained by Positive Selection

As mentioned previously, we found an inverse correlation between RP gene GC content and the pseudogene abundance for that gene (Fig. 6C); that is, the relatively GC-poor RP genes tend to have more processed pseudogenes. Before we further discuss the possible mechanism behind this correlation, it would be well to give a brief overview of the LINE1-mediated retrotransposition process, which is believed to be responsible for generating processed pseudogenes (Kazazian and Moran 1998). LINE1-mediated retrotransposition can be divided into four steps. (1) First, a retrotransposon or gene is transcribed in the nucleus to produce an mRNA transcript. (2) Second, the mRNA transcripts are transported into the cytoplasm, and LINE1 mRNA transcripts are translated into two proteins: ORF1 (also known as p40), and ORF2, which is a reverse-transcriptase/endonuclease. (3) Human ORF1 has been demonstrated to be a sequence-specific single-strand RNA binding protein, which binds specifically but not exclusively to the LINE1 transcript to form a ribonucleoprotein particle (RNP) that also includes ORF2 protein (Leibold et al. 1990; Martin 1991; Hohjoh and Singer 1996, 1997b; Moran et al. 1996; Kazazian and Moran 1998). (4) Lastly, the RNP particle migrates into the nucleus and undergoes target-primed reverse-transcription, which gives rise to a new retrotransposon or processed pseudogene.

If the GC-poor RP genes were selected favorably in retrotransposition (i.e., there is a positive selection for them), it must have occurred in one of the four steps described above. However, we cannot find any evidence for such positive selection in any of the steps. In relation to step 1, we have compared the processed pseudogene abundance per gene with the mRNA expression level in human and yeast cells (see Methods). No significant correlation between the datasets was found, suggesting that the selection could not have occurred at the step of gene transcription. In relation to step 2, the lack of correlation between mRNA length and pseudogene abundance also suggested that the transportation of RP transcripts in and out of the nucleus had no effect on retrotransposition. This is based on the idea that longer mRNAs are harder to transport. In relation to step 3, the forming of the RNP particle, it has been demonstrated that the binding between ORF1 and mRNA transcript has a cis-preference; that is, ORF1 has higher affinity to wild-type LINE1 transcripts that encode it. However, at a much lower level, ORF1 or ORF1 and ORF2 together can also act in trans to retrotranspose mutant LINEs and other mRNA transcripts (Hohjoh and Singer 1997a,b; Esnault et al. 2000; Wei et al. 2001). It is not clear what sequence or structural features on the mRNA transcripts constitute the cis and trans preference, though it is unlikely that the overall GC content is the deciding factor, because Alu elements and LINE elements, the two most populous retrotransposons in the human genome, have very different GC content (56.8% for Alus and 42.3% for LINEs). Following the same reasoning, it is also unlikely that the reverse transcription in the fourth step has a preference for GC-poor transcripts.

Negative Selection for GC-Poor RP Genes in Retrotransposition

In the above analysis we found no evidence of a positive selection mechanism in retrotransposition of GC-poor RP genes; however, a negative selection mechanism can readily explain the skewed distribution. In this mechanism, the accumulation of GC-poor RP pseudogenes can be interpreted as the indirect result of a faster decay rate for GC-rich RP pseudogenes in the GC-poor genome region where they were originally inserted.

Analogous to the mechanism of enrichment of Alu elements in the GC-rich region, which we described earlier in this report, the existence of GC-rich RP pseudogenes in the GC-poor genomic region was more unfavorable than that of GC-poor RP pseudogenes. Thus there would be greater selection pressure against these GC-rich pseudogenes. Pavlicek et al. (2001) divided Alu and LINE elements into different age groups and studied their distribution in genome regions of different GC content. They showed that the young Alus (divergence <2% from consensus sequence) are indeed less depleted in the GC-poor region. This effect is not evident for older Alus (sequence divergence >4%). We did a similar age segmentation analysis on RP pseudogenes, with the results shown in Table 6. (The numbers in the table were not normalized by amount of DNA.) We found different results for young pseudogenes than described above for young Alus. For young pseudogenes, there is no indication of enrichment in the GC-poor region (where "young" here is defined as sequence divergence less than 2% from their parents, the same cutoff as used in the study of the Alus). Note, however, that there is a slight enrichment for the youngest pseudogenes, which have sequence divergence less than 1%, corresponding to an age of roughly 6.7 Myr. We think that the reason we did not observe the same behavior for young pseudogenes as for young Alus is the much smaller sample size for pseudogenes. In addition, the recent decline in retrotransposition activity in the human genome (Fig. 5; Lander et al. 2001) could have further complicated the situation, as fewer fresh pseudogenes were generated in the human genome.

Table 6. Genomic Distributions of RP Pseudogenes of Different Ages

In conclusion, the precise mechanism behind the negative correlation between gene GC content and processed pseudogene abundance remains unsettled until more pseudogene sequences from other protein families are available. As of this writing, based on the analysis of Alu elements and the elimination of positive selection mechanisms for RP pseudogenes, the negative selection mechanism appears attractive.

	METHODS

Six-Frame BLAST Search for Raw Fragment Homologies

Figure 8A is a flow chart describing our basic procedure for finding RP pseudogenes. We used the August 6, 2001 freeze of the human genome draft, downloaded from the Ensembl Web site (http://www.ensembl.org/). Subsequently, all of the chromosomal coordinates were based on these sequences. The amino acid sequences of the 79 ribosomal proteins were extracted from SWISSPROT (Bairoch and Apweiler 2000). Because the sequence identity between the two RPS4 isoforms (RS4_HUMAN and RS4Y_HUMAN) is very high (91%), only protein RS4_HUMAN was used in the BLAST search. Each human chromosome was split into smaller overlapping chunks of 5.1 million bp, and the tblastn program of the BLAST package 2.0 (Altschul et al. 1997) was run on these sequences. The genome sequence was not repeat-masked (A. Smit and P. Green, unpubl.) because we were concerned that some of the RP pseudogenes may reside in repetitive regions. Default SEG (Wootton and Federhen 1993) low-complexity filter parameters (12 2.2 2.5) were used in the homology search. We then picked the significant homology matches (e-value <1E-4), and reduced them for mutual overlap by selecting the matches in decreasing order of significance and removing any matches that overlap substantially with a picked match (i.e., more than ten amino acids or 30 base pairs).
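The overlap-reduction step described above can be illustrated with a minimal Python sketch. It is not the code used in this study; the `Hit` record, the function names, and the choice to compare only hits on the same chromosome and strand are assumptions made for illustration. Only the e-value cutoff (1E-4) and the 30-bp overlap limit come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one tblastn match (not a BLAST data structure)."""
    chrom: str
    strand: str
    start: int      # 1-based inclusive chromosomal start
    end: int        # 1-based inclusive chromosomal end
    evalue: float


def overlap_bp(a: Hit, b: Hit) -> int:
    """Number of shared base pairs; 0 if on different chromosomes or strands (assumption)."""
    if a.chrom != b.chrom or a.strand != b.strand:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def reduce_hits(hits, max_evalue=1e-4, max_overlap_bp=30):
    """Keep significant hits greedily, most significant first.

    A hit is dropped if it overlaps an already picked hit by more
    than max_overlap_bp base pairs.
    """
    significant = [h for h in hits if h.evalue < max_evalue]
    picked = []
    for hit in sorted(significant, key=lambda h: h.evalue):
        if all(overlap_bp(hit, p) <= max_overlap_bp for p in picked):
            picked.append(hit)
    return picked
```

Sorting by ascending e-value is what makes the selection proceed "in decreasing order of significance"; a weaker hit can only be rejected by a stronger one that was picked before it.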

Figure 8 (A) Flow chart of the procedure for searching for RP pseudogenes in the human genome. RP and ΨG denote "ribosomal protein" and "pseudogene", respectively. S-W., "Smith-Waterman". The steps are as follows: (1) Six-frame BLAST run searching for RP homologies in the human genome. (2) Merging and extension. BLAST hits were merged and extended on both sides to match the length of the RP peptide sequence. (3) Smith-Waterman realignment. Extended homologies were realigned with the RP sequence. (4) Comparison with Ensembl annotation. Five RPL41 pseudogenes from Ensembl were added to the set. A total of 2536 RP genes or pseudogenes were identified. (5) Checking for long gaps. Homology sequences that contained gaps shorter than 60 bp were labeled "intact processed pseudogenes" if they were longer than 70% of the full-length RP sequence; otherwise they were labeled "pseudogenic fragments". (6) Comparison with GenBank and cytogenetic mappings. For those RP homologies that contained long gaps (>60 bp), their sequences were compared with the RP exon structure from GenBank and their chromosomal locations were checked against the cytogenetic mapping. The homology sequences were assigned as functional RP genes, duplicated RP genes, and "disrupted processed pseudogenes." The latter were processed pseudogenes whose sequences were interrupted by retrotransposons. (B) Schematic graph describing the considerations in merging two adjacent RP matches, M1 and M2. (c₁₁, c₁₂) and (c₂₁, c₂₂) are chromosomal coordinates for M1 and M2. (q₁₁, q₁₂) and (q₂₁, q₂₂) are the corresponding regions on the query RP protein that they match.

Merging Adjacent Fragment Homologies Into Single RP Matches

After sorting the BLAST matches according to their starting coordinates on the chromosomes, we found many neighboring matches on the same chromosome that match the same RP. Some of these adjacent matches obviously were separate genes or pseudogenes, whereas others appeared to be part of the same gene or pseudogene. A two-step procedure was developed to determine (1) whether the neighboring matches belong to the same gene structure and (2) whether they should be merged together into a longer homology match.

Step (1): Consider two adjacent homology fragments, M1 and M2, which are on the same chromosomal strand and match the same RP (Fig. 8B). M1 has chromosomal coordinates (c₁₁, c₁₂) and matches amino acid sequence (q₁₁, q₁₂) on the query RP protein. Similarly, M2 has chromosomal coordinates (c₂₁, c₂₂) and matches amino acids (q₂₁, q₂₂) on the query protein. By convention, q₂₁ is always greater than q₁₁ and c₂₁ is always greater than c₁₂. If M1 and M2 satisfy the following two criteria, then we decide they belong to the same gene structure; that is, they are either two exons of the same gene or two fragments of the same pseudogene interrupted by insertions.

(1) |q₂₁ − q₁₂| ≤ max(20, 0.2 × L) and (2) c₂₁ − c₁₂ ≤ 5000 (L denotes the length of the query RP peptide sequence). The reasoning behind criterion (1) is that if the two homology fragments have too much overlap or have too long a gap between them on the query protein sequence, then they should be considered two separate and independent matches. Criterion (2) sets the maximum length of insertions in the middle of a pseudogene. We checked that the introns in the RP genes are all shorter than 5000 bp, so we would not have accidentally split a gene into two.

Step (2): If two homology fragments are determined to be part of the same gene or pseudogene structure in step (1), then in step (2) the fragments were merged only if the chromosomal distance between the matches was shorter than 60 bp; that is, c₂₁ − c₁₂ ≤ 60. The rationale behind such treatment was that if the gap between the matches were too long, then merging them together would generate errors in the Smith-Waterman realignment procedure described below. In addition, it has been shown that more than 95% of the introns in human are longer than 60 bp (Lander et al. 2001), and thus we would not have accidentally merged two exons together or included introns in the coding sequence.
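The two-step decision can be sketched in Python as follows. This is an illustrative reading of the criteria, not the original implementation: the `Match` record and function names are hypothetical, and because the comparison operators are not legible in every copy of the text, the sketch assumes non-strict inequalities (≤) for all three thresholds.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """Hypothetical record for one homology fragment (M1 or M2 in Fig. 8B)."""
    c_start: int   # chromosomal start (c_i1)
    c_end: int     # chromosomal end   (c_i2)
    q_start: int   # first matched residue on the query RP (q_i1)
    q_end: int     # last matched residue on the query RP  (q_i2)


def same_gene_structure(m1: Match, m2: Match, query_len: int) -> bool:
    """Step (1): do two adjacent same-strand matches belong to one gene structure?"""
    query_gap = abs(m2.q_start - m1.q_end)      # |q21 - q12|
    chrom_gap = m2.c_start - m1.c_end           # c21 - c12
    return query_gap <= max(20, 0.2 * query_len) and chrom_gap <= 5000


def should_merge(m1: Match, m2: Match, query_len: int, max_gap_bp: int = 60) -> bool:
    """Step (2): merge only if the chromosomal gap is short (assumed <= 60 bp)."""
    return same_gene_structure(m1, m2, query_len) and (m2.c_start - m1.c_end) <= max_gap_bp
```

Pairs that pass step (1) but fail step (2) are kept as separate segments of one structure, which is what later allows the gap between them to be examined as a possible intron or repeat insertion.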

Optimization From Smith-Waterman Alignment of Merged Matches

After merging, each match was extended on both sides to equal the length of the RP it matched, plus a buffer of 30 bp. For each extended match, the corresponding SWISSPROT protein sequences were then realigned to the genomic DNA sequence following the Smith-Waterman algorithm (Smith and Waterman 1981) by using the program FASTA (Pearson 1997). The reason for such an extension procedure is that BLAST may have skipped low-complexity segments in the query RP sequence; also, BLAST does not recognize frame shifts. After the realignment, the matches were "cleaned up": any redundant matches were removed, and matches that contained gaps longer than 60 bp were split up into two individual matches. Because sequence alignment programs sometimes tend to pick up some extra residues at the ends of the alignment, each alignment was filtered to remove dubious matches at the ends. At this step, we had a total of 2531 pseudogene candidates in the whole genome that matched the human RPs. Most of these were potential pseudogenes, but there could also be real functional RP genes in this set, because we did not exclude any matches based on disablement.

Deriving a Set of RP Genes From the Ensembl Database

We wanted to compare our pseudogene sets with the RP genes from the Ensembl database (http://www.ensembl.org/; Birney et al. 2001; Hubbard et al. 2002). As of the end of February 2002, there were approximately 47,000 confirmed genes, each with an annotated function. (Details regarding the Ensembl annotation procedure can be found in the aforementioned references.) We searched the Ensembl database and picked out 549 genes that have been annotated as ribosomal proteins. We then reannotated these genes by aligning them pairwise with human RP protein sequences, and picked out those Ensembl genes that had FASTA e-values lower than 0.0001. After removing a few remaining mitochondrial ribosomal protein genes, we had a set of 481 Ensembl nuclear RP genes.

In our examination of these Ensembl RP entries, it became obvious that most of these were pseudogenes rather than real functional RP genes, because they do not contain introns. We found that 474 (98.5%) of the 481 Ensembl RP genes have significant overlaps with our pseudogene sets. Five single-exon RPL41 pseudogenes from Ensembl were added to our pseudogene sets.

Assessing for Processing by Checking for Exon Structures

We divided our pseudogene population into two subsets based on whether they contained long gaps in the middle of the sequence (Fig. 8A). We labeled those pseudogenes as "processed" if they met two criteria: (1) they contained only gaps shorter than 60 bp, that is, c₂₁ − c₁₂ ≤ 60 in Figure 8B, and (2) they produced transcripts longer than 70% of the ribosomal protein they matched. Venter et al. (2001) also used the last criterion. We also checked in GenBank that all 79 ribosomal protein genes contain introns longer than 60 bp. The remaining single-exon pseudogenes, which are shorter than 70% of the full-length protein, were labeled "fragments". A total of 1912 "intact" processed pseudogenes and 358 pseudogenic fragments were identified at this step.
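As an illustration of this triage, the following Python sketch assigns a candidate to one of three bins. The function name, argument names, and return labels are hypothetical, and the handling of a gap of exactly 60 bp (treated here as short) is an assumption; the 60-bp and 70% cutoffs are those stated above.

```python
def classify_candidate(longest_gap_bp: int,
                       aligned_residues: int,
                       protein_length: int,
                       gap_cutoff_bp: int = 60,
                       min_fraction: float = 0.70) -> str:
    """Triage one RP homology match by gap length and relative length.

    Candidates with a long internal gap cannot be called from sequence
    alone: the gap may be an intron (functional or duplicated gene) or
    a repeat insertion (disrupted processed pseudogene), so they are
    deferred to the exon-structure and cytogenetic-map comparison.
    """
    if longest_gap_bp > gap_cutoff_bp:
        return "multi-segment: check exon structure"
    if aligned_residues / protein_length > min_fraction:
        return "intact processed pseudogene"
    return "pseudogenic fragment"
```

For example, a gap-free candidate covering 128 of 145 residues (about 88%) would be labeled an intact processed pseudogene, whereas one covering 60 of 145 residues would be labeled a fragment.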

For those pseudogene candidates that contained multiple segments separated by gaps longer than 60 bp (total of 266), it was not straightforward to determine whether they were of processed or nonprocessed origin because the gaps could be either introns or repeat insertions. It is also likely that there were real functional ribosomal protein genes in this group. The cytogenetic locations of the 80 human RP genes (including the isoform gene RPS4Y on chromosome Y) were previously mapped (Kenmochi et al. 1998; Uechi et al. 2001; Yoshihama et al. 2002). Using the cytogenetic map as reference and comparing the position of the gaps in the sequence with the exon structure of the functional RP genes, we identified 72 functional RP genes and 16 duplicated genes, and assigned the remaining 178 as "disrupted" processed pseudogenes. In summary, at the end of this process we had 2090 processed pseudogenes, 358 pseudogenic fragments, 72 functional RP genes, and 16 duplicated RP genes.

Further Verification of Processing by Poly-A Signal

When processed pseudogenes were integrated into the genome from mRNA, a polyadenine tail at the 3' end would also be included (Vanin 1985; Mighell et al. 2000). This polyadenine tail is at least 15-20 nucleotides long and is preceded by a polyadenylation signal (mostly AATAAA; Wool et al. 1995). We were interested to survey how many of the ribosomal pseudogenes still had the polyadenine tail. Following the procedure described by Harrison et al. (2001), we searched a 1000-bp region that was 3' to the pseudogene homology segment, with a sliding window of 50 nucleotides, for a region of elevated polyadenine content (>30 bp), and picked the most adenine-rich 50-bp segment as the most likely candidate. An interval of 1000 nucleotides was used because of the possible existence of 3'-untranslated regions (3'-UTRs); 90% of 3'-UTRs are of length less than 942 bp (Makalowski et al. 1996). In addition, we searched in the same 1000-bp region for candidate AATAAA or other polyadenylation signals and checked whether they were upstream of the candidate polyadenine tail site.
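The sliding-window scan can be sketched in Python as below. This is a simplified illustration rather than the procedure of Harrison et al. (2001): the function names are hypothetical, ">30 bp" is interpreted as more than 30 adenines within the 50-nt window, ties are resolved in favor of the first (most upstream) window, and only the canonical AATAAA signal is checked.

```python
def best_polya_window(downstream: str, window: int = 50, min_a: int = 30):
    """Return (offset, adenine_count) of the most A-rich window, or None.

    `downstream` is the sequence 3' to the pseudogene homology segment
    (1000 bp in the text). A candidate is reported only if its window
    holds more than `min_a` adenines (an assumed reading of ">30 bp").
    """
    seq = downstream.upper()
    if len(seq) < window:
        return None
    count = seq[:window].count("A")
    best_offset, best_count = 0, count
    for i in range(1, len(seq) - window + 1):
        # Slide by one base: add the entering base, drop the leaving one.
        count += (seq[i + window - 1] == "A") - (seq[i - 1] == "A")
        if count > best_count:
            best_offset, best_count = i, count
    return (best_offset, best_count) if best_count > min_a else None


def signal_upstream_of_tail(downstream: str, tail_offset: int,
                            signal: str = "AATAAA") -> bool:
    """True if the signal occurs and ends at or before the candidate tail window."""
    pos = downstream.upper().find(signal)
    return pos != -1 and pos + len(signal) <= tail_offset
```

Updating the count incrementally keeps the scan linear in the length of the downstream region instead of recounting every 50-nt window.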

Dating Processed Pseudogenes

Processed pseudogene sequences were aligned together with the corresponding functional RP gene sequences using the program ClustalW (Thompson et al. 1994). For each pseudogene, we calculated sequence divergence from the present-day RP gene with the program MEGA2 (Kumar et al. 2001), using the Kimura two-parameter model and pairwise deletion. Kimura's two-parameter model (Kimura 1980) corrects for transitional and transversional substitution rates while assuming that the four nucleotide frequencies are the same and rates of substitution do not vary among sites. Evolutionary ages were calculated by the formula T = D/k, where D is the corrected divergence and k is the mutation rate per year per site for nonfunctional sequences. A mutation rate of 1.5 × 10^-9 per site per year (Li 1997) was used.
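For concreteness, the dating arithmetic can be written out in a few lines of Python. This is a sketch of the standard Kimura two-parameter distance and of T = D/k, not a reproduction of MEGA2; the function names are hypothetical, and the inputs P and Q are the observed proportions of transitional and transversional differences in a pairwise alignment.

```python
import math


def k2p_distance(p_transitions: float, q_transversions: float) -> float:
    """Kimura (1980) two-parameter distance.

    d = -1/2 * ln(1 - 2P - Q) - 1/4 * ln(1 - 2Q)
    """
    p, q = p_transitions, q_transversions
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


def age_myr(divergence: float, rate_per_site_per_year: float = 1.5e-9) -> float:
    """Age in millions of years from T = D / k."""
    return divergence / rate_per_site_per_year / 1e6
```

With k = 1.5 × 10^-9, a divergence of 0.01 gives about 6.7 Myr and a divergence of 0.28 gives about 187 Myr, in line with the "roughly 6.7 Myr" and "more than 100 Myr" figures quoted in the text.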

Calculating Pseudogene Density In Different GC Regions

Each human chromosome was divided into consecutive, nonoverlapping segments 100 kb long. The GC content for each segment was calculated and the segment was assigned to one of five groups according to its GC content: <37%, 37%-41%, 41%-46%, 46%-52%, and >52%. The number of processed pseudogenes in each group was counted, and the pseudogene density for each group was calculated. Note that we used the same GC content boundaries that were used for isochore classification (Macaya et al. 1976; Bernardi 2000), although the validity of the isochore definition has been under debate (Bernardi 2001; Lander et al. 2001).
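A minimal Python sketch of this binning is given below. It is illustrative only: the function names are hypothetical, ambiguous bases (N) are excluded from the GC calculation, a pseudogene is assigned to the window containing its start coordinate, and a value falling exactly on a boundary is placed in the higher class. None of these details is specified in the text.

```python
from bisect import bisect_right

GC_BOUNDS = [0.37, 0.41, 0.46, 0.52]
GC_LABELS = ["<37%", "37%-41%", "41%-46%", "46%-52%", ">52%"]


def gc_fraction(seq: str):
    """GC fraction over unambiguous bases; None if the segment has none."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return None
    return (seq.count("G") + seq.count("C")) / acgt


def gc_class(fraction: float) -> str:
    """Map a GC fraction to one of the five classes."""
    return GC_LABELS[bisect_right(GC_BOUNDS, fraction)]


def density_by_gc_class(chrom_seq: str, pseudogene_starts, window: int = 100_000):
    """Pseudogenes per Mb in each GC class for one chromosome (0-based starts)."""
    bp = dict.fromkeys(GC_LABELS, 0)
    counts = dict.fromkeys(GC_LABELS, 0)
    for start in range(0, len(chrom_seq), window):
        segment = chrom_seq[start:start + window]
        fraction = gc_fraction(segment)
        if fraction is None:          # e.g. an all-N gap in the draft sequence
            continue
        label = gc_class(fraction)
        bp[label] += len(segment)
        counts[label] += sum(start <= p < start + window for p in pseudogene_starts)
    return {label: counts[label] * 1e6 / bp[label] for label in GC_LABELS if bp[label]}
```

Dividing counts by the total sequence length in each class, rather than by the number of windows, is the normalization that makes densities comparable between GC classes that occupy very different amounts of the genome.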

Expression Analysis

To investigate the possible correlation between the pseudogene abundance and the mRNA expression level, we compared the number of processed pseudogenes for each functional RP gene with its cellular mRNA expression level in the human cell (Yuval Kluger, pers. comm.) and the yeast cell (Cho et al. 1998). No significant correlation was found. Ribosomal protein genes are the most highly expressed genes in the cell; it is likely, in this case, that the overabundance of mRNA transcripts has made the expression level a nondeciding factor for RP pseudogene retrotransposition.

	WEB SITE REFERENCES

http://www.pseudogene.org/; Pseudogene database.

http://bioinfo.mbb.yale.edu/genome/pseudogene; Pseudogene database.

http://www.ensembl.org/; Ensembl database.

	ACKNOWLEDGMENTS

We thank Adam Pavlicek for carefully reading the manuscript and Arian Smit for providing the data on Alu/LINE sequence divergence. Mark Gerstein acknowledges NIH CEGS grant (P50 HG02357-01) for financial support. Zhaolei Zhang acknowledges Ted Johnson for doing the BLAST runs and thanks Paul Bertone, Ronald Jansen, Nick Luscombe, Yuval Kluger, and Jiang Qian for helpful discussions.

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.

	FOOTNOTES

¹ Corresponding author.

E-MAIL Mark.Gerstein@yale.edu; FAX (360) 838-7861.

Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.331902.

	REFERENCES

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402[Abstract/Full Text].
Bairoch, A. and Apweiler, R. 2000. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28: 45-48[Abstract/Full Text].
Ban, N., Nissen, P., Hansen, J., Moore, P.B., and Steitz, T. 2000. The complete atomic structure of the large ribosomal subunit at 2.4 A resolution. Nature 400: 841-847[CrossRef].
Bernardi, G. 2000. Isochores and the evolutionary genomics of vertebrates. Gene 241: 3-17[CrossRef][Medline].
-----. 2001. Misunderstandings about isochores. Part 1. Gene 276: 3-13[CrossRef][Medline].
Birney, E., Bateman, A., Clamp, M.E., and Hubbard, T.J. 2001. Mining the draft human genome. Nature 409: 827-828[Medline].
Burge, C. and Karlin, S. 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268: 78-94[CrossRef][Medline].
Cho, R.J., Campbell, M.J., Winzeler, E.A., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T.G., Gabrielian, A.E., Landsman, D., Lockhart, D.J. 1998. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell. 2: 65-73[Medline].
Crollius, H.R., Jaillon, O., Bernot, A., Dasilva, C., Bouneau, L., Fisher, C., Fizames, C., Wincker, P., Brottier, P., and Quetier, F. 2000. Estimate of human gene number provided by genome-wide analysis using Tetraodon nigroviridis DNA sequence. Nat. Genet. 25: 235-238[CrossRef][Medline].
Draptchinskaia, N., Gustavsson, P., Andersson, B., Pettersson, M., Willig, T.N., Dianzani, I., Ball, S., Tchernia, G., Klar, J., and Matsson, H. 1999. The gene encoding ribosomal protein S19 is mutated in Diamond-Blackfan anaemia. Nat. Genet. 21: 169-175[CrossRef][Medline].
Esnault, C., Maestre, J., and Heidmann, T. 2000. Human LINE retrotransposons generate processed pseudogenes. Nat. Genet. 24: 363-367[CrossRef][Medline].
Ewing, B. and Green, P. 2000. Analysis of expressed sequence tags indicates 35,000 human genes. Nat. Genet. 232: 232-233.
Eyre-Walker, A. 1999. Evidence of selection on silent site base composition in mammals: Potential implications for the evolution of isochores and junk DNA. Genetics 152: 675-683[Abstract/Full Text].
Feng, Q., Moran, J.V., LKazazian, H.H., and Boeke, J.D. 1996. Human L1 retrotransposon encodes a conserved endonuclease required for retrotransposition. Cell 87: 905-916[Medline].
Feo, S., Davies, B., and Fried, M. 1992. The mapping of seven intron-containing ribosomal protein genes shows they are unlinked in the human genome. Genomics 13: 201-207.
Fujii, G.H., Morimoto, A.M., Berson, A.E., and Bolen, J.B. 1999. Transcriptional analysis of the PTEN/MMAC1 pseudogene, psiPTEN. Oncogene 18: 1765-1769.
Glusman, G., Yanai, I., Rubin, I., and Lancet, D. 2001. The complete human olfactory subgenome. Genome Res. 11: 685-702.
Goncalves, I., Duret, L., and Mouchiroud, D. 2000. Nature and structure of human genes that generate retropseudogenes. Genome Res. 10: 672-678.
Goodman, M., Porter, C.A., Czelusniak, J., Page, S.L., Schneider, H., Shoshani, J., Gunnell, G., and Groves, C.P. 1998. Toward a phylogenetic classification of Primates based on DNA evidence complemented by fossil evidence. Mol. Phylogenet. Evol. 9: 585-598.
Harrison, P.M., Hegyi, H., Balasubramanian, S., Luscombe, N.M., Bertone, P., Echols, N., Johnson, T., and Gerstein, M. 2002a. Molecular fossils in the human genome: Identification and analysis of the pseudogenes in chromosomes 21 and 22. Genome Res. 12: 272-280.
Harrison, P.M., Kumar, A., Lang, N., Snyder, M., and Gerstein, M. 2002b. A question of size: The eukaryotic proteome and the problems in defining it. Nucleic Acids Res. 30: 1083-1090.
Harrison, P.M., Hegyi, H., Bertone, P., Echols, N., Johnson, T., Balasubramanian, S., Luscombe, N., and Gerstein, M. 2001. Molecular fossils in the human genome: Identification and analysis of processed and non-processed pseudogenes in chromosomes 21 and 22. Genome Res. 12: 272-280.
Hohjoh, H. and Singer, M.F. 1996. Cytoplasmic ribonucleoprotein complexes containing human LINE-1 protein and RNA. EMBO J. 15: 630-639.
-----. 1997a. Ribonuclease and high salt sensitivity of the ribonucleoprotein complex formed by the human LINE-1 retrotransposon. J. Mol. Biol. 271: 7-12.
-----. 1997b. Sequence-specific single-strand RNA binding protein encoded by the human LINE-1 retrotransposon. EMBO J. 16: 6034-6043.
Hubbard, T., Barker, D., Birney, E., Cameron, G., Chen, Y., Clark, L., Cox, T., Cuff, J., Curwen, V., Down, T., et al. 2002. The Ensembl genome database project. Nucleic Acids Res. 30: 38-41.
Jurka, J. 1997. Sequence patterns indicate an enzymatic involvement in integration of mammalian retroposons. Proc. Natl. Acad. Sci. 94: 1872-1877.
Kazazian, H.H., Jr. and Moran, J.V. 1998. The impact of L1 retrotransposons on the human genome. Nat. Genet. 19: 19-24.
Kenmochi, N., Kawaguchi, T., Rozen, S., Davis, E., Goodman, N., Hudson, T.J., Tanaka, T., and Page, D.C. 1998. A map of 75 human ribosomal protein genes. Genome Res. 8: 509-523.
Kenmochi, N., Yoshihama, M., Higa, S., and Tanaka, T. 2000. The human ribosomal protein L6 gene in a critical region for Noonan syndrome. J. Human Genet. 45: 290-293.
Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16: 111-120.
Kumar, S., Tamura, K., Jakobsen, I.B., and Nei, M. 2001. MEGA2: Molecular evolutionary genetics analysis software. Bioinformatics 17: 1244-1245.
Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860-921.
Leibold, D.M., Swergold, G.D., Singer, M.F., Thayer, R.E., Dombroski, B.A., and Fanning, T.G. 1990. Translation of LINE-1 DNA elements in vitro and in human cells. Proc. Natl. Acad. Sci. 87: 6990-6994.
Li, W.-H. 1997. Molecular Evolution. Sinauer Associates, Inc., Sunderland, MA.
Macaya, G., Thiery, J.P., and Bernardi, G. 1976. An approach to the organization of eukaryotic genomes at a macromolecular level. J. Mol. Biol. 108: 237-254.
Mager, W.H., Planta, R.J., Ballesta, J.G., Lee, J.C., Mizuta, K., Suzuki, K., Warner, J.R., and Woolford, J. 1997. A new nomenclature for the cytoplasmic ribosomal proteins of Saccharomyces cerevisiae. Nucleic Acids Res. 25: 4872-4875.
Makalowski, W., Zhang, J., and Boguski, M.S. 1996. Comparative analysis of 1196 orthologous mouse and human full-length mRNA and protein sequences. Genome Res. 6: 846-857.
Martin, S.L. 1991. Ribonucleoprotein particles with LINE-1 RNA in mouse embryonal carcinoma cells. Mol. Cell. Biol. 11: 4804-4807.
McCarrey, J.R., Kumari, M., Aivaliotis, M.J., Wang, Z., Zhang, P., Marshall, F., and Vandeberg, J.L. 1996. Analysis of the cDNA and encoded protein of the human testis-specific PGK-2 gene. Dev. Genet. 19: 321-332.
Mighell, A.J., Smith, N.R., Robinson, P.A., and Markham, A.F. 2000. Vertebrate pseudogenes. FEBS Lett. 468: 109-114.
Moran, J.V., Holmes, S.E., Naas, T.P., DeBerardinis, R.J., Boeke, J.D., and Kazazian, H.H., Jr. 1996. High frequency retrotransposition in cultured mammalian cells. Cell 87: 917-927.
Mouchiroud, D., D'Onofrio, G., Aissani, B., Macaya, G., Gautier, C., and Bernardi, G. 1991. The distribution of genes in the human genome. Gene 100: 181-187.
Olsen, M.A. and Schechter, L.E. 1999. Cloning, mRNA localization and evolutionary conservation of a human 5-HT7 receptor pseudogene. Gene 227: 63-69.
Pavlicek, A., Jabbari, K., Paces, J., Paces, V., Hejnar, J., and Bernardi, G. 2001. Similar integration but different stability of Alus and LINEs in the human genome. Gene 276: 39-45.
Pearson, W.R. 1997. Comparison of DNA sequences with protein sequences. Genomics 46: 24-36.
Planta, R.J. and Mager, W.H. 1998. The list of cytoplasmic ribosomal proteins of Saccharomyces cerevisiae. Yeast 14: 471-477.
Raue, H.A. and Planta, R.J. 1991. Ribosome biogenesis in yeast. Prog. Nucleic Acid Res. Mol. Biol. 41: 89-129.
Schluenzen, F., Tocilj, A., Zarivach, R., Harms, J., Gluehmann, M., Janell, D., Bashan, A., Bartels, H., Agmon, I., Franceschi, F., et al. 2000. Structure of functionally activated small ribosomal subunit at 3.3 angstroms resolution. Cell 102: 615-623.
Smit, A.F. 1999. Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr. Opin. Genet. Dev. 9: 657-663.
Smith, T.F. and Waterman, M.S. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147: 195-197.
Thompson, J.D., Higgins, D.G., and Gibson, T.J. 1994. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22: 4673-4680.
Uechi, T., Tanaka, T., and Kenmochi, N. 2001. A complete map of the human ribosomal protein genes: Assignment of 80 genes to the cytogenetic map and implications for human disorders. Genomics 72: 223-230.
Vanin, E.F. 1985. Processed pseudogenes: Characteristics and evolution. Annu. Rev. Genet. 19: 253-272.
Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith, H.O., Yandell, M., Evans, C.A., and Holt, R.A. 2001. The sequence of the human genome. Science 291: 1304-1351.
Wei, W., Gilbert, N., Ooi, S.L., Lawler, J.F., Ostertag, E.M., Kazazian, H.H., Jr., Boeke, J.D., and Moran, J.V. 2001. Human L1 retrotransposition: cis preference versus trans complementation. Mol. Cell. Biol. 21: 1429-1439.
Weiner, A.M. 1999. Do all SINEs lead to LINEs? Curr. Biol. 9: 842-844.
Wimberly, B.T., Brodersen, D.E., Clemons, W.M., Morgan-Warren, R.J., Carter, A.P., Vonrhein, C., Hartsch, T., and Ramakrishnan, V. 2000. Structure of the 30S ribosomal subunit. Nature 407: 327-339.
Wool, I.G. 1996. Extraribosomal functions of ribosomal proteins. TIBS 21: 164-165.
Wool, I.G., Chan, Y.L., and Gluck, A. 1995. Structure and evolution of mammalian ribosomal proteins. Biochem. Cell Biol. 73: 933-947.
Wootton, J.C. and Federhen, S. 1993. Statistics of local complexity in amino acid sequences and sequence databases. Comput. Chem. 17: 149-163.
Yeh, R.F., Lim, L.P., and Burge, C. 2001. Computational inference of homologous gene structures in the human genome. Genome Res. 11: 803-816.
Yoshihama, M., Uechi, T., Asakawa, S., Kawasaki, K., Kato, S., Higa, S., Maeda, N., Minoshima, S., Tanaka, T., Shimizu, N., et al. 2002. The human ribosomal protein genes: Sequencing and comparative analysis of 73 genes. Genome Res. 12: 379-390.
Yusupov, M.M., Yusupova, G.Z., Baucom, A., Lieberman, K., Earnest, T.N., Cate, J.H., and Noller, H.F. 2001. Crystal structure of the ribosome at 5.5 Å resolution. Science 292: 883-896.
Zinn, A.R., Page, D.C., and Fisher, E.M. 1993. Turner syndrome: The case of the missing sex chromosome. Trends Genet. 9: 90-93.

Received April 3, 2002; accepted in revised form August 12, 2002.
