Genome Research -- Harrison et al. 12 (2): 272

LETTER
Molecular Fossils in the Human Genome: Identification and Analysis of the Pseudogenes in Chromosomes 21 and 22

Paul M. Harrison, Hedi Hegyi, Suganthi Balasubramanian, Nicholas M. Luscombe, Paul Bertone, Nathaniel Echols, Ted Johnson, and Mark Gerstein¹

Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520-8114, USA

	ABSTRACT

We have developed an initial approach for annotating and surveying pseudogenes in the human genome. We search human genomic DNA for regions that are similar to known protein sequences and contain obvious disablements (i.e., mid-sequence stop codons or frameshifts), while ensuring minimal overlap with annotations of known genes. Pseudogenes can be divided into "processed" and "nonprocessed"; the former are reverse transcribed from mRNA (and therefore have no intron structure), whereas the latter presumably arise from genomic duplications. We annotate putative processed pseudogenes based on whether there is a continuous span of homology that is >70% of the length of the closest matching human protein (i.e., with introns removed), or whether there is evidence of polyadenylation. We have applied our approach to chromosomes 21 and 22, the first parts of the human genome completely sequenced, finding 190 new pseudogene annotations beyond the 264 reported by the sequencing centers. In total, on chromosomes 21 and 22, there are 189 processed pseudogenes, 195 nonprocessed pseudogenes, and, additionally, 70 pseudogenic immunoglobulin gene segments. (Detailed assignments are available at http://bioinfo.mbb.yale.edu/genome/pseudogene or http://genecensus.org/pseudogene.) By extrapolation, we predict that there could be up to ~20,000 pseudogenes in the whole human genome, with a little more than half of them processed. We have determined the main populations and clusters of pseudogenes on chromosomes 21 and 22. There are notable excesses of pseudogenes relative to genes near the centromeres of both chromosomes, indicating the existence of pseudogenic "hot-spots" in the genome. We have looked at the distribution of InterPro families and Gene Ontology (GO) functional categories in our pseudogenes. Overall, the families in both processed and nonprocessed pseudogene populations occur according to a power-law distribution similar to that found for the occurrence of gene families, with a few big families and many small ones. The processed population is, in particular, enriched in highly expressed ribosomal-protein sequences (~20%), which appear fairly evenly distributed across the chromosomes. We compared processed pseudogenes of different evolutionary ages, observing a high degree of similarity between "ancient" and "modern" subpopulations. This may be attributable to the consistently high expression of ribosomal proteins over evolutionary time. Finally, we find that the chromosome 22 pseudogene population is dominated by immunoglobulin segments, which have a greater rate of disablement per amino acid than the other pseudogene populations and are also substantially more diverged.

	INTRODUCTION


Pseudogenes are disabled copies of genes that do not produce a functional, full-length copy of a protein (Mighell et al. 2000; Vanin 1985). They are of two types: First, processed pseudogenes result from reverse transcription of messenger RNA transcripts followed by reintegration into genomic DNA (presumably in germ-line cells) and subsequent degradation with disablements (premature stop codons and frameshifts) (Vanin 1985). Second, nonprocessed pseudogenes result from duplication of a gene, followed by an initial disablement if the duplicated copy is not "useful" (Mighell et al. 2000). These then also accumulate further coding disablements.

The extent of the pseudogene population in the human genome is unclear. Estimates for the number of human genes range from ~22,000 to ~75,000 (Crollius et al. 2000; Ewing and Green 2000; Lander et al. 2001; Venter et al. 2001; Wright et al. 2001). From previous reports, it is thought that up to 22% of these gene predictions may be pseudogenic (Lander et al. 2001; Yeh et al. 2001). It is important to characterize the human processed and nonprocessed pseudogene populations because their existence interferes with gene identification and prediction (particularly nonprocessed pseudogenes or individual pseudogenic exons). They are also an important resource for the study of the evolution of protein families (see, e.g., studies on the human olfactory receptor subgenome [Glusman et al. 2001]).

Here, we have performed a detailed analysis of the pseudogene populations of human chromosomes 21 and 22, which have been sequenced contiguously to high quality. This is similar in spirit to previous surveys we have performed on pseudogenes and other genomic features in other organisms (Harrison et al. 2001; Gerstein 1997, 1998; Hegyi and Gerstein 1999). We have examined the main populations and clusters of pseudogenes for the two chromosomes. Patterns of distribution of both nonprocessed and processed pseudogenes indicate the existence of pseudogenic hot-spots in the human genome. In addition, we have estimated the total numbers and proportions of processed and nonprocessed pseudogenes in the whole human genome.

	RESULTS AND DISCUSSION


We annotated both processed and nonprocessed pseudogenes, as described in Figure 1 and the Methods section. The numbers of processed and nonprocessed pseudogenes that we find are summarized in Table 1. As shown in Figure 1, there are 60 Sanger pseudogene annotations in excess of those pseudogenes that we find for chromosome 22, and 20 Riken pseudogene annotations in excess for chromosome 21.

Figure 1 Flow diagram of the scheme for assignment of pseudogenes. The schematic shows the steps in assignment of pseudogenes. Ovals denote sources of data, and boxes denote operations. The term "ψg" denotes "pseudogene." The steps are as follows (described in detail in the text): (1) Six-frame BLAST. Comparison of the SWISSPROT database to chromosome 21 and 22 genomic sequences using BLAST (Altschul et al. 1997) to find potential pseudogenic protein homologies (with stop codons in them). (2) FASTA realignment. Realignment with the FASTA package (Pearson et al. 1997) of the top-matching sequence for the potential pseudogene to find the longest protein-homology fragment that has >1 disablement (frameshift or premature stop codon). (3) Minimize overlap with known genes. Overlap of putative pseudogenes with known human genes (from the Sanger Center annotations for chromosome 22 and the Riken Center annotations for 21) is minimized by choosing a suitable margin at the ends of pseudogenic homologies within which to ignore disablements. (4) Merge with existing ψg annotations. Pseudogene annotations from the Sanger and Riken centers are merged with our own set of annotations, and any of our annotations that duplicate them are deleted. (5) Date by finding closest matching Ensembl protein. For each pseudogene, the closest matching Ensembl human protein was found so that the pseudogene could be approximately dated. This was realigned to the genomic DNA sequence (backward step denoted by dotted arrow) and used as a replacement if it produced a longer pseudogene. (6) Assess for processing. All pseudogenes were then assessed for processing by searching for evidence of polyadenylation or extensive spans of protein homology in the absence of exon structure.

Table 1. Numbers of Pseudogenes and Genes

Processed Pseudogenes on Chromosomes 21 and 22

We find a total of 189 processed pseudogenes (77 on chromosome 21, 112 on chromosome 22) (Table 1). The total for chromosome 22 is a combination of our own annotations and Sanger Center annotations for pseudogenes, whereas the total for chromosome 21 is a combination of our annotations and those obtained from the Riken genome-sequencing center. The number of processed pseudogenes for chromosome 22 relative to chromosome 21 is rather high (proportion = 112/77 ~1.45). When we remove the additional Riken and Sanger Center pseudogenes, the density of processed pseudogenes is still moderately higher for chromosome 22 relative to 21 (proportion = 83/65 ~1.30). The difference in the numbers of processed pseudogenes for the two chromosomes is intriguing and may be related to the accessibility of genomic DNA for reintegration of a processed sequence, with chromosomes with more genes having more accessible genomic DNA because of transcriptional activity.

Nonprocessed Pseudogenes on Chromosomes 21 and 22

For the counting of nonprocessed pseudogenes, we set aside the 70 κ and λ immunoglobulin gene segments on chromosome 22 as a separate population. We then have a total of 195 nonprocessed pseudogenes (72 on chromosome 21, 123 on 22). Considering all annotations, the number of nonprocessed pseudogenes on chromosome 22 relative to the number on chromosome 21 is higher (proportion = 123/72 ~1.71). As described above for processed pseudogenes, when those pseudogenes that arise only from Riken and Sanger Center annotations are excluded, the ratio of pseudogene numbers between the two chromosomes is more modest (proportion ~1.28), reflecting to a lesser extent the corresponding relative gene density between the chromosomes (~2.13-2.24 for all sets of gene annotations).

Extrapolation to the Whole Human Genome

Based on our numbers of pseudogenes for chromosomes 21 and 22, we can tentatively extrapolate to derive estimates of the pseudogene numbers in the whole human genome.

Using the total number of processed pseudogenes for either chromosome 22 or 21, we estimate the total number of processed pseudogenes in the human genome (see Table 1 footnote). The predicted ranges are ~8700-9400 (based on chromosome 22 data) and ~6100-6600 (chromosome 21) processed pseudogenes in the whole human genome. (The lower number in the range arises from using the total human genome size given by Lander et al. 2001; the higher from the size given by Venter et al. 2001.)
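The scaling behind such ranges can be sketched as a simple proportion. The Table 1 footnote is not reproduced in this text, so the uniform-density assumption, the function name, and the sizes in the usage lines below are illustrative placeholders and not the values used for the published estimates.

```python
def extrapolate_count(count_on_chromosome, chromosome_size_mb, genome_size_mb):
    """Scale a per-chromosome count to the whole genome.

    Assumes (hypothetically) that pseudogene density is uniform across the genome.
    """
    if chromosome_size_mb <= 0:
        raise ValueError("chromosome size must be positive")
    density_per_mb = count_on_chromosome / chromosome_size_mb
    return density_per_mb * genome_size_mb


# Placeholder numbers chosen only to show the call; they are not from the paper.
low_estimate = extrapolate_count(100, 50.0, 2000.0)   # smaller genome-size estimate
high_estimate = extrapolate_count(100, 50.0, 2200.0)  # larger genome-size estimate
```

Using two genome-size estimates yields a lower and an upper bound, which is how a single chromosome count gives rise to a range.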

Also, as for processed pseudogenes, we estimate a predicted range of ~9600-10,400 nonprocessed pseudogenes in the human genome, extrapolating from the chromosome 22 data. Using the gene-poor chromosome 21, a much lower estimate is obtained (~5700-6200). Arguably, we would expect more of a relationship between nonprocessed pseudogene density and gene density than between processed pseudogene density and gene density, as the former type of pseudogenes arises from duplication of the genomic DNA. One could modify such estimates to account for lower gene density on chromosome 22 than on other human chromosomes (Dunham et al. 1999a; Lander et al. 2001; Venter et al. 2001), so the number of nonprocessed pseudogenes in the whole human genome may be even higher. However, as noted above, disregarding the κ and λ immunoglobulin variable-region gene segments, there does not seem to be a clear relationship between gene density and nonprocessed pseudogene density for these chromosomes.

Overlap with Known Sequencing Center (Riken/Sanger) Genes

During our pseudogene annotation procedure, we find that some potential pseudogenes overlap known genes; this overlap may be due to sequence alignment artifact or to a real phenomenon of discarded fragments of disabled protein homology near the extant parts of genes. As part of our assignment procedure, the allowed percentage of known gene exons overlapped by our pseudogene annotations is <5% (Methods section). Known genes are those labeled as "known" in either the Sanger or Riken annotations, that is, having a previously characterized genomic structure. For exons of known Sanger genes, this percentage of overlapped exons is 3.3%, and for Riken known gene exons, it is 2.6%. Similar levels of this overlap for exons (~3%) are found for all other (predicted) gene annotations from the Riken and Sanger centers (Hattori et al. 2000; Dunham et al. 1999a).

Overlap with Genes Predicted by GenomeScan

Genes predicted by the program GenomeScan (Yeh et al. 2001) were studied as a larger and more uniformly predicted set of genes than the gene annotations available from the Sanger and Riken centers. We examined the overlap of the pseudogene data sets with genes predicted by GenomeScan (Table 1). For the GenomeScan-predicted exons, on 21 and 22, there is only 5.2% and 6.2% overlap, respectively, of exons with pseudogenes.

Main Populations and Clusters of Chromosome 21 and 22 Pseudogenes

We now focus on the main populations and clusters of pseudogenes on chromosomes 21 and 22. To aid with our characterization of these, we determined the prevalence of InterPro motifs (Apweiler et al. 2000) in the pseudogene sets and used these to assign GO functional classes (Ashburner et al. 2000) for the processed and nonprocessed pseudogene populations (Methods section).

Power-Law Behavior of the Occurrence of Families for Pseudogenes

We examine the distribution of InterPro protein families in the processed and nonprocessed pseudogenes. For the overall groups of processed pseudogenes, nonprocessed pseudogenes and total pseudogenes, there is a power-law relationship between the number of InterPro families and the size of a family (the number of members in a family) (Fig. 2), if one removes the outliers that are labeled. These outliers are the zinc finger motif, which occurs in multiple copies (of up to 12) in a sequence, and the collagen triple helix repeat, which also occurs multiple times for the same sequence. There is also a point on the plot for the immunoglobulin (Ig) domain (which occurs in the Ig variable-region gene segments). That is, the trend can be fitted to a straight line when plotted on a log-log scale. This relationship has also been observed for genes in eukaryotes and other features of genomes (J. Qian et al. 2001). As described in the Methods section, we used the InterPro family assignments for the pseudogenes to assign a GO functional class when possible. However, the distribution of GO functional classes in the pseudogenes does not show the same sort of linearity on a log-log plot (data not shown).
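To make the log-log fit concrete, the following is a minimal sketch of an ordinary least-squares line fitted to log10(number of families) against log10(family size). The function name and the toy counts are hypothetical, and any outliers (such as those labeled in Fig. 2) would have to be removed by the caller before fitting.

```python
import math


def fit_power_law(family_sizes, family_counts):
    """Fit log10(count) = intercept + slope * log10(size) by least squares."""
    xs = [math.log10(size) for size in family_sizes]
    ys = [math.log10(count) for count in family_counts]
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (size, count) pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("family sizes must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data that follow count = 64 * size**-2 exactly (illustrative only).
slope, intercept = fit_power_law([1, 2, 4, 8], [64, 16, 4, 1])
```

A negative slope corresponds to the pattern described in the text: many small families and a few large ones.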

Figure 2 The relationship between the number of InterPro families for pseudogenes and their sizes shows a power-law behavior. The number of InterPro families is plotted vs. the size of a family on a log-log scale. "All pseudogenes" (processed and nonprocessed combined) are plotted with a filled diamond, processed pseudogenes with a cross, and nonprocessed pseudogenes with an unfilled circle. The straight line indicates the best least-squares linear fit to all points for all pseudogenes, except for the outliers that are labeled in the plot. This is indicative of a power-law relationship between the size of a protein family and the number of families that have this size.

The Largest Group of Processed Pseudogenes Is in the Ribosomal Class

Based on GO function classifications, the most common group of proteins in the processed pseudogenes is the ribosomal proteins, comprising 22% (42 out of 189 from GO classification) (Fig. 3). Over half of these are from the large subunit of the ribosome (60%, 25/42). This is close to the proportion of large subunit ribosomal proteins in the human ribosome (57%), implying that ribosomal proteins are evenly sampled for processed pseudogene formation. As a fraction of the estimates for the overall number of processed pseudogenes in the human genome (Table 1), this means there may be >1800 ribosomal-protein processed pseudogenes in the complete genome. In comparison, for ribosomal protein genes, there is only one actual gene for a ribosomal protein (RPL3) on chromosome 22, and no ribosomal protein gene on chromosome 21 (Uechi et al. 2001). Interestingly, there is a nonprocessed pseudogene for the ribosomal protein RPL17, on chromosome 22 (named bK440B3.1), although the "live" homolog for this pseudogene is on chromosome 18. The processed pseudogenes for the ribosomal proteins appear to be somewhat evenly distributed on the chromosomes (Fig. 4), although the sample is rather small for the two chromosomes surveyed, at the moment.

Figure 3 Distribution of genes and pseudogenes on chromosomes 21 and 22 into GO functional categories. For each table within the figure, the total is split up into the number for chromosome 21 plus number for chromosome 22 and is sorted in decreasing order of this total. Each GO functional category is given a different color. The table at top left lists the top five GO functional categories for genes. The table at top right lists the top five GO categories for pseudogenes (processed and nonprocessed combined). This is split up into processed and nonprocessed; processed pseudogenes are further divided into ancient and modern sets, as described in the text. The number of immunoglobulin pseudogenic fragments is lumped in with the nonprocessed population. The GO functional categories in the figure are as follows (with GO numbers): (1) "Transcription factor," GO:0003700. For processed and nonprocessed pseudogenes these all arise from zinc finger C2H2 type. (2) "Other DNA-binding," all DNA-binding (GO:0003677) except "Transcription factor." The proteins found for processed pseudogenes in this class are as follows: (IPR000910) HMG1/2 (high mobility group) box; (IPR000210) BTB/POZ domain; (IPR000637) HMG-I and HMG-Y DNA-binding domain (A + T-hook); (IPR002119) Histone H2A; (IPR000079) High mobility group proteins HMG14 and HMG17; (IPR001514) RNA polymerases D/30 to 40 Kd subunits; and (IPR001005) Myb DNA binding domain. (3) "Nucleotide-binding," GO:0000166. (4) "Nucleic-acid binding," GO:0003676 (this class arises from motifs or domains that cannot be classified specifically as "DNA-binding" or "RNA-binding"). (5) "Kinase," GO:0016301. (6) "Ribosomal protein," GO:0003735. (7) "Receptor," GO:0004930. (8) "Transferase," GO:0016740. (9) "RNA binding," GO:0003723. 
The proteins found for processed pseudogenes in this class are as follows: (IPR001014) Ribosomal L23 protein (also classed as ribosomal protein); (IPR001410) DEAD/DEAH box helicase; (IPR002942) S4 domain; (IPR001892) Ribosomal protein S13 (twice, also classed as ribosomal protein). The protein found for nonprocessed pseudogenes in this class is: (IPR001965) PHD finger. (10) "Oxidoreductase," GO:0016491. (11) "Cell cycle regulator," GO:0003750.

Figure 4 Distribution of pseudogene and gene densities for chromosomes 21 and 22. On the left are panels for chromosome 21, and on the right are panels for chromosome 22. For each chromosome, the panels are genes predicted by GenomeScan and the genome sequencing centers (Riken for chromosome 21 and Sanger for 22) (top), nonprocessed pseudogenes (middle), and processed pseudogenes (bottom). Each bin is named x for the interval x to x + 5 Mb. The first bin contains ~300,000 bases that are beyond the centromere (containing two genes and six pseudogenes). The final bin ends at the end of the telomere. The bins for the pseudogenic hot-spots referred to in the text are asterisked. For processed pseudogenes, we have added a representation of the distribution of ribosomal-protein processed pseudogenes along the chromosomes at the bottom of the panels, with a dot for each ribosomal-protein pseudogene at its approximate position along the chromosome.

Ancient and Modern Processed Pseudogenes

We divided the processed pseudogene population by age into approximately equal-sized groups of ancient and modern processed pseudogenes using the median percentage identity value (which is 79%). This is based on the similarity of the pseudogenes to the closest matching human protein in the Ensembl database (Birney et al. 2001). Ancient pseudogenes have <79% sequence identity (see Fig. 5) to their closest matching human protein. The remainder comprise modern pseudogenes. We examined the ancient and modern processed pseudogene populations for their prevalent GO functional classes (Methods section). Ancient and modern processed pseudogenes do not have any overwhelmingly obvious prevalences; the "ribosomal structural protein" functional class dominates both sets. Transcription factors and other DNA-binding proteins tend to be in the ancient category (Fig. 3).
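The median-based split described here can be sketched as follows. The pseudogene names and identity values are invented for illustration; only the rule itself (below the median identity is "ancient", the remainder "modern") comes from the text.

```python
from statistics import median


def split_by_age(identity_by_pseudogene):
    """Split pseudogenes into ancient (< median identity) and modern (the rest)."""
    cutoff = median(identity_by_pseudogene.values())
    ancient = sorted(
        name for name, pid in identity_by_pseudogene.items() if pid < cutoff
    )
    modern = sorted(
        name for name, pid in identity_by_pseudogene.items() if pid >= cutoff
    )
    return cutoff, ancient, modern


# Hypothetical percent identities to the closest matching human protein.
example = {"pg1": 62.0, "pg2": 79.0, "pg3": 91.5, "pg4": 70.0, "pg5": 85.0}
cutoff, ancient, modern = split_by_age(example)
```

Because ties at the median fall into the modern group, the two groups are only approximately equal in size, as the text notes.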

Figure 5 Distribution of the percent identity to closest-matching Ensembl proteins for processed, nonprocessed, and immunoglobulin gene segment pseudogenes. The percentage identity to the closest matching Ensembl human protein is shown for each pseudogene population. The bin named x contains every value y such that x < y < (x + 10)%. The panels are processed pseudogenes (top), nonprocessed pseudogenes (middle), and immunoglobulin pseudogenic gene segments (bottom).

Main Nonprocessed Pseudogene Populations

We examined the nonprocessed pseudogene populations for chromosomes 21 and 22 for their prevalent functional classes and compared them with the classes for genes predicted using GenomeScan (Yeh et al. 2001). The total number of InterPro motif assignments and consequently GO-class assignments is much smaller for the nonprocessed pseudogenes in comparison with the gene totals and those for processed pseudogenes (Fig. 3). There is little similarity among all three lists (genes, processed pseudogenes, and nonprocessed pseudogenes for chromosomes 21 and 22 combined); only the "transcription factor" functional class occurs in the top five of all three (Fig. 3). The "receptor" class is common to the top five of both processed and nonprocessed pseudogenes. The nonprocessed pseudogenes share three of their top five functional categories with the top five for the GenomeScan-predicted genes. On a related note, in general, a large proportion of the nonprocessed pseudogenes are close to a homolog along the chromosomes: 28% have at least one homolog within 0.5 Mb and 31% have at least one within 1.0 Mb.

Immunoglobulin Gene Segments

There are a total of 70 κ and λ immunoglobulin (Ig) variable-region pseudogenic gene segments (65 λ, 5 κ) in the chromosome 22 loci for these gene segments. We find only an additional two (λ) pseudogenic gene segments relative to those already annotated by the Sanger Center (included in this total). Ig variable-region gene segments have a higher rate of nonsynonymous substitution in the germ line relative to synonymous substitution (Nei et al. 1997). We examined the variable-region Ig pseudogenic gene segments for the total number of disablements detected relative to the closest matching human protein sequences from the Ensembl database (Birney et al. 2001). We find a moderately increased rate of disablements per amino acid relative to the corresponding overall rate in pseudogenic sequences: 3.3% in Ig segments (106/3253) relative to 2.5% overall (1704/67965). This difference is statistically significant (the chance that it would arise randomly is P <0.002, assuming normal distribution statistics), and is unaffected by removing the five κ segments. This increased rate of disablement is consistent with the increased nonsynonymous substitution rate for Ig variable-region loci referred to in the literature (Nei et al. 1997). The nonprocessed pseudogene population as a whole has a slightly higher rate of disablement than the processed one, 2.6% (837/32843) versus 2.4% (759/31868). The degree of identity between the Ig pseudogenic gene segments and their closest matching Ensembl human protein sequences is also much lower on average (59.2% [±13.9]) than for either processed or nonprocessed pseudogenes (72.4% [±20.4] and 75.1% [±19.1], respectively); these latter two categories also have similar-shaped distributions (Fig. 5).

Pseudogenic Hot-Spots

The density of genes or pseudogenes is defined as their number per interval of DNA. We have illustrated the trends for the largest interval for which we obtain any meaningful separation along the chromosomes (Fig. 4). We searched for the most notable differences between the pseudogene density (either processed or nonprocessed) and the gene density, where they are observed for both the GenomeScan genes and the Riken/Sanger complete sets of gene annotations. We find that the most notable increased density for both processed and nonprocessed pseudogenes relative to the gene density is near the centromeres (in the first 5 Mb; difference in density, ΔD > 0.10; Fig. 4). The most notable excess in gene density relative to the pseudogene density is at the telomere of chromosome 21, where there are few processed pseudogenes (ΔD = −0.13); this area contains predicted collagen genes and nonprocessed pseudogenes. It will be interesting to see if regions of increased pseudogene density in the absence of increased gene density, or pseudogenic hot-spots, can be found on a larger scale in the total human genome. In general, pseudogenes in such regions may be more detectable because they take longer to be degraded; this may occur, perhaps, through local variations in DNA duplication rate relative to the rate of loss of genomic DNA (Petrov 2001).

The G + C content of genomic DNA is related to gene content, with G + C-rich regions having elevated numbers of genes relative to G + C-poor regions (Dunham et al. 1999; Lander et al. 2001; Venter et al. 2001). There does not appear to be any obvious relationship between pseudogene content and G + C content that can be readily decoupled from the known link between gene density and G + C content. On chromosome 21, the most G + C-poor region, which is between 5 and 12 Mb from the start of the sequence, has low G + C (35%) compared with the rest of the chromosome (43%), and has low pseudogene content as well as low gene content (Hattori et al. 2000); on chromosome 22, the most notable G + C-poor region, the 2 Mb closest to the centromere (<40% G + C; Dunham et al. 1999), has elevated pseudogene content relative to gene content (Fig. 5). Evidently, this topic will be amenable to in-depth study with a larger data set of pseudogenes derived from the whole human genome.

	CONCLUSIONS


We have derived a procedure for the assessment of processed and nonprocessed pseudogenes in genomic DNA by looking for disabled protein homologies while minimizing the overlap with known genes; using this, we have predicted the pseudogene populations of chromosomes 21 and 22, finding 180 pseudogenes additional to existing available annotations. Also, we have tentatively extrapolated that there are up to ~9000 processed and ~10,000 nonprocessed pseudogenes in the human genome. Up to 6% of annotated exons in these two chromosomes may be pseudogenic. Based on GenomeScan gene predictions, modified totals for the actual number of genes on chromosomes 21 and 22 are given in Table 1.

Other types of protein-related pseudogenes that are not accounted for in the present work are semiprocessed (pseudo)genes (arising from aberrant mRNAs that contain an intron) and pseudogenes that produce transcripts (but not protein chains). Surveys of the literature by the authors indicate, however, that the occurrence of either of these is relatively rare and is not likely to affect gene prediction significantly. From examination of the distribution of pseudogenes along chromosomes 21 and 22, there is some evidence of the existence of pseudogenic hot-spots; this will remain to be confirmed upon examination of the whole human genome. This study serves as preparation for such a whole-genome survey.

The density of pseudogenes relative to genes derived in this study seems very high (one processed and one nonprocessed pseudogene for every ~four genes), with a total of ~390 found in the 70 Mb of chromosomes 21 and 22, with >97% noncoding DNA. By comparison, a moderately-sized complement of > ~1100 verified pseudogenes (corresponding to ~19,000 genes) was found in the 100-Mb worm genome (Harrison et al. 2001), which has ~70% noncoding DNA. Estimates for the other eukaryote genomes are at present unavailable, although for the fly genome (which has 120 Mb of euchromatic DNA with ~80% noncoding DNA), a survey by the authors indicates ~100 pseudogenes (P. Harrison and M. Gerstein, submitted). There appears to be no obvious relationship between the proteome size or genome size or the amount of noncoding DNA and the number of pseudogenes for worm, fly, and human. Contributing factors would include the rate of gene duplication, the occurrence of transposable elements, and the overall rate of genomic DNA loss (Petrov 2001). Also, for prokaryotes, there does not appear to be a clear relationship between the amount of noncoding DNA and the number of pseudogenes. There are two reported cases of prokaryotic genomes with high proportions of noncoding DNA: Rickettsia prowazekii has 24% noncoding DNA and 12 pseudogenes (Andersson et al. 1998); whereas Mycobacterium leprae has 51% noncoding DNA, 27% of which is composed of a population of 1100 pseudogenes (corresponding to a proteome of 1604 coding sequences) (Cole et al. 2001). Evidently, surveys of more eukaryote and prokaryote genomes are required to give a fuller picture.

	METHODS


Determination of a Set of Pseudogenes for Human Chromosomes 21 and 22

We developed an initial scheme for identifying pseudogenes in human genomic DNA; this is depicted as a flow diagram in Figure 1. Genomic annotation is an inherently dynamic process in which it is necessary to make use of many different sources of data (represented by ovals in the flow diagram), which are not updated in a concerted fashion. (Detailed files listing our assignments are available at http://bioinfo.mbb.yale.edu/genome/pseudogene and http://genecensus.org/pseudogene.)

Six-Frame BLAST

Using the BLAST alignment package (Altschul et al. 1997) with repeats masked using RepeatMasker (Bedell et al. 2000), we compared the SWISSPROT database of protein sequences (version 40) (Bairoch and Apweiler 2000) with the complete available sequences of human chromosomes 21 and 22 in six frames (Dunham et al. 1999b; Hattori et al. 2000). (Chromosome 21 was downloaded from GenBank on August 22, 2000; chromosome 22 is the May 9, 2000, version.) We then took all of the significant protein matches to the genomic DNA (e-value < 1 × 10^-4), and reduced them for mutual overlap by picking matches in decreasing order of significance and deleting any matches that overlap substantially with a picked match (i.e., by more than 10 amino acids).
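The overlap-reduction step can be sketched as a greedy selection. This is a minimal illustration and not the authors' code: the Match record, the inclusive genomic coordinates, and the conversion of the 10-amino-acid threshold into 30 nucleotides are assumptions made here for the example, and strand is ignored.

```python
from typing import NamedTuple


class Match(NamedTuple):
    protein: str   # hypothetical identifier of the matching protein
    start: int     # genomic start, inclusive
    end: int       # genomic end, inclusive
    evalue: float


def overlap_length(a, b):
    """Number of genomic positions shared by two matches (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def reduce_overlaps(matches, max_evalue=1e-4, max_overlap_nt=30):
    """Keep matches in decreasing order of significance, dropping any that
    overlap an already picked match by more than max_overlap_nt."""
    significant = [m for m in matches if m.evalue < max_evalue]
    picked = []
    for candidate in sorted(significant, key=lambda m: m.evalue):
        if all(overlap_length(candidate, p) <= max_overlap_nt for p in picked):
            picked.append(candidate)
    return picked


# Invented hits: B overlaps A by 51 nt and is dropped; D fails the e-value cutoff.
hits = [
    Match("A", 100, 400, 1e-30),
    Match("B", 350, 700, 1e-12),
    Match("C", 390, 900, 1e-8),
    Match("D", 1000, 1200, 1e-2),
]
kept = reduce_overlaps(hits)
```

Note that C is retained even though it overlaps B, because B was never picked; only overlap with picked matches counts under this reading of the procedure.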

FASTA Realignment

For each match, we then realigned the matching SWISSPROT sequence to the same region of genomic DNA using the FASTA program (Pearson et al. 1997), expanded on either side by the length of the matching sequence (in nucleotides). From these alignments, we picked any matching sequences that had more than one disablement (either a frameshift or a premature stop codon) as a potential pseudogene. At this stage, these potential sequences were filtered for low complexity by comparing with the same SWISSPROT database but using masking with SEG (settings "25 3.0 3.3" and "45 3.4 3.75") (Wootton and Federhen 1996).

Minimize Overlap with Known Genes

To ensure that we are not considering disablements at the end of alignment subsequences that are artifactual, we examined how these potential pseudogenic sequences overlapped with the known genes on human chromosome 22 (Dunham et al. 1999a). We calculated the position of the disablement that is nearest the middle of each potential pseudogenic sequence and the distance (d) from this particular disablement to the closest end of the sequence. We then determined an appropriate margin (m) for the ends of the pseudogenic sequences, which allows us to discard pseudogenes so that the total set of pseudogenes overlaps only a small proportion (<5%) of the set of known gene exons. Following this criterion, all pseudogenes were discarded if d < m, where m = 16 residues. This gave us a total of 3.3% of all known gene exons that are overlapped by our set of pseudogene predictions.
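The margin test can be written down compactly. The sketch below assumes that disablement positions are 0-based residue indices within the potential pseudogenic sequence and that d is counted in residues from the chosen disablement to the nearer end; these conventions, and the function names, are illustrative assumptions.

```python
def distance_to_nearest_end(length, disablement_positions):
    """Distance d from the most central disablement to the closest sequence end."""
    middle = (length - 1) / 2
    # Pick the disablement nearest the middle of the sequence.
    central = min(disablement_positions, key=lambda p: abs(p - middle))
    return min(central, length - 1 - central)


def passes_margin(length, disablement_positions, m=16):
    """Keep the prediction unless d < m (m = 16 residues in the text)."""
    return distance_to_nearest_end(length, disablement_positions) >= m
```

For example, a 100-residue match whose only disablements sit at positions 3 and 95 would be discarded under these conventions, because even the more central of the two lies within 16 residues of an end.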

Merging

At this stage, the pseudogene predictions for chromosome 22 were merged with previous pseudogene predictions provided by the Sanger Center (Dunham et al. 1999b); those for chromosome 21 were merged with data downloaded from the Riken Sequencing Center Web site (Hattori et al. 2000). Where a predicted pseudogene from the present data set was duplicated by a Riken or Sanger Center pseudogene, the Riken/Sanger Center annotation was chosen in preference. The nature of the overlap of the data sets is shown in box 4 of the flow diagram (Fig. 1).

Date by Closest Match Ensembl Protein

For each pseudogene sequence, we searched through the most current version of the Ensembl database (http://www.ensembl.org; Birney et al. 2001) of human coding sequences to find the closest known human sequence homolog. If a matching sequence from Ensembl was found (only 84% of Sanger Center annotations have a corresponding Ensembl database protein; 92% of our own pseudogenes have a match), each matching Ensembl sequence was then realigned, using the FASTA program (Pearson et al. 1997), to the region of genomic DNA corresponding to the pseudogene sequence but expanded on either side by the length of the Ensembl sequence (in nucleotides). Any pseudogene that was lengthened as a result of this realignment was replaced with the new Ensembl-derived sequence, and the new updated list of pseudogenes was reduced for overlap, as above. We then rechecked through the complete set of pseudogenes for overlap with known genes as described previously.

Assess for Processing

We inspected the genomic DNA around the potential pseudogenes for any evidence of exon structure from existing Riken or Sanger Center gene and pseudogene parsing, from gene annotations made using the program GenomeScan (Yeh et al. 2001), or from evidence of exon structure from our own pseudogene analysis. For this third option, we considered an intron to occur if the gap in the sequence was >126 nt (from inspection of the distribution of intron lengths for known genes of chromosome 22, only 5% of introns would be shorter than this). From our visual curation, we made a decision as to whether the predicted pseudogene fragment was part of a larger exon structure. We also checked for pseudogenic duplications of any single-exon genes.
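The gap-based intron test in the third option might be sketched as follows. The representation of an alignment as a list of (start, end) genomic blocks is an assumption, as is the function name; the visual curation step described above is not modeled.

```python
def has_intron_like_gap(blocks, min_intron_nt=126):
    """True if any gap between consecutive aligned blocks exceeds the threshold.

    blocks: (start, end) genomic intervals in nt, end exclusive (assumed layout).
    """
    ordered = sorted(blocks)
    return any(
        next_start - prev_end > min_intron_nt
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    )
```

Under this sketch, two blocks separated by 200 nt would be flagged as carrying intron-like structure, whereas a 50-nt gap would be treated as part of a continuous alignment.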

For any potential pseudogene that does not have such evidence of exon structure, we used two lines of evidence for assessing whether it was processed:

(1)	We labeled as candidate-processed pseudogenes all those matches that comprise >70% of the length of the closest-matching human Ensembl or SWISSPROT database sequence in a continuous segment. This criterion was used previously by Venter et al. (2001). We checked over the list for matches to known single-exon human genes (which, for example, comprise ~6% of the known genes on chromosome 22) that do not have any evidence for processing from the analysis for a polyadenine tail (see below).
	We checked the utility of this criterion for a set of 46 previously identified primate processed pseudogenes that we collated from GenBank (Benson et al. 2000) in August 2000. Almost all of these processed pseudogenes (42/46, 91%) were detected by this criterion.
(2)	Recently processed pseudogenes can be identified through the existence of a polyadenine tail of at least 15-20 nt that is preceded by a polyadenylation signal (AATAAA), usually about 15-20 nt upstream. We searched a 1000-nt region that was 3' to the pseudogene homology segment, with a sliding window of 50 nt for a region of elevated polyadenine content (>30 nt), and picked the most adenine-rich 50-nt segment as the most likely candidate. An interval of 1000 nt was used because of the possible existence of 3'-untranslated regions (3'-UTRs); 90% of 3'-UTRs are of length <942 nt (Makalowski et al. 1996). In addition, we searched in the same 1000-nt region for candidate AATAAA polyadenylation signals and checked whether they were upstream of the candidate polyadenine tail site. When a polyadenylation signal was found within 50 nt upstream of the candidate polyadenine tail, the pseudogene was labeled as a "class 1" candidate processed pseudogene, and as "class 2" if the signal was found between 51 and 100 nt upstream. The latter class may arise if there is a genomic DNA insertion event; in practice, very few of them were found (eight in total). All other pseudogenes with a detected candidate polyadenine tail were labeled as "class 3." Pseudogenes that have class 3 evidence alone account for only 17% of the candidate-processed pseudogenes; their removal does not change the main results and trends reported in the paper.
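Criterion (2) can be made concrete with a short Python sketch. It is an illustration under stated assumptions rather than the code used for the analysis: the downstream sequence is supplied 5' to 3' on the pseudogene strand, the tail site is taken as the start of the most adenine-rich 50-nt window (earliest window on ties), and the signal distance is measured from the start of the nearest upstream AATAAA to that window start. The function names are hypothetical.

```python
def find_polya_candidate(downstream, window=50, min_a=31, search_len=1000):
    """Start of the most A-rich window with more than 30 adenines, or None."""
    region = downstream[:search_len].upper()
    best_start, best_count = None, min_a - 1
    for start in range(len(region) - window + 1):
        count = region[start:start + window].count("A")
        if count > best_count:  # strict, so the earliest best window wins ties
            best_start, best_count = start, count
    return best_start


def classify_polya(downstream, signal="AATAAA", search_len=1000):
    """Return 1, 2, or 3 for the polyadenylation class, or None if no tail."""
    region = downstream[:search_len].upper()
    tail = find_polya_candidate(region, search_len=search_len)
    if tail is None:
        return None
    # Nearest signal that lies entirely upstream of the candidate tail site.
    pos = region.rfind(signal, 0, tail)
    if pos == -1:
        return 3
    distance = tail - pos  # assumption: measured from signal start to tail start
    if distance <= 50:
        return 1
    if distance <= 100:
        return 2
    return 3


# Synthetic 1000-nt examples (not real genomic sequence).
near_signal = "C" * 200 + "AATAAA" + "C" * 14 + "A" * 40 + "C" * 740
far_signal = "C" * 100 + "AATAAA" + "C" * 74 + "A" * 40 + "C" * 780
no_signal = "C" * 300 + "A" * 40 + "C" * 660
no_tail = "ACGT" * 250
```

Because the window start can precede the first adenine of the tail, distances computed this way are somewhat shorter than the true signal-to-tail spacing; a stricter implementation might locate the first base of the adenine run inside the window.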

There is considerable overlap between criteria (1) and (2) for detecting processing; about half (52%) of the candidate-processed pseudogenes assigned by criterion (1) are also classified as having polyadenylation. A quarter (26%) of the candidate-processed pseudogenes only have evidence for polyadenylation (criterion [2]). Of the set of previously annotated primate-processed pseudogenes, downloaded from GenBank (see [1] above), 54% are detected to have evidence of polyadenylation.

All remaining pseudogene sequences were categorized as nonprocessed; some of these were merged into single nonprocessed pseudogenes from visual examination of the sequence fragments involved.
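Taken together, the assignment logic described in this section can be summarized as a simple decision function. This is a minimal sketch under assumptions: the argument names are hypothetical, the length of the continuous aligned segment and of the closest-matching protein are both given in residues, and the manual checks (visual curation, single-exon gene review, merging of fragments) are not represented.

```python
def is_candidate_processed(aligned_span_aa, parent_len_aa, polya_class,
                           has_exon_structure, min_fraction=0.70):
    """Combine criteria (1) and (2) for a pseudogene without manual review.

    polya_class: 1, 2, or 3 if a candidate polyadenine tail was detected,
    otherwise None.
    """
    if has_exon_structure:
        return False  # evidence of exon structure rules out processing
    covers_parent = aligned_span_aa / parent_len_aa > min_fraction  # criterion (1)
    has_polya = polya_class is not None                             # criterion (2)
    return covers_parent or has_polya
```

Anything for which this function returns False would fall into the nonprocessed category in this simplified view.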

Analysis for Protein Function

Each pseudogene sequence was run through the InterPro sequence motif assignment package InterProScan (Apweiler et al. 2000). Functional categories were then assigned using the GO classification (Ashburner et al. 2000) with a list of correspondences between InterPro motifs and GO functional classes that is available from the InterPro Web site (http://www.ebi.ac.uk/interpro). A substantial proportion (~75%) of the pseudogene annotations were assigned to an InterPro motif, and ~45% could be mapped onto GO function classifications. The GO categories given by InterPro were merged into a higher level if they were judged to be too specific; for example, all "receptors" are merged into one higher GO category. These proportions are at a level that is within the range of proportions of proteomes that have automatic reliable functional assignment in the GeneQuiz database (Hoersch et al. 2000), a well-known standard of automated functional classification. We did not try to map pseudogenes that do not have InterPro motifs to the GO classification because this introduces an extra degree of judgmental bias.

	ACKNOWLEDGMENTS

We thank Ru-Fang Yeh and Chris Burge (MIT) for providing GenomeScan gene predictions and Sam Karlin (Stanford) for discussions. M.G. acknowledges support from the NIH Protein Structure Initiative (P50 grant GM62413-01).

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.

	FOOTNOTES

¹ Corresponding author.

E-MAIL Mark.Gerstein@yale.edu; FAX (360) 8387861.

Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.207102.

	REFERENCES

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.
Andersson, S.G., Zomorodipour, A., Andersson, J.O., Sicheritz-Ponten, T., Alsmark, U.C., Podowski, R.M., Naslund, A.K., Eriksson, A.S., Winkler, H.H., and Kurland, C.G. 1998. The genome sequence of Rickettsia prowazekii and the origin of mitochondria. Nature 396: 133-140.
Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Birney, E., Biswas, M., Bucher, P., Cerutti, L., Corpet, F., Croning, M.D. 2000. InterPro - an integrated documentation resource for protein families, domains and functional sites. Bioinformatics 16: 1145-1150.
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T. 2000. Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25: 25-29.
Bairoch, A. and Apweiler, R. 2000. The SWISSPROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28: 45-48.
Bedell, J.A., Korf, I., and Gish, W. 2000. MaskerAid: A performance enhancement to RepeatMasker. Bioinformatics 16: 1040-1041.
Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Rapp, B.A., and Wheeler, D.L. 2000. GenBank. Nucleic Acids Res. 28: 15-18.
Birney, E., Bateman, A., Clamp, M.E., and Hubbard, T.J. 2001. Mining the draft human genome. Nature 409: 827-828.
Cole, S.T., Eiglmeier, K., Parkhill, J., James, K.D., Thomson, N.R., Wheeler, P.R., Honore, N., Garnier, T., Churcher, C., Harris, D. 2001. Massive gene decay in the leprosy bacillus. Nature 409: 1007-1011.
Crollius, H.R., Jaillon, O., Bernot, A., Dasilva, C., Bouneau, L., Fischer, C., Fizames, C., Wincker, P., Brottier, P., Quetier, F. 2000. Estimate of human gene number provided by genome-wide analysis using Tetraodon nigroviridis DNA sequence. Nat. Genet. 25: 235-238.
Dunham, I., Shimizu, N., Roe, B.A., Chissoe, S., Hunt, A.R., Collins, J.E., Bruskiewich, R., Beare, D.M., Clamp, M., Smink, L.J. 1999. The DNA sequence of human chromosome 22. Nature 402: 489-495.
Ewing, B. and Green, P. 2000. Analysis of expressed sequence tags indicates 35,000 human genes. Nat. Genet. 25: 232-233.
Gerstein, M. 1988. Patterns of protein-fold usage in eight microbial genomes: A comprehensive structural census. Proteins 33: 518-534.
Gerstein, M. 1997. A structural census of genomes: Comparing bacterial, eukaryotic, and archaeal genomes in terms of protein structure. J. Mol. Biol. 274: 562-576.
Glusman, G., Yanai, I., Rubin, I., and Lancet, D. 2001. The complete human olfactory subgenome. Genome Res. 11: 685-702.
Harrison, P., Echols, N., and Gerstein, M. 2001. Digging for dead genes: An analysis of the characteristics of the pseudogene population in the C. elegans genome. Nucleic Acids Res. 29: 818-830.
Hattori, M., Fujiyama, A., Taylor, T.D., Watanabe, H., Yada, T., Park, H.S., Toyoda, A., Ishii, K., Totoki, Y., Choi, D.K. 2000. The DNA sequence of human chromosome 21. The chromosome 21 mapping and sequencing consortium. Nature 405: 311-319.
Hegyi, H. and Gerstein, M. 1999. The relationship between protein structure and function: A comprehensive survey with application to the yeast genome. J. Mol. Biol. 288: 147-164.
Hoersch, S., Leroy, C., Brown, N.P., Andrade, M.A., and Sander, C. 2000. The GeneQuiz Web server: Protein functional analysis through the Web. Trends Biochem. Sci. 25: 33-35.
Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W. 2001. Initial sequencing and analysis of the human genome. International Human Genome Sequencing Consortium. Nature 409: 860-921.
Makalowski, W., Zhang, J., and Boguski, M.S. 1996. Comparative analysis of 1,196 orthologous mouse and human full-length mRNA and protein sequences. Genome Res. 6: 846-857.
Mighell, A.J., Smith, N.R., Robinson, P.A., and Markham, A.F. 2000. Vertebrate pseudogenes. FEBS Lett. 468: 109-114.
Nei, M., Gu, X., and Sitnikova, T. 1997. Evolution by the birth-and-death process in multigene families of the vertebrate immune system. Proc. Natl. Acad. Sci. 94: 7799-7806.
Pearson, W.R., Wood, T., Zhang, Z., and Miller, W. 1997. Comparison of DNA sequences with protein sequences. Genomics 46: 24-36.
Petrov, D.A. 2001. Evolution of genome size: New approaches to an old problem. Trends Genet. 17: 23-28.
Qian, J., Luscombe, N.M., and Gerstein, M.B. 2001. Protein family and fold occurrence in genomes: Power-law behavior and evolution model. J. Mol. Biol. 313: 673-681.
Uechi, T., Tanaka, T., and Kenmochi, N. 2001. A complete map of the human ribosomal protein genes: Assignment of 80 genes to the cytogenetic map and implications for human disorders. Genomics 72: 223-230.
Vanin, E.F. 1985. Processed pseudogenes: Characteristics and evolution. Annu. Rev. Genet. 19: 253-272.
Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith, H.O., Yandell, M., Evans, C.A., Holt, R.A. 2001. The sequence of the human genome. Science 291: 1304-1351.
Wootton, J.C. and Federhen, S. 1996. Analysis of compositionally biased regions in sequence databases. Methods Enzymol. 266: 554-571.
Wright, F.A., Lemon, W.J., Zhao, W.D., Sears, R., Zhuo, D., Wang, J.-P., Yang, H.-Y., Baer, T., Stredney, D., Spitzner, J. 2001. A draft annotation and overview of the human genome. Genome Biol. 2: research0025.1-0025.18.
Yeh, R.-F., Lim, L.P., and Burge, C. 2001. Computational inference of homologous gene structures in the human genome. Genome Res. 11: 803-816.

Received July 23, 2001; accepted in revised form November 28, 2001.
