Nucl. Acids. Res. -- Jansen et al. 31 (8): 2242

Revisiting the codon adaptation index from a whole-genome perspective: analyzing the relationship between gene expression and codon occurrence in yeast using a variety of models

Ronald Jansen¹, Harmen J. Bussemaker³ and Mark Gerstein¹,²

¹ Department of Molecular Biophysics and Biochemistry, ² Computer Science, 266 Whitney Avenue, Yale University, PO Box 208114, New Haven, CT 06520, USA, ³ Department of Biological Sciences and Center for Computational Biology and Bioinformatics, Columbia University, 1212 Amsterdam Avenue MC2441, New York, NY 10027, USA

Ronald Jansen, Computational Biology Center, Memorial Sloan-Kettering Cancer Center, 307 East 63rd Street, New York, NY 10021, USA

Received August 5, 2002; Revised January 23, 2003; Accepted February 18, 2003

ABSTRACT

Highly expressed genes in many bacteria and small eukaryotes often have a strong compositional bias, in terms of codon usage. Two widely used numerical indices, the codon adaptation index (CAI) and the codon usage, use this bias to predict the expression level of genes. When these indices were first introduced, they were based on fairly simple assumptions about which genes are most highly expressed: the CAI was originally based on the codon composition of a set of only 24 highly expressed genes, and the codon usage on assumptions about which functional classes of genes are highly expressed in fast-growing bacteria. Given the recent advent of genome-wide expression data, we should be able to improve on these assumptions. Here, we measure, in yeast, the degree to which consideration of the current genome-wide expression data sets improves the performance of both numerical indices. Indeed, we find that by changing the parameterization of each model its correlation with actual expression levels can be somewhat improved, although both indices are fairly insensitive to the exact way they are parameterized. This insensitivity indicates a consistent codon bias amongst highly expressed genes. We also attempt direct linear regression of codon composition against genome-wide expression levels (and protein abundance data). This has some similarity with the CAI formalism and yields an alternative model for the prediction of expression levels based on the coding sequences of genes. More information is available at http://bioinfo.mbb.yale.edu/expression/codons.

INTRODUCTION

It is well known that highly expressed genes exhibit a strong bias for particular codons in many bacteria and small eukaryotes. One suggested explanation is the observation that there appears to be a relationship between tRNA abundance and codon bias (1–3). Several reviews on this topic have been published previously (4,5).

In 1987, the ‘codon adaptation index’ (CAI) was proposed as a quantitative way of predicting the expression level of a gene based on its codon sequence (1). More recently, the ‘codon usage’ was introduced as an alternative quantitative indicator (3). It also uses the occurrence of codons in a gene sequence to predict whether genes are likely to be highly expressed, although the formalism is quite different from the one used for the CAI. A related method, the codon bias formalism, is based on similar principles (6).

Expression level indicators such as these are widely used and are important in a variety of contexts. First, there is the annotation of genome sequences. The expression level indicators can serve as one of the variables to determine how likely it is that an open reading frame (ORF) is transcribed and translated into a protein product. Secondly, in heterologous gene expression, the codon-based expression indicators are helpful for finding the codon sequences that are most likely to yield high expression. The codon-based expression indicators and related methods are also often used as convenient ‘rules of thumb’ in other applications.

Given that the codon-based expression models have these important applications, it is perhaps surprising that they are still based on rather qualitative assumptions about gene expression. For instance, the parameters underlying the CAI model rely on the codon composition of only a limited set of highly expressed genes; to define the parameters in the CAI model (see below), Sharp and Li counted the codon frequency in only 24 highly expressed genes (1). About half of these genes are ribosomal; the remaining ones are mostly metabolic enzymes.

In the codon usage model, the parameters are based on a somewhat broader set of highly expressed genes. The codon usage model has mainly been applied to fast growing bacteria, for which, as Karlin et al. have shown, it is a reasonable assumption that ribosomal genes, chaperones, and translation processing factors are highly expressed (7,8).

In summary, the codon-based expression models are based on qualitative estimates of the expression levels of limited gene sets. But since these models were first proposed, several quantitative expression data sets, covering the majority of genes in a genome, have become available. This raises the natural question of whether we could improve the parameters of the codon-based expression indicators by considering larger sets of genes with more accurate expression data. We present the results of such a procedure here, using the expression information available for yeast.

In the following sections we briefly recap the CAI and codon usage formalisms. Later, we show how to calculate new parameters for these models. We also propose an alternative linear model to predict the expression levels from the codon composition of genes.

The CAI model
The CAI model assigns a parameter, termed ‘relative adaptiveness’ by Sharp and Li, to each of the 61 codons (stop codons excluded) (1). The relative adaptiveness of a codon is defined as its frequency relative to the most often used synonymous codon; note that this parameter is computed from a set of highly expressed genes G (we leave aside the question of how to define this set of genes for now). It is given by:

$$w_{aa,i} = \frac{f_{aa,i}}{f_{aa,\max}} \qquad (1)$$

where f_aa,i is the frequency of codon i (which encodes amino acid aa), and f_aa,max the frequency of the codon most often used for encoding amino acid aa in a set of highly expressed genes G. The relative adaptiveness parameter w_aa,i ranges from 0 to 1, with 0 indicating that a codon is not present at all in G, and 1 indicating the codon that occurs most often in G for a given amino acid.

The CAI of a gene g is then simply the geometric average of the relative adaptiveness of all codons in a gene sequence:

$$\mathrm{CAI}_g = \left( \prod_{i=1}^{N} w_i \right)^{1/N} \qquad (2)$$

Here, w_i is the relative adaptiveness of the ith codon in a gene with N codons. This formula can be transformed into:

$$\mathrm{CAI}_g = \prod_{k=1}^{61} w_k^{X_{k,g}} \qquad (3)$$

where w_k now represents the relative adaptiveness of the kth out of the 61 codons in the genetic code (excluding stop codons); X_k,g is the fraction of codon k among the total number of codons in gene g:

$$X_{k,g} = \frac{C_{k,g}}{\sum_{j=1}^{61} C_{j,g}} \qquad (4)$$

where C_k,g is the number of times codon k appears in gene g. Note that w_k = w_k(G) in equation 3 is dependent on the set of highly expressed genes G.

Like the relative adaptiveness, the CAI also ranges from 0 to 1. Higher CAI values indicate genes that are more likely to be highly expressed.
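
The definitions above (a relative adaptiveness per codon, then a geometric mean over the codons of a gene) can be sketched in Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, the standard genetic code is assumed, and returning 0 when a gene contains a codon that never occurs in G is a simplifying choice made here, since the text does not say how such cases were handled.

```python
import math
from collections import Counter

# Standard genetic code in TCAG order; "*" marks the three stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def codons(seq):
    """Split a coding sequence into sense codons (stops and bad triplets dropped)."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    triplets = (seq[i:i + 3] for i in range(0, usable, 3))
    return [c for c in triplets if CODON_TABLE.get(c, "*") != "*"]


def relative_adaptiveness(reference_seqs):
    """w = f / f_max among synonymous codons, counted over the gene set G."""
    counts = Counter(c for s in reference_seqs for c in codons(s))
    best = {}  # highest codon count seen for each amino acid
    for codon, n in counts.items():
        aa = CODON_TABLE[codon]
        best[aa] = max(best.get(aa, 0), n)
    # Amino acids that never occur in G get no parameters at all.
    return {c: counts[c] / best[aa]
            for c, aa in CODON_TABLE.items()
            if aa != "*" and best.get(aa)}


def cai(seq, w):
    """Geometric mean of w over the codons of one gene."""
    ws = [w[c] for c in codons(seq) if c in w]
    if not ws or min(ws) == 0.0:
        return 0.0  # assumption: an unseen codon drives the index to zero
    return math.exp(sum(math.log(x) for x in ws) / len(ws))
```

For a toy reference set such as `relative_adaptiveness(["AAAAAAAAG", "GATGAT"])`, the lysine codon AAA receives w = 1 and AAG receives w = 0.5, so a gene made of one AAA and one AAG has a CAI equal to the square root of 0.5.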

The codon usage model
Karlin et al. define the ‘codon bias’ of a gene g relative to a gene set G as (4):

where p_aa(f) is the fraction of amino acid aa in gene g; f(x, y, z) the frequency of a codon triplet (x, y, z) in gene g normalized such that f(x, y, z) = 1 if (x, y, z) is the most common synonymous codon; g(x, y, z) is the corresponding normalized codon frequency in gene set G. Equation 5 is written in the notation of Karlin et al. We can rewrite equation 5 in our own notation as follows:

where X_k,g and X_k,G are defined as in equation 4. Note that k has replaced (x, y, z) as the summation index. Given these definitions, Karlin et al. define an expression level measure E(g) as follows (8):

where the gene set C comprises all genes in the genome, RP the ribosomal proteins, Ch chaperones, and Tf translation processing factors. E(g) is close to zero if gene g has a codon composition close to the average composition of the genome [E(g) $\to$ 0 because B(g|C) $\to$ 0], while E(g) would take on very large values if the codon composition of gene g is close to the composition of ribosomal genes, chaperones and translation processing factors [E(g) $\gg$ 1 because B(g|RP), B(g|Ch), B(g|Tf) $\to$ 0]. The idea is that highly expressed genes tend to have higher values of E than lowly expressed genes.

Karlin et al. have shown that highly expressed genes can best be differentiated from lowly expressed genes in the multidimensional space of the different codon bias terms B(g|RP), B(g|Ch) and B(g|Tf) (8). However, in this study, we use the simplified expression measure E(g|G), defined as:

$$E(g|G) = \frac{B(g|C)}{B(g|G)} \qquad (8)$$

where G is a set of highly expressed genes. Thus, E is dependent on the set G that can be chosen in different ways. In other words, the parameters of the model are the 61 codon fractions X_k,G in the gene set G (see equation 6).

Given this formal description of the CAI and the codon usage, the question is how we can use the genome-wide expression data to optimize the 61 parameters in the two models with respect to the prediction of expression levels.

MATERIALS AND METHODS

Expression data
We give an overview of the expression data we used in this study in Supplementary Material, Table S1. Briefly, we have combined different publicly available Affymetrix gene chip and SAGE data sets into one reference mRNA expression data set, and two publicly available 2D-gel electrophoresis data sets into one reference protein abundance data set (9–14). We have described this procedure, which helps to remove noise and errors from the data, previously (15). The codon composition of genes fundamentally affects the mechanism of protein translation; thus, the protein abundance data might contain more useful information than the mRNA expression data. On the other hand, the protein abundance data are available only for a very limited subset of 150 genes while there is a substantially larger amount of mRNA expression data (6071 genes). [For our calculations, we only considered those genes in the reference mRNA expression set that have an expression level of more than 0.5 copies/cell; this is the case for 4270 genes. Smaller expression levels are too close to the resolution limits of the gene chips and therefore too noisy (see also captions of Tables 1 and 2).]

Table 1. The Pearson and rank correlation of the original CAI and codon usage models with various evaluation sets of expression data

Table 2. The Pearson and rank correlations of the CAI and codon usage models based on the new parameters

As described previously (15), we term the combination of a gene set (with G_Prot referring to the protein abundance and G_mRNA to the mRNA expression reference set) and an expression level or weight (a_Prot for protein and a_mRNA for mRNA abundance) a ‘weighted population’. Thus, three different weighted populations can be formed from our reference data sets: [G_Prot, a_Prot], [G_Prot, a_mRNA], and [G_mRNA, a_mRNA]. ([G_mRNA, a_Prot] is not meaningful since a_Prot is not defined on all genes in G_mRNA.) In the following we use all three populations for the parameterization of the CAI and the codon usage models.

Parameterization of the CAI and codon usage models with whole-genome expression data
Figure 1 schematically shows the procedure we used to parameterize the CAI and codon usage models with the expression data. We start by selecting one of the three populations mentioned above as an evaluation set. The evaluation set is later used to evaluate how well the CAI or codon usage model predicts actual expression levels. We also need to define a parameterization set. The parameterization set is the set of highly expressed genes G (see Introduction); it is used to calculate the parameters w_k(G) for the CAI (see equation 3) and the parameters X_k,G for the codon usage (see equation 6). To define the parameterization set, we choose one of the three populations and an expression level threshold T. We only include those genes of the population in the parameterization set whose expression level exceeds this threshold. With the parameters in hand, we are able to compute CAI and codon usage values for all genes in the evaluation set. We evaluate how well the CAI and codon usage models predict expression levels with two figures of merit: the Pearson correlation and the Spearman rank correlation. {Given a set of abundance levels a in the evaluation set, and a vector of CAI or codon usage values (C), we calculate the Pearson correlation as corr[log(a),log(C)] and the rank correlation as corr[rank(a),rank(C)]}.
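
The two figures of merit can be written out directly from the definitions corr[log(a),log(C)] and corr[rank(a),rank(C)]. The following is a small sketch in plain Python; the function names are hypothetical, and giving tied values the average of their ranks is an assumption made here (the text does not say how ties were treated). All abundances and model values must be strictly positive for the logarithm to be defined.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r


def figures_of_merit(abundance, score):
    """Return (Pearson on logs, Spearman rank correlation)."""
    log_a = [math.log(a) for a in abundance]
    log_c = [math.log(c) for c in score]
    return pearson(log_a, log_c), pearson(ranks(abundance), ranks(score))
```

Computing the rank correlation as a Pearson correlation of ranks keeps the sketch free of external dependencies; a statistics library would normally be used instead.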

Figure 1. Our general procedure for the parameterization of the CAI and codon usage models. We first choose an expression data set and an arbitrary expression level threshold T to differentiate highly from lowly expressed genes. The highly expressed genes with expression levels greater than T define the parameterization set. Based on this we calculate new model parameters. Finally, to evaluate the performance of the models, we choose another expression data set (we term this the evaluation set): we calculate the CAI and codon usage values for all genes in the evaluation set and then measure the correlation between the model values and the actual expression levels as a figure of merit.

We use the rank correlation as an additional diagnostic to the (linear) Pearson correlation because the relationship between CAI or codon usage values and expression levels is of a non-linear nature (see Supplementary Material).

We can iterate the procedure by changing the expression level threshold T and repeating the subsequent steps until we arrive at an optimal figure of merit. This gives us optimal parameters for the CAI and codon usage models.
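
The iteration over T can be sketched as a simple scan. The helper below is hypothetical and deliberately generic: `fit`, `score` and `merit` stand in for "compute model parameters from G", "compute the CAI or codon usage value of one gene" and "compute a correlation", respectively. Using every observed expression level as a candidate threshold is an assumption of this sketch; the text does not state which thresholds were tried.

```python
def scan_thresholds(param_pop, eval_pop, fit, score, merit):
    """
    param_pop, eval_pop: lists of (gene, expression_level) pairs.
    fit(genes) -> model parameters from the parameterization set G.
    score(gene, params) -> model value for one gene.
    merit(levels, values) -> figure of merit (larger is better).
    Returns (best merit, threshold T, size of G) or None if no set was tried.
    """
    eval_levels = [level for _, level in eval_pop]
    best = None
    for T in sorted({level for _, level in param_pop}):
        G = [gene for gene, level in param_pop if level > T]
        if not G:
            break  # no gene exceeds T, so no parameters can be computed
        params = fit(G)
        values = [score(gene, params) for gene, _ in eval_pop]
        m = merit(eval_levels, values)
        if best is None or m > best[0]:
            best = (m, T, len(G))
    return best
```

Because the same data can serve as both parameterization and evaluation population, the maximum found this way is an optimistic estimate of how well the parameters generalize.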

Example of the CAI parameterization
Figure 2 shows a specific example of the parameterization of the CAI with [G_Prot, a_Prot] as both the parameterization and evaluation population and illustrates how the figure of merit (Pearson correlation of the CAI values and the evaluation set) changes as a function of the expression level threshold T. When the threshold reaches T = 66 200 protein copies/cell the Pearson correlation reaches a maximum. At this point, there are only 21 genes in the parameterization set. The maximum correlation is slightly greater than the correlation between the CAI based on the original parameters by Sharp and Li (1) and the same evaluation set.

Figure 2. An example of the parameterization of the CAI with expression data. Here, we use [G_Prot, a_Prot] for both the parameterization and the evaluation steps. The Pearson correlation of the CAI with the evaluation set (left ordinate) is shown as a function of the expression level threshold T, which defines the parameterization set of highly expressed genes. The right ordinate shows the number of genes in the parameterization set for a given threshold T. At T = 66 200 proteins/cell, the Pearson correlation reaches a maximum. This correlation is slightly higher than the correlation of the original CAI model with the evaluation set (dashed line).

Linear model
In addition to the determination of the parameters for the CAI and codon usage models it is also possible to relate expression levels and codon composition of genes more directly.

The CAI formalism itself, slightly modified, suggests a multivariate linear model for doing this. Starting with equation 3, we can take the logarithm on both sides to obtain:

$$\log(\mathrm{CAI}_g) = \sum_{k=1}^{61} X_{k,g} \log(w_k) \qquad (9)$$

If we introduce v_k $\equiv$ log(w_k) and consider that the log(CAI) is related to the logarithm of the gene expression levels, we can suggest the following linear model to predict the expression level a_g of a gene g:

$$y_g = v_0 + \sum_{k=1}^{61} v_k X_{k,g} \qquad (10)$$

with the residuals

In equation 10, y_g is the predicted expression level, the codon fractions X_k,g are the predictor variables and v₀, ..., v₆₁ the parameters. Note that we have introduced an intercept parameter v₀ in equation 10, for which there is no equivalent in equation 9. We can then perform a standard multivariate linear regression to estimate the model parameters v₀, ..., v₆₁ by minimizing the deviance:

Reducing the number of parameters in the linear model. One problem of this regression approach is obviously the large number of parameters. This may result in overfitting, even when the regression is applied to the largest population [G_mRNA, a_mRNA], which contains 4270 data points.

We avoided this problem by deriving a linear model that consists of fewer parameters. This is done via a forward selection of parameters, adding one predictor variable at a time (16). A similar procedure has previously been used in finding significant promoter sequence motifs (17).

We start with a model of just one predictor variable (codon fraction X_k):

which gives the residuals:

and the deviance:

Note that the deviance is dependent on the codon k. This allows us to find the codon that produces the smallest error and thus select the first predictor variable. We add this codon to a ‘model set’ M.

Then we iterate this procedure. Given a model set M of codons with optimal parameter estimates, the linear model is:

This model gives the new residuals:

We then choose the next predictor variable by finding the codon k that minimizes:

This codon is then added to model set M, and we iterate the procedure described in equations 16–18. Note that the interpretation of equation 18 is that the optimal predictor variable is orthogonal to the linear model of equation 16.
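
One way to read this forward selection is as a stagewise fit: at each step every remaining codon fraction is regressed on the current residuals, and the codon giving the smallest deviance joins the model set M. The sketch below follows that reading under stated assumptions: the function names are hypothetical, earlier coefficients are not refitted when a new codon is added, a per-step intercept is included, and ties are broken alphabetically. The procedure actually used may differ in these details.

```python
def simple_fit(x, y):
    """Least-squares fit of y ~ b0 + b1 * x; returns (b0, b1, deviance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0.0:  # constant predictor: only the mean can be fitted
        return my, 0.0, sum((b - my) ** 2 for b in y)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    b0 = my - b1 * mx
    dev = sum((b - b0 - b1 * a) ** 2 for a, b in zip(x, y))
    return b0, b1, dev


def forward_select(X, y, n_steps):
    """
    X: dict mapping codon -> list of codon fractions, one entry per gene.
    y: list of expression levels (for example log-transformed), same gene order.
    Returns the model set M as a list of (codon, slope) in order of selection.
    """
    residuals = list(y)
    model = []
    remaining = set(X)
    for _ in range(min(n_steps, len(remaining))):
        fits = {k: simple_fit(X[k], residuals) for k in remaining}
        # Smallest deviance wins; the codon name breaks ties deterministically.
        k = min(remaining, key=lambda c: (fits[c][2], c))
        b0, b1, _ = fits[k]
        residuals = [r - b0 - b1 * x for x, r in zip(X[k], residuals)]
        model.append((k, b1))
        remaining.remove(k)
    return model
```

In practice the loop would stop as soon as the newly added parameter fails the significance test, rather than after a fixed number of steps.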

Significance of predictor variables. Each time we add a new predictor variable to the model, we need to check whether the corresponding parameter is significant. We can do this by observing the t statistic for a parameter estimate v_k. The ratio of a parameter estimate to its standard deviation follows a t-distribution and a P-value based on this distribution can be used for testing the hypothesis that v_k = 0. The t statistic and its corresponding P-values can be gathered from the standard output of a linear regression when performed in various statistical software packages (here, we used the publicly available R statistical computing environment, http://www.r-project.org/, as well as MATLAB for these computations).

To accept a predictor variable as significant we required that the P-value of the t statistic stay below $\alpha$ = 0.05. Since we were choosing from several possible predictor variables at each step, a Bonferroni correction is necessary for this statistical test. This is equivalent to multiplying the P-value for a parameter by the number of remaining possible predictor variables. Given that there are already N_M parameters in the model set M, we have a choice of 61 – N_M remaining predictor variables, and the condition for significance thus becomes:

$$P' = (61 - N_M)\,P < \alpha \qquad (19)$$
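
As a worked illustration of equation 19, the corrected test is a single comparison. The helper name and its default arguments are hypothetical; the P-value itself would come from the regression output of a statistics package.

```python
def is_significant(p_value, n_in_model, alpha=0.05, n_codons=61):
    """Bonferroni-corrected test of equation 19: (61 - N_M) * P < alpha."""
    remaining = n_codons - n_in_model  # predictor variables still available
    return remaining * p_value < alpha
```

With an empty model set, an uncorrected P-value of 0.001 is rejected because 61 × 0.001 = 0.061 exceeds 0.05, whereas the same P-value passes once 20 codons are already in the model, since 41 × 0.001 = 0.041.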

RESULTS

Parameterization of the CAI and codon usage models
Table 1 shows the performance of the CAI and the codon usage with the original parameters in terms of the Pearson and rank correlation with the expression data. Here, the CAI parameters were taken from the original publication by Sharp and Li (1), which stem from 24 highly expressed genes. The situation is a little bit more complicated for the codon usage, in that previously the codon usage had not been explicitly used for the prediction of expression levels in yeast, but only in prokaryotes. However, to come up with a set of ‘original’ parameters, we computed them from the set of 128 ribosomal genes, following the recommendation of Karlin et al. who showed that, in yeast, ribosomal proteins exhibit the largest codon bias amongst all gene classes (4).

Table 2 generalizes the example shown in Figure 1 by listing all possible evaluation and parameterization populations for both the CAI and the codon usage. Note that the parameters of the CAI and the codon usage are in each case dependent on the parameterization population and the expression level threshold T. (The threshold T defines the number of ORFs with expression levels greater than T.) The table shows the maximum Pearson and rank correlations that can be achieved by varying T, the increase of the correlation compared with the original models (‘$\Delta$ correlation’), and the size of the parameterization set at the maximum (rank) correlation, measured in number of ORFs.

A mixed picture emerges from this comprehensive collection of statistics. In many cases the new parameters improve the performance of the CAI and the codon usage (gray and black shaded squares in Table 2), but sometimes the performance is also slightly lower.

The codon usage models with the new parameters generally perform better than the model with the original parameters ($\Delta$ correlation is >1% six out of nine times for both the Pearson and rank correlations), whereas the improvements for the CAI are less obvious ($\Delta$ correlation >1% three out of nine times for both the Pearson and rank correlations).

One important observation is that the parameterization sets for which we found optimal parameters are usually very small (on the order of 100 genes or less) for both the CAI and the codon usage. This is despite the fact that we used whole-genome expression data in our calculations. An extreme example is the codon usage with parameterization population [G_Prot, a_Prot] and evaluation population [G_Prot, a_mRNA]: here, the optimal parameterization set contains only one gene (the phosphopyruvate hydratase ENO2). This alone yields a rank correlation of 0.66 with the expression data.

Linear model
We fitted the linear model of equation 16 to the population [G_mRNA, a_mRNA] according to the iterative procedure described in the Materials and Methods. We tested models ranging from one to 61 codons (= predictor variables). The largest model for which all parameters were significant was a model with 20 codons. (The results for each model are shown in the Supplementary Material.) The values of these 20 codon parameters are shown in Figure 3. We have only used [G_mRNA, a_mRNA] as the parameterization set because the other possible populations are too small (150 genes) relative to the possible number of parameters. When we used the reduced parameter procedure with [G_Prot, a_Prot] or [G_Prot, a_mRNA] as the parameterization populations, we found that linear models with only two predictor variables already exceed the critical P-value of 5% (see Materials and Methods), thus making them of little use for predicting expression levels.

Figure 3. Codons that are common in highly expressed genes. There are four columns for each codon. The first two columns show the relative adaptiveness values for the CAI and codon usage (CU) according to equation 1. The third column shows the regression parameters of the linear model (LM). Note that there are only 20 values because the model contains only 20 codons as predictor variables. The fourth column shows the relative adaptiveness values for the genome as a whole. The relative adaptiveness values are normalized to 100 for the most frequent synonymous codons. The regression parameters are not normalized.

The 20 codons that are significant predictor variables in the linear model for [G_mRNA, a_mRNA] represent 13 different amino acids (see Fig. 3). Of the seven remaining amino acids, five are under-represented in highly expressed genes (Asp, His, Ile, Met and Tyr) while two of them are roughly equally represented in highly and lowly expressed proteins (15,18). Four of the 20 chosen predictor variables (= codon compositions) are negatively correlated with expression levels. The parameters of the linear model and corresponding codons (= predictor variables) are discussed in more detail in the next section. Details of the regression results (parameters, P-values, etc.) can be found in the Supplementary Material.

The bottom of Table 2 shows the performance of the linear model compared with the CAI and codon usage. There is no possible comparison to a set of original parameters, as in the case of the CAI and the codon usage. Instead, we compared the performance of the linear model with the performance of the original CAI and codon usage models on the same evaluation sets. The left half of the ‘$\Delta$ correlation’ column in Table 2 refers to the difference with the CAI correlation, whereas the right half gives the difference with the codon usage correlation. (There are three possible choices for the evaluation set.) It is clear from the results that the best performance is obtained when the parameterization and evaluation populations are both [G_mRNA, a_mRNA]. (This should be expected, given that the model parameters were optimized on this set.)

When [G_mRNA, a_mRNA] is both the parameterization and evaluation population, the Pearson correlation of the linear model with the expression data is 0.75. This is slightly higher than the best Pearson correlations for the CAI and codon usage models. (The CAI has a maximum Pearson correlation of 0.72, while the codon usage has a maximum Pearson correlation of 0.71.) In terms of the rank correlation, the best codon usage model is somewhat better than the linear model (0.60 versus 0.56), while the CAI performs worse than both of the other methods (0.46).

Preferential codons in yeast
As mentioned at the beginning, it is important for heterologous gene expression to encode proteins with sequences that yield optimal expression. A good rule of thumb for finding such an optimal sequence is to choose codons that are most frequent in highly expressed genes. The CAI model provides an explicit way of finding such codons; the most frequent codons simply have the highest relative adaptiveness values, and sequences with higher CAIs are preferred over those with lower CAIs. The codon usage formalism does not explicitly use relative adaptiveness values, but they can be easily calculated with equation 1 from the parameterization sets that yield optimal codon usage parameters. A third possibility is to look at the parameters of the linear regression with respect to which codons are more preferred. (This is of course only possible for those codons that are predictor variables in the linear model.)

Figure 3 shows the relative adaptiveness values for the CAI and codon usage (when the parameterization and evaluation populations are both [G_Prot, a_Prot], with the Pearson correlation as the figure of merit) together with the parameter values of the linear regression (LM) with [G_mRNA, a_mRNA]. For comparison, we also show the relative adaptiveness values for the genome as a whole. Codons with relative adaptiveness values of 100% (= preferential codons) are shown in black. It is evident that both the CAI and the codon usage give the same preferential codons.

The relative adaptiveness values for the CAI are computed from the 21 most abundant proteins in a_Prot, whereas the codon usage values stem from the four most abundant proteins (see Table 2). Note that the preferential codons for both the CAI and the codon usage stay the same regardless of which parameterization and evaluation sets we choose (with the Pearson correlation as the figure of merit). The only exception is when we choose [G_mRNA, a_mRNA] as both the parameterization and evaluation set for the codon usage. In that case, the optimal parameterization set becomes relatively large (253 ORFs) such that several of the preferential codons are the same as the ones for the genome as a whole.

The parameters of the linear model are shown in the third column for each codon in Figure 3. Note that the parameters v_k give the expected change of expression level for an increase in the composition of the corresponding codon k, given that the composition of the other codons in the model stays the same:

$$v_k = \frac{\partial y_g}{\partial X_{k,g}} \qquad (20)$$

One would expect the regression parameters to roughly correlate with the relative adaptiveness values of the CAI and codon usage. Because the number of parameters in the linear model is less than the total number of codons, this comparison is only possible for synonymous codons of seven amino acids (see Fig. 3).

Contrary to our expectation, the rank order of the regression parameters was different from that of the relative adaptiveness values of the CAI and codon usage for three of these seven amino acids (Val, Cys and Arg). One (non-biological) explanation for this different order might be the sensitivity of the parameters. This is in fact the case for Val and Cys where the 95% confidence intervals of the parameter values overlap (see Supplementary Material). However, parameter sensitivity does not explain the different codon order for arginine; the codon CGT has a much higher parameter value than the codon AGA (9.7 as opposed to 4.7), contrary to the ranking of relative adaptiveness values (see Fig. 3).

We suggest the following explanation: in contrast to the linear model parameters, the relative adaptiveness values describe the global enrichment of a codon in highly expressed genes with no restrictions on the compositions of the other codons. (This is confirmed by the fact that the Pearson correlation between the logarithms of a_mRNA and the codon composition of AGA is larger than that between a_mRNA and CGT.) Thus, in the case of arginine, the reason for the discrepancy between the linear model and the CAI/codon usage might be that yeast cells preferentially use AGA codons for arginine in highly expressed genes (explaining the CAI value), but that the supply of the corresponding tRNA is already strongly exhausted for fast growing cells. Thus, to achieve additional translation of arginine at high rates, the cell might need to use the supply of another tRNA for arginine (explaining the higher regression parameter for CGT). Note that the tRNA gene copy number is 11 for the AGA codon and 6 for the CGT codon (the highest and second highest among all arginine codons). This way, the cell would make optimal use of the supply of arginine tRNAs when it is already growing fast.

DISCUSSION

Quantitative versus qualitative, genome-wide versus few genes
The CAI and codon usage models are originally based on somewhat qualitative assumptions about the expression levels of relatively few genes. This was our motivation for using quantitative, genome-wide expression data to recalculate optimal model parameters. These new parameters sometimes lead to a slightly better correlation of the codon-based expression models with expression data according to several measures, although the improvements are marginal and the results are mixed.

Small parameterization sets are sufficient
Furthermore, the parameterization sets that yielded optimalparameters for the CAI and codon usage are often very smallcompared to the number of genes in the genome—very muchin the same way that the original parameterization sets weresmall (see Table 1). Thus, very few highly expressed genes seemto be sufficient to describe the overall codon bias in yeast.This shows that the original procedures for determining theparameters of the CAI and codon usage were indeed quite prescient.The CAI and codon usage models are relatively insensitive tothe exact choice of highly expressed genes.

One explanation for this observation might be that althoughthe optimal parameterization sets are small compared to thesize of the genome, their share of the overall number of transcriptsand protein copies in the cell is much larger; they may in factdominate the overall codon composition of transcripts and proteins(18). This situation can be compared with the way a financialmarket index, composed of very few stocks with very high marketcapitalization, can be a very good approximation for the valueof a total market, which consists of perhaps thousands of individualstocks.

Thus, to obtain robust parameters for the CAI and codon usage models, it often seems sufficient to infer them from rather qualitative information about gene expression levels. For instance, it may be enough to infer from information about biological function whether a group of genes is highly expressed. Note that, using our parameterization procedure, we achieved a Pearson correlation of 0.72 between the codon usage model and the expression data (when both the evaluation and parameterization population are [G_mRNA, a_mRNA], see Table 2). This is only a marginal improvement over the original parameters (Pearson correlation 0.71, see Table 1) that were derived from the codon composition of the 128 ribosomal proteins in yeast.
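For readers who wish to reproduce this kind of evaluation, the following is a minimal Python sketch of the two measures used to compare model output with expression data. It assumes that "rank correlation" refers to Spearman's rank correlation (the Pearson correlation of the ranks), which is our assumption; the function names and the toy numbers are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy example: model scores versus log expression for five genes.
model_score = [0.2, 0.5, 0.9, 0.4, 0.7]
log_expression = [0.1, 0.8, 2.0, 0.5, 0.6]
print(pearson(model_score, log_expression))
print(spearman(model_score, log_expression))
```

The Pearson measure is sensitive to the scale of the values, whereas the rank measure depends only on their ordering, which is one reason the two can rank the models differently.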

Comparison of the CAI, codon usage and linear models
In contrast to the linear model and the codon usage, the parameters of the CAI are normalized by synonymous codon usage, a constraint that is not present in the other two models. It is therefore remarkable that the CAI model (given the best parameterization set) usually performs as well as the other two models. The only notable exception to this general rule is perhaps the relatively low rank correlation of the CAI with [G_mRNA, a_mRNA], which is only 0.49 under the best circumstances (compared with 0.60 for the codon usage and 0.56 for the linear model).

The linear model achieves the highest Pearson correlation (0.75) with [G_mRNA, a_mRNA], while the comparable values for the CAI and codon usage are slightly lower (0.72 and 0.71).
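The normalization by synonymous codon usage that distinguishes the CAI can be illustrated with a minimal Python sketch. It assumes the standard definition of Sharp and Li (1): the relative adaptiveness of a codon is its count in the parameterization set divided by the count of the most frequent synonymous codon, and the CAI of a gene is the geometric mean of these values over its codons. The codon table is a three-amino-acid subset chosen for brevity, and the pseudocount of 0.5 for unobserved codons, the function names and the toy sequences are our assumptions for illustration.

```python
import math
from collections import Counter

# Subset of the standard genetic code, for illustration only.
SYNONYMS = {
    "Arg": ["AGA", "AGG", "CGT", "CGC", "CGA", "CGG"],
    "Lys": ["AAA", "AAG"],
    "Glu": ["GAA", "GAG"],
}
CODON_TO_AA = {c: aa for aa, family in SYNONYMS.items() for c in family}


def codons(seq):
    """Split a coding sequence into in-frame codons covered by SYNONYMS."""
    usable = len(seq) - len(seq) % 3
    triplets = (seq[i:i + 3] for i in range(0, usable, 3))
    return [c for c in triplets if c in CODON_TO_AA]


def relative_adaptiveness(reference_seqs, pseudocount=0.5):
    """w[codon] = count / (largest count among its synonymous codons)."""
    counts = Counter(c for s in reference_seqs for c in codons(s))
    w = {}
    for family in SYNONYMS.values():
        # Assumption: unobserved codons get a small pseudocount to avoid log(0).
        fam_counts = {c: counts[c] or pseudocount for c in family}
        top = max(fam_counts.values())
        for c in family:
            w[c] = fam_counts[c] / top
    return w


def cai(seq, w):
    """Geometric mean of the relative adaptiveness over the codons of seq."""
    cs = codons(seq)
    if not cs:
        raise ValueError("sequence contains no codons covered by the table")
    return math.exp(sum(math.log(w[c]) for c in cs) / len(cs))


# Toy parameterization set: one "highly expressed" sequence.
reference = ["".join(["AGA", "AGA", "AGA", "CGT", "AAG", "AAG", "AAA", "GAA"])]
w = relative_adaptiveness(reference)
print(cai("AGAAAGGAA", w))  # uses only the preferred codons
print(cai("CGTAAA", w))     # uses less preferred codons
```

Because every synonymous family is rescaled so that its most frequent codon has the value 1, the CAI parameters carry no information about amino acid composition, unlike the parameters of the other two models.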

Can the models be improved?
The main motivation of our study was the question of whether it would be possible to improve on existing and commonly used codon-based models for predicting expression levels. The results showed that the original models are relatively robust to the exact way they are parameterized. Perhaps such models could still be improved if other protein properties were included as additional features in the prediction.

We have explicitly tested whether one protein property, namely protein length, can aid in improving the prediction performance. It has previously been observed that longer proteins often tend to be less highly expressed than shorter ones (18,19). For instance, in the linear regression model one could explicitly consider protein length by replacing the codon fractions X_k with the number of codons (equation 16). However, we found that this severely decreases the correlation between the model predictions and actual expression data (data not shown).

Codon composition is often the strongest predictor of expression levels
Pavesi (20) proposed a model for predicting expression levels based on several different protein properties (the CAI, the codon bias index, an entropy score relating to synonymous codon usage, a TATA-box score and a pyrimidine bias index) (21). He showed in a regression analysis that the two significant parameters of the model were the CAI and the entropy score, both measures relating to synonymous codon usage. Pavesi reported a Pearson correlation of 0.76 with a select set of 621 expression levels derived from SAGE data.

Linear model
As an alternative to the CAI and codon usage models, we have proposed a simple linear model that relates codon fractions and expression levels of genes. An advantage of the linear model is that, unlike the numerical values from the CAI and the codon usage, the predicted expression levels have the same dimension as the logarithm of the actual expression levels and are directly comparable with them. The linear model predicts an expression level of 1.7 copies/cell for transcripts from sequences with average codon fractions; this is equal to the average expression level in a_mRNA. (This follows from equation 11 and the fact that the average residual in the model is equal to zero.)

We have suggested a natural, intuitive justification for the linear model, based on the CAI formalism. Of course there might be better alternatives than the linear model. From a mathematical standpoint, the linear regression is relatively simple and involves much less complex computations than non-linear regressions.
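The statement about sequences with average codon fractions can be made explicit. The following is a sketch under an assumed model form, since equation 11 is not reproduced in this section: y_g denotes the logarithm of the expression level of gene g, X_gk its codon fractions, β_0 and β_k the fitted parameters, ε_g the residual and N the number of genes. These symbols and the explicit intercept are our notation and may differ from the original equation.

```latex
\begin{align*}
y_g &= \beta_0 + \sum_k \beta_k X_{gk} + \varepsilon_g , \\
\frac{1}{N}\sum_{g=1}^{N} y_g
  &= \beta_0 + \sum_k \beta_k \bar{X}_k + \frac{1}{N}\sum_{g=1}^{N} \varepsilon_g
   = \beta_0 + \sum_k \beta_k \bar{X}_k ,
\qquad \bar{X}_k = \frac{1}{N}\sum_{g=1}^{N} X_{gk} .
\end{align*}
```

Because the average residual vanishes, the model output at the average codon fractions equals the average of the log expression levels over the genes used in the fit.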

Applications
Overall, it seems justified to use the CAI, codon usage or related measures as ‘rules of thumb’ in a variety of applications such as heterologous gene expression, either based on the original parameters or on our newly optimized ones. For the annotation of genomes, all three models seem to be useful; however, they should of course only be used in conjunction with other gene-finding criteria (22).

The 20-parameter linear model allows us to compare the codon parameters for seven amino acids. Surprisingly, the linear model parameters suggest a different rank order for the codons of the amino acid arginine. We have suggested the explanation that fast-growing yeast cells have already exhausted the supply of the most abundant tRNA, and thus have to make use of the tRNA corresponding to the second best codon.

General issues of data quality
The value of the codon-based expression indicators can perhaps be appreciated by comparing them to the correlation of mRNA and protein abundance data in general. The correlation for the two populations [G_Prot, a_mRNA] and [G_Prot, a_Prot] is 0.67, well within the range of the correlations in Tables 1 and 2 (13–15). One interpretation of this is that the codon-based expression indicators are actually just as good as mRNA expression data as an approximation of protein abundance levels.

Of course, the codon-based expression indicators yield static values, whereas gene expression is a dynamic process, with very different expression levels under different conditions. The expression data that we used in this study stems from experiments under very similar conditions, that is, yeast cells in vegetative growth on rich media (9–12). Thus, the prediction of expression levels based on codon composition should work best for these physiological situations, but might work less well for others. Coghlan et al. have pointed to the example of ENO1 and ENO2, which both exhibit strong codon biases: the former is repressed by high glucose concentrations whereas the latter is strongly induced (19). In general, the regulation of translation might be less flexible than the regulation of transcription because the abundance of charged tRNAs cannot be changed as flexibly as the abundance of transcription factors [there are 33 cognate tRNAs in yeast, but perhaps hundreds of transcription factors (23,24)].

Of course, there are many limitations of the expression data itself that might confound the relationship between expression levels and codon composition. The 2D-gel data is subject to many biophysical and biochemical constraints (13,14,25). The situation is somewhat better for the mRNA expression data, where we have more data resources that we combined in this study.

SUPPORTING WEBSITE

Additional data relating to our analysis is available at: http://bioinfo.mbb.yale.edu/expression/codons.

SUPPLEMENTARY MATERIAL

Supplementary Material is available at NAR Online.

ACKNOWLEDGEMENTS

We thank M. Seringhaus for helpful discussions. M.G. acknowledges support from NIH grant P20 LM07253-01. H.J.B. was partly supported by National Institutes of Health Grant 1P20LM007276-01.

REFERENCES

1. Sharp,P.M. and Li,W.H. (1987) The codon adaptation index--a measure of directional synonymous codon usage bias and its potential applications. Nucleic Acids Res., 15, 1281–1295.
2. Ikemura,T. (1981) Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. J. Mol. Biol., 151, 389–409.
3. Karlin,S., Mrazek,J. and Campbell,A.M. (1998) Codon usages in different gene classes of the Escherichia coli genome. Mol. Microbiol., 29, 1341–1355.
4. Karlin,S., Campbell,A.M. and Mrazek,J. (1998) Comparative DNA analysis across diverse genomes. Annu. Rev. Genet., 32, 185–225.
5. Sharp,P.M. and Matassi,G. (1994) Codon usage and genome evolution. Curr. Opin. Genet. Dev., 4, 851–860.
6. Bennetzen,J.L. and Hall,B.D. (1982) Codon selection in yeast. J. Biol. Chem., 257, 3026–3031.
7. Karlin,S., Mrazek,J., Campbell,A. and Kaiser,D. (2001) Characterizations of highly expressed genes of four fast-growing bacteria. J. Bacteriol., 183, 5025–5040.
8. Karlin,S. and Mrazek,J. (2000) Predicted highly expressed genes of diverse prokaryotic genomes. J. Bacteriol., 182, 5238–5250.
9. Holstege,F.C., Jennings,E.G., Wyrick,J.J., Lee,T.I., Hengartner,C.J., Green,M.R., Golub,T.R., Lander,E.S. and Young,R.A. (1998) Dissecting the regulatory circuitry of a eukaryotic genome. Cell, 95, 717–728.
10. Jelinsky,S.A. and Samson,L.D. (1999) Global response of Saccharomyces cerevisiae to an alkylating agent. Proc. Natl Acad. Sci. USA, 96, 1486–1491.
11. Roth,F.P., Hughes,J.D., Estep,P.W. and Church,G.M. (1998) Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat. Biotechnol., 16, 939–945.
12. Velculescu,V.E., Zhang,L., Zhou,W., Vogelstein,J., Basrai,M.A., Bassett,D.E.,Jr, Hieter,P., Vogelstein,B. and Kinzler,K.W. (1997) Characterization of the yeast transcriptome. Cell, 88, 243–251.
13. Gygi,S.P., Rochon,Y., Franza,B.R. and Aebersold,R. (1999) Correlation between protein and mRNA abundance in yeast. Mol. Cell. Biol., 19, 1720–1730.
14. Futcher,B., Latter,G.I., Monardo,P., McLaughlin,C.S. and Garrels,J.I. (1999) A sampling of the yeast proteome. Mol. Cell. Biol., 19, 7357–7368.
15. Greenbaum,D., Jansen,R. and Gerstein,M. (2002) Analysis of mRNA expression and protein abundance data: an approach for the comparison of the enrichment of features in the cellular population of proteins and transcripts. Bioinformatics, 18, 585–596.
16. Dobson,J.D. (1999) Applied Multivariate Data Analysis. Volume I: Regression and Experimental Design. Springer.
17. Bussemaker,H.J., Li,H. and Siggia,E.D. (2001) Regulatory element detection using correlation with expression. Nature Genet., 27, 167–171.
18. Jansen,R. and Gerstein,M. (2000) Analysis of the yeast transcriptome with structural and functional categories: characterizing highly expressed proteins. Nucleic Acids Res., 28, 1481–1488.
19. Coghlan,A. and Wolfe,K.H. (2000) Relationship of codon bias to mRNA concentration and protein length in Saccharomyces cerevisiae. Yeast, 16, 1131–1145.
20. Pavesi,A. (1999) Relationships between transcriptional and translational control of gene expression in Saccharomyces cerevisiae: a multiple regression analysis. J. Mol. Evol., 48, 133–141.
21. Konopka,A. (1984) Is the information content of DNA evolutionarily significant? J. Theor. Biol., 107, 697–705.
22. Kumar,A., Harrison,P.M., Cheung,K.H., Lan,N., Echols,N., Bertone,P., Miller,P., Gerstein,M. and Snyder,M. (2002) An integrated approach for finding overlooked genes in yeast. Nat. Biotechnol., 20, 58–63.
23. Horak,S.E., Luscombe,N.M., Qian,J., Bertone,P., Piccirrillo,S., Gerstein,M. and Snyder,M. (2002) Complex transcriptional circuitry at the G1/S transition in Saccharomyces cerevisiae. Genes Dev., 16, 3017–3033.
24. Lee,T.I., Rinaldi,N.J., Robert,F., Odom,D.T., Bar-Joseph,Z., Gerber,G.K., Hannett,N.M., Harbison,C.T., Thompson,C.M., Simon,I. et al. (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 298, 763–764.
25. Gygi,S.P., Corthals,G.L., Zhang,Y., Rochon,Y. and Aebersold,R. (2000) Evaluation of two-dimensional gel electrophoresis-based proteome analysis technology. Proc. Natl Acad. Sci. USA, 97, 9390–9395.
