Greenbaum et al 
 
Figure 1,  Schematic overview of the analysis 
On the left side we outline the terms we use to describe the process of gene expression.  The coding 
section of the genome is transcribed into a population of mRNA transcripts called the 
"transcriptome".  The transcripts in turn are translated to a population of proteins; we use the term 
"translatome" for this protein population rather than the alternative "proteome" because the latter 
term may be confounded with the protein complement of the genome (which is not necessarily 
associated with a quantitative abundance level). 
 
The matrix in the middle schematically shows an analysis of the three stages of expression.  In 
general, we define a protein "population" as a set of genes associated with a corresponding number 
of expression or abundance levels ("weights").  In the matrix each row represents a weight and each 
column a gene set.  In particular, we differentiate between the mRNA reference expression set 
(G_mRNA = G_Gen), which essentially covers the complete genome, and the reference protein 
abundance set (G_Prot), which contains the proteins in data sets 2-DE #1 and 2-DE #2 (see Table 1), 
because the protein abundance set is a significantly smaller subset of the genome.  By definition, 
this subset contains only proteins that can be identified by 2-D gel electrophoresis and is therefore 
biased in this sense.  The enrichment figures throughout this paper show, through a comparison of 
the right and left sides of this figure, the effects of the experimental biases of 2-D gels on the 
data set. 
 
Each pie chart represents a composition of a particular protein feature F (for instance, an amino 
acid composition) in a population (represented by the symbol μ).  We can further look at the 
"enrichment" of this feature in one population relative to another (represented by the symbol Δ; 
see section "Methods" for an explanation of the formalism). 
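
As an illustration only, the two quantities named above can be sketched in a few lines of Python.  The exact formalism is given in the section "Methods" and is not reproduced in this legend, so the sketch assumes that the composition of a feature in a population is the weight-averaged feature value and that the enrichment is the relative difference between two such averages.  The function names, gene names and numbers are hypothetical.

```python
def weighted_mean(feature, weights):
    """Weight-averaged value of a per-gene feature over one population.

    feature: dict mapping gene -> feature value F (e.g. fraction of one amino acid)
    weights: dict mapping gene -> expression or abundance level (the "weight")
    Assumption: the population composition is a weighted average of F.
    """
    total = sum(weights.values())
    if total == 0:
        raise ValueError("population has zero total weight")
    return sum(w * feature[gene] for gene, w in weights.items()) / total


def enrichment(feature, weights_from, weights_to):
    """Relative change of the weighted mean from one population to another.

    Assumption: enrichment is (mean_to - mean_from) / mean_from.
    """
    mean_from = weighted_mean(feature, weights_from)
    mean_to = weighted_mean(feature, weights_to)
    return (mean_to - mean_from) / mean_from


# Hypothetical toy data: three genes, one feature value per gene.
feature = {"geneA": 0.10, "geneB": 0.05, "geneC": 0.08}

# Genome population: every gene counted once (uniform weights).
genome = {gene: 1.0 for gene in feature}

# Transcriptome population: weights are mRNA expression levels.
transcriptome = {"geneA": 30.0, "geneB": 10.0, "geneC": 0.0}

genome_to_transcriptome = enrichment(feature, genome, transcriptome)
```

With uniform weights the first population reduces to a plain average over the gene set, which corresponds to the genome column of the matrix; a positive value of genome_to_transcriptome means that the feature is more common among highly expressed genes in this toy example.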
 
For simplification, we neglect the effects of post-transcriptional and post-translational 
modifications that might alter the features of proteins (they affect the expression levels but this is 
largely accounted for by the measurements).  In this study we analyze protein features as they are 
represented in the genome. 
 
Figure 2,  mRNA expression levels vs. protein abundance levels 
 
Part A of this figure shows the reference protein abundance levels plotted against the mRNA 
reference expression levels on a log-log scale; this plot is similar to the one reported earlier by 
Futcher et al. (1999).  The trend line is described by the equation y = 5.20 x^0.61, where y 
represents the protein abundance level (in units of 10^3 copies/cell) and x the mRNA expression 
level (in units of copies/cell).  The dashed lines indicate a distance of 1.85 standard deviations 
(on the log scale) from the trend line.  The outliers beyond the dashed lines are listed in Part B.  
For each of these outlier 
ORFs we show a description of their function and their respective MIPS categories (the numbers 
are defined in Figure 4C).  With one exception, all outliers are associated with cellular organization 
(MIPS category 30).  Those outliers that have a high level of protein abundance relative to the 
expected amount of mRNA expression are dominated by the alcohol and G3P dehydrogenases.  
Translation-related proteins are prominent in the group of those proteins with low protein