Greenbaum et al 
the comparison of individual genes.  Previous analyses have shown that differences between 
mRNA expression and protein abundance level can be quite dramatic for individual genes. This 
may be due either to noise in the data or to fundamental biological processes. However, our 
analyses show that the variation between transcriptome and translatome is much smaller for global 
properties that are computed by averaging over the properties of many individual genes. 
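
The effect of averaging can be illustrated with a minimal sketch on purely synthetic data (not the data sets analyzed here). It assumes that each gene has a true level shared by its mRNA and protein measurements, that both measurements carry independent Gaussian noise, and that genes fall into hypothetical categories with different base levels. The function names, category count, and noise levels are all illustrative assumptions.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def simulate(n_categories=20, genes_per_category=50, noise_sd=2.0, seed=0):
    """Generate synthetic mRNA and protein levels (all values are assumptions)."""
    rng = random.Random(seed)
    mrna, protein, category = [], [], []
    for c in range(n_categories):
        base = rng.uniform(0.0, 5.0)  # hypothetical category-wide level
        for _ in range(genes_per_category):
            true_level = base + rng.gauss(0.0, 1.0)
            # Each measurement adds its own independent noise to the true level.
            mrna.append(true_level + rng.gauss(0.0, noise_sd))
            protein.append(true_level + rng.gauss(0.0, noise_sd))
            category.append(c)
    return mrna, protein, category


def category_means(values, category):
    """Average the per-gene values within each category."""
    sums, counts = {}, {}
    for v, c in zip(values, category):
        sums[c] = sums.get(c, 0.0) + v
        counts[c] = counts.get(c, 0) + 1
    return [sums[c] / counts[c] for c in sorted(sums)]


mrna, protein, category = simulate()
gene_r = pearson(mrna, protein)
category_r = pearson(category_means(mrna, category),
                     category_means(protein, category))
print(f"per-gene r = {gene_r:.2f}, per-category r = {category_r:.2f}")
```

Under these assumptions the per-category correlation exceeds the per-gene correlation, because independent measurement noise largely cancels when many genes are averaged. The sketch says nothing about how much of the real per-gene discrepancy is noise rather than biology.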
 
METHODS 
Data Sources Used  
For our analysis we compiled many diverse data sets, representing protein abundance and mRNA 
expression experiments as well as other sources of genome annotation. These are all summarized in 
Table 1. Briefly, they included two protein abundance sets, measured via two-dimensional gel 
electrophoresis and mass spectrometry.  We termed these 2-DE #1 (Gygi et al. 1999) and 2-DE #2 
(Futcher et al. 1999).  These sets, although small in comparison with expression 
data sets, represent the largest amount of protein abundance information publicly available at 
present. We also applied our methodology, with limited success, to the semi-quantitative transposon 
insertion data set, which measures the LacZ expression of fusion proteins (Ross-Macdonald et al. 
1999).  Although this set contains many more genes than either of the gel electrophoresis sets, and 
is thus an appealing source of protein abundance information, the more qualitative nature of the 
data makes comparisons with other data sets difficult. 
 
Our mRNA expression data came from multiple laboratories that used either Gene Chip or SAGE 
technology.  The Gene Chip sets included the Young Expression Set (Holstege et al. 1998), the 
Church Expression Set (Roth et al. 1998) and the Samson Expression Set (Jelinsky & Samson 
1999).  We used data representing the vegetative state of yeast from all of the above experiments. 
We also compiled two reference sets to be used in our comparisons, one for protein abundance and 
another for mRNA expression (summarized below).  Finally, we used many different types of 
genome annotation in our analysis, which are summarized in Table 1.  In particular, the Munich 
Information Center for Protein Sequences (MIPS), a site containing a large number of databases 
(Mewes et al. 2000), proved to be an invaluable source of data, particularly with regard to functional 
categories. 

Biases in the Data 
There is a caveat to the use of data from high-throughput experimentation (i.e. microarrays and 
two-dimensional gel electrophoresis).  All high-throughput expression studies face 
the difficulty of maintaining consistent biological and processing conditions across the assay.  
Moreover, the databases that annotate the specific genes may not always be accurate (Ishii et al. 
2000).  Gene Chip experiments suffer from cross-hybridization and from the saturation of probes 
for highly expressed genes.  SAGE data is not always reliable for assessing ORFs with low 
expression levels.  With regard to 2D gels, although the technology has undergone many 
improvements since its introduction over a quarter century ago (Klose 1975; O'Farrell 1975), there 
remain many aspects of the procedure that introduce biases into the data.  These include the 
inability to resolve membrane proteins (approximately 30% of the genome) and basic proteins 
(Gerstein 1998; Krogh et al. 2001).  Moreover, there exist some biases in the data that, as in any 
compilation, reflect the tendencies of the investigator.  These include the lack of low abundance