
Greenbaum et al.
INTRODUCTION
With the recent popularity of high-throughput experimentation, biologists have begun to create a large inventory of scientific data (Claverie 1999; Einarson & Golemis 2000; Epstein & Butow 2000; Shapiro & Harris 2000). Much of this has come from expression experiments, partially fueled by the advent and continuous evolution of the microarray and Gene Chip systems. These experiments allow for large-scale, comprehensive scans of gene expression within the cell (Schena et al. 1995; Eisen & Brown 1999; Ferea & Brown 1999; Lipshutz 1999). Expression data sets are currently the single richest source of information in genomics, and for yeast, expression information now dwarfs that in the sequence alone. However, "theory" has not kept up with experimentation in this area, and how best to interpret the vast amount of data generated by these experiments is still a very open question (Bassett et al. 1996; Wittes & Friedman 1999; Zhang 1999; Gerstein & Jansen 2000; Searls 2000; Sherlock 2000).
Genome-wide experimentation has also been used to directly measure the cellular population of proteins (protein abundance) (Anderson & Seilhamer 1997; Futcher et al. 1999; Gygi et al. 1999; Ross-Macdonald et al. 1999). Understanding how protein abundance is related to mRNA transcript levels is essential for interpreting gene expression and also, more generally, for understanding the interactions, structures and functions in a cellular system (Hatzimanikatis et al. 1999). Moreover, as protein concentration, rather than transcript population, is the more relevant variable with respect to enzyme activity, it is this quantity that connects genomics to the physical chemistry and dynamics of the cell (Kidd et al. 2001). Finally, protein abundance levels may become invaluable for diagnostic methods as well as for determining new drug targets (Corthals 2000).
High-throughput two-dimensional gel electrophoresis (2-DE), in conjunction with mass spectrometry, has been used to identify proteins that can then be quantified to determine protein abundance (Futcher et al. 1999; Gygi et al. 1999; Harry et al. 2000). Other technologies include random integration of reporter transposons in yeast (Ross-Macdonald et al. 1999) and adaptation of the microarray concept for use with proteins (Lopez 2000; MacBeath & Schreiber 2000; Nelson et al. 2000; Zhu et al. 2000).
Gene expression is indirectly related to cellular protein abundance through the process of translation. The cell connects mRNA expression and protein abundance through translational control, which is primarily regulated at the initiation of translation (Lindahl & Hinnebusch 1992; Jackson & Wickens 1997; Day & Tuite 1998; McCarthy 1998). Much of this control is the result of multiple cis-acting elements in the mRNA (Jacobs Anderson & Parker 2000). There are large non-coding regions in each mRNA species devoted to regulation of that mRNA as well as its stability and degradation properties, including 5′ and 3′ UTRs, uORFs and uAUGs (Vilela et al. 1998; Vilela et al. 1999; Morris & Geballe 2000).
Previously, we surveyed the population of protein features, such as folds, amino acid composition, and functions, in yeast and in a number of the other recently sequenced genomes (Gerstein 1997; Gerstein 1998; Gerstein 1998; Gerstein & Hegyi 1998; Hegyi & Gerstein 1999; Das & Gerstein 2000; Lin & Gerstein 2000). Others have also done related work (Frishman & Mewes 1997; Tatusov et al. 1997; Jones 1998; Wallin & von Heijne 1998; Frishman & Mewes 1999; Wolf et al. 1999). Recently, we extended this concept to compare the population of features