Greenbaum et al 
 
Enrichment of Features 
Formalism 
Figure 2 focuses on individual proteins. In the next part of our analysis, we group proteins 
into categories based on common features and characterize those features that are enriched in 
one population relative to another, i.e. the translatome population of proteins as measured by 
2D gels relative to the transcriptome population of transcripts or the genome population of 
genes.  To this end, we set up a formalism that can be applied uniformly to all the attributes 
of interest. Because of experimental limitations, the translatome, transcriptome, and genome 
populations are defined on different sets of genes, and sometimes we want to remove this 
selection bias by forcing them to be compared on exactly the same set of genes.  This is a key 
aspect of our formalism, as presented in Figure 1. 
We call an entity like [w, G] a "population", where G is a set describing a particular 
selection of genes from the genome and w is a vector of weights, one for each element of this 
population. In particular, we focus on three main populations here: 
(i)  [1,GGen] is the population of genes in the genome, all 6280 genes weighted once (w = 1).  
(ii)  [wmRNA, GmRNA] is the observed population of the transcripts in the transcriptome, i.e. 
the 6249 genes in the reference expression set weighted by their reference expression 
value. 
(iii)  [wProt, GProt] is the observed cellular population of the proteins in the translatome, i.e. 
the 181 genes in the reference abundance set weighted by their reference abundance 
value. 
(The set of genes in the genome GGen is approximately equal to the genes in set GmRNA, such that we 
can use both symbols interchangeably.) We can also use this notation to describe specific 
experiments -- e.g. [wlacZ, GlacZ] describes the gene set and weights relating to the Transposon 
Abundance set.  
 
Furthermore, we define Fj as the value of a feature F in ORF j.  For example, F could be the 
composition of leucine (a real number) or a binary value (0 or 1) indicating whether an ORF 
contains a trans-membrane segment. Given these definitions, the weighted average of feature F in 
population [w, G] is: 
 
 
 
$$\mu(F, [w, G]) \equiv \frac{\sum_{j \in G} w_j F_j}{\sum_{j \in G} w_j}$$
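
As a minimal sketch of the weighted average defined above, the following Python function computes it for a population given as a set of genes and a mapping of weights. The function name, the ORF identifiers, and all numeric values are hypothetical and chosen only for illustration; they are not taken from the data sets described here.

```python
def weighted_average(feature, weights, genes):
    """Weighted average of a feature F over a population [w, G].

    feature: dict mapping ORF id -> feature value F_j
    weights: dict mapping ORF id -> weight w_j
    genes:   iterable of ORF ids making up the set G
    """
    genes = list(genes)
    total_weight = sum(weights[j] for j in genes)
    if total_weight == 0:
        raise ValueError("population has zero total weight")
    return sum(weights[j] * feature[j] for j in genes) / total_weight


# Hypothetical toy data: a binary feature (e.g. has a trans-membrane segment).
feature = {"orfA": 1, "orfB": 0, "orfC": 1}
genes = ["orfA", "orfB", "orfC"]

# Genome-like population: every gene weighted once (w = 1).
genome_weights = {"orfA": 1, "orfB": 1, "orfC": 1}
# Transcriptome-like population: genes weighted by made-up expression values.
mrna_weights = {"orfA": 10.0, "orfB": 30.0, "orfC": 0.0}

mu_genome = weighted_average(feature, genome_weights, genes)
mu_mrna = weighted_average(feature, mrna_weights, genes)
print(mu_genome, mu_mrna)
```

With unit weights the function reduces to the plain fraction of genes carrying the feature; with expression weights, highly expressed genes dominate the average.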
 
The weighted averages of two populations [w, G] and [v, S] can be compared by simply looking at 
their relative difference Δ: 
 
$$\Delta(F, [v, S], [w, G]) = \frac{\mu(F, [v, S]) - \mu(F, [w, G])}{\mu(F, [w, G])}$$
 
 
where v and w are the weights for the sets of ORFs S and G, respectively.  We call Δ the "enrichment" 
of feature F because it indicates whether F is enriched (if Δ is positive) or depleted (if Δ is