Gerstein Lab Publications


Gerstein lab subproject of P20 LM07253-01, PI Perry Miller, 9/15/01 to 9/14/04, NIH/NLM

P20 LM07253-01 (PI Miller)
9/15/01 - 8/31/04
Planning a Biomedical Computing Center of Excellence
Role: subproject leader

The overall grant is a pilot pre-center grant. The Gerstein lab is responsible for a small subproject that uses integrative data mining to predict which proteins in a genome are highly expressed.

Biological Objective: Determine Characteristics of Abundant Proteins

The biological objective of this subproject is to understand the characteristics of highly expressed proteins (e.g. function, composition, and structure) and, if possible, to predict the expression level of a protein from these characteristics. Such predictions may be practically useful for large-scale proteomics projects.

Computational Objective: Integrative Datamining

The computational objective is to address the above question through integrative database analysis and data mining in a fashion that seamlessly connects disparate information resources related to gene expression, functional annotation, regulation, genome sequence, and protein structure. More specifically, we hope to: (i) Develop two linked database systems, one for yeast expression data and the other for information related to other protein features. (ii) Identify the features most related to expression by extensively cross-referencing the databases and employing a simple enrichment formalism. (iii) Apply data mining and machine learning to the relevant features to see whether a predictive algorithm for expression level can be developed. For the data mining, we will try decision trees first because of the simplicity of the resulting rules, and then move to Bayesian networks as a second option.
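
As an illustration of step (ii), the sketch below cross-references a small expression table with a feature table by gene identifier and computes a relative enrichment score. The text above does not specify the formalism, so the definition used here is an assumption: the expression-weighted fraction of genes carrying a feature is compared with the plain fraction of genes in the genome carrying it. The function name, gene identifiers, and numbers are all hypothetical.

```python
from typing import Dict, Set


def feature_enrichment(expression: Dict[str, float], has_feature: Set[str]) -> float:
    """Relative enrichment of a feature in the expressed population vs. the genome.

    Assumed definition (not taken from the grant text):
        (weighted_fraction - genome_fraction) / genome_fraction
    where genome_fraction counts each gene once and weighted_fraction
    weights each gene by its expression level.
    """
    if not expression:
        raise ValueError("expression table is empty")
    total_expression = sum(expression.values())
    if total_expression <= 0:
        raise ValueError("total expression must be positive")

    # Cross-reference: only genes present in the expression table are counted.
    genes_with_feature = [g for g in expression if g in has_feature]
    genome_fraction = len(genes_with_feature) / len(expression)
    if genome_fraction == 0:
        raise ValueError("no gene in the expression table has this feature")

    weighted_fraction = sum(expression[g] for g in genes_with_feature) / total_expression
    return (weighted_fraction - genome_fraction) / genome_fraction


# Hypothetical toy data: expression levels keyed by made-up gene identifiers.
toy_expression = {"geneA": 60.0, "geneB": 20.0, "geneC": 10.0, "geneD": 10.0}
# Hypothetical feature annotation: genes carrying some feature of interest.
toy_feature = {"geneA", "geneB"}

score = feature_enrichment(toy_expression, toy_feature)
print(f"relative enrichment: {score:.2f}")
```

A positive score means the feature is over-represented among highly expressed genes relative to its frequency in the genome, and a negative score means it is depleted. Features with large absolute scores would be natural inputs to the decision-tree step in (iii).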

Articles funded by this grant:
Revisiting the codon adaptation index from a whole-genome perspective: analyzing the relationship between gene expression and codon occurrence in yeast using a variety of models.
R Jansen, HJ Bussemaker, M Gerstein (2003). Nucleic Acids Res 31: 2242-51.

Analysis of mRNA expression and protein abundance data: an approach for the comparison of the enrichment of features in the cellular population of proteins and transcripts.
D Greenbaum, R Jansen, M Gerstein (2002). Bioinformatics 18: 585-96.
