Gerstein Lab Publications


In 2007, we made significant progress toward our research goals. We continue to focus on genome annotation, network analysis, and macromolecular motions, and we have developed new tools to process the varied data arising from functional genomics efforts. We briefly describe some of the results below.

As part of the ENCODE consortium, we worked on several computational aspects of identifying functional elements in the ENCODE regions, a representative 1% of the human genome (Birney et al, 2007). One example is the identification of STAT1 binding regions using both ChIP-chip and ChIP-PET technologies, a study that provides information about the scoring and validation of ChIP-based technologies (Euskirchen et al, 2007). We have also statistically analyzed the genomic distribution of transcriptional regulatory sites, utilizing the wealth of data generated by the ENCODE project (Zhang et al, 2007a). We developed a tool for classifying transcriptionally active regions (TARs) using features such as expression levels, genomic location relative to genes, sequence composition, and phylogenetic conservation (Rozowsky et al, 2007). This tool will aid in the design of further experiments to understand the function of the many RNAs transcribed in human cells.

In genome annotation more broadly, we have worked extensively on several aspects of genome analysis. We developed a high-throughput, massive paired-end mapping sequencing method and a computational pipeline to identify large-scale structural variations in the human genome, that is, variations of ~3 kb or larger (Korbel et al, 2007a). We characterized interspecies diversity in gene regulation through a ChIP-chip analysis of the transcription factors Ste12 and Tec1 in three yeast species, together with an evolutionary comparison of the binding sites of these conserved transcription factors across the three species (Borneman et al, 2007a). Despite the close sequence similarity of these species, the analysis showed extensive binding-site differences, which probably explain the differences between these organisms, each adapted to its distinct ecological niche. Both papers were published in Science.

On the pseudogene front, we have updated the pseudogene.org database (Karro et al, 2007). The database provides tools for comparing sets and creating layered set unions, enabling researchers to derive a current 'consensus' set of pseudogenes. It now includes pseudogene data for 64 prokaryotes and 11 eukaryotes, and new genomes are added as new data are obtained. We have also carefully annotated pseudogenes in the ENCODE regions and studied the prevalence of transcription in pseudogenes using carefully designed RACE experiments. These results, in conjunction with tiling array data and high-throughput sequencing, indicate that a fifth of the pseudogenes are transcriptionally active (Zheng et al, 2007a).

We have published several papers on network analysis and highlight a few here. Based on a network analysis of Hsp90 and its interacting proteins, we characterized proteins that need help in folding relative to the background (McClellan et al, 2007). Graph-theoretic analysis of Hsp90 targets in relation to the yeast target network, combined with GO functional annotation, suggested a role for Hsp90 in cellular trafficking and transport as well as in cell cycle regulation, and allowed us to identify potential targets for experimental studies. Lu et al (2007) compare classical pathways and networks and find that current definitions do not adequately convey all the information in pathways. We propose a prototype edge ontology to overcome this deficiency in the current edge representation. Finally, in Kim et al (2007), we showed that positively selected proteins tend to be found on the network periphery, suggesting how human variation is arranged with respect to the interactome.

Using the Molecular Motions Database, we have created a Hinge Atlas of manually annotated hinges and a statistical formalism for calculating the enrichment of various types of residues in these hinges (Flores et al, 2007). We have also developed FlexOracle, a new hinge prediction method for studying hinge motions in proteins (Flores and Gerstein, 2007). It is based on the idea that energetic interactions are stronger within structural domains than between them, and that the fragments generated by cleaving the protein at the hinge site are independently stable.
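As a rough illustration of the idea that interactions are weaker between domains than within them, the sketch below scores every possible cut point of a toy chain by the total interaction energy that crosses the cut and picks the cut with the weakest coupling. This is not the published FlexOracle implementation: the function names (`cross_cut_energy`, `predict_hinge`, `toy_energy`), the pairwise energy matrix, and the energy values are all hypothetical, and a real method would use a proper energy function on an actual structure.

```python
from typing import List, Sequence


def cross_cut_energy(energy: Sequence[Sequence[float]], cut: int) -> float:
    """Sum the pairwise energies between residues [0, cut) and [cut, n)."""
    n = len(energy)
    return sum(energy[i][j] for i in range(cut) for j in range(cut, n))


def predict_hinge(energy: Sequence[Sequence[float]], min_fragment: int = 2) -> int:
    """Return the cut index whose two fragments interact most weakly.

    Favourable interactions are negative here (an assumption of this toy
    model), so the weakest coupling is the cut with the largest
    (closest to zero) cross-fragment energy.
    """
    n = len(energy)
    cuts = range(min_fragment, n - min_fragment + 1)
    return max(cuts, key=lambda c: cross_cut_energy(energy, c))


def toy_energy(n: int, domain_boundary: int) -> List[List[float]]:
    """Build a hypothetical energy matrix for a two-domain chain.

    Residue pairs in the same domain get -1.0; pairs in different
    domains get -0.1. These values are illustrative only.
    """
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            same_domain = (i < domain_boundary) == (j < domain_boundary)
            matrix[i][j] = -1.0 if same_domain else -0.1
    return matrix


energy = toy_energy(10, 4)
hinge = predict_hinge(energy)
print("predicted hinge before residue index", hinge)
```

In this toy chain the two domains meet between residues 3 and 4, so cutting there severs only the weak inter-domain contacts, whereas any other cut also breaks strong intra-domain contacts.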

Finally, we have published several papers on more efficient ways of storing, retrieving, and disseminating information. We have developed LinkHub, a Semantic Web-based system that facilitates cross-database queries and information retrieval in proteomics (Smith et al, 2007b). Given the deluge of data from genome-scale experiments, it is evident that the traditional way of disseminating information, the published article, is neither efficient nor complete. We suggest improvements in data dissemination and propose a new information architecture with an expansive central index (Seringhaus and Gerstein, 2007). Articles suitable for digital parsing, together with structured digital abstracts, will improve text-mining capabilities and bridge the gap between the traditional route of data sharing through published papers and huge data dumps in databases.


Semantic Web Approach to Database Integration in the Life Sciences
KH Cheung, AK Smith, KYL Yip, CJO Baker, MB Gerstein (2007). in Semantic Web: Revolutionizing Knowledge Discovery in the Life Sciences (eds. C Baker and K Cheung, Springer, NY), pp. 11-30
website
preprint
 

Semantic Web Standards: Legal and Social Issues and Implications
D Greenbaum, M Gerstein (2007). in Semantic Web: Revolutionizing Knowledge Discovery in the Life Sciences (eds. C Baker and K Cheung, Springer, NY), pp. 413-433
website
preprint
 

Positive selection at the protein network periphery: evaluation in terms of structural constraints and cellular context.
PM Kim, JO Korbel, MB Gerstein (2007). Proc Natl Acad Sci U S A 104: 20274-9.
website
preprint
medline

Integrative microarray analysis of pathways dysregulated in metastatic prostate cancer.
SR Setlur, TE Royce, A Sboner, JM Mosquera, F Demichelis, MD Hofer, KD Mertz, M Gerstein, MA Rubin (2007). Cancer Res 67: 10296-303.
 
preprint
medline

Leveraging the structure of the Semantic Web to enhance information retrieval for proteomics.
A Smith, K Cheung, M Krauthammer, M Schultz, M Gerstein (2007). Bioinformatics 23: 3073-9.
website
preprint
medline

Diverse cellular functions of the Hsp90 molecular chaperone uncovered using systems approaches.
AJ McClellan, Y Xia, AM Deutschbauer, RW Davis, M Gerstein, J Frydman (2007). Cell 131: 121-35.
 
preprint
medline

Paired-end mapping reveals extensive structural variation in the human genome.
JO Korbel, AE Urban, JP Affourtit, B Godwin, F Grubert, JF Simons, PM Kim, D Palejev, NJ Carriero, L Du, BE Taillon, Z Chen, A Tanzer, AC Saunders, J Chi, F Yang, NP Carter, ME Hurles, SM Weissman, TT Harkins, MB Gerstein, M Egholm, M Snyder (2007). Science 318: 420-6.
website
preprint
medline

PARE: a tool for comparing protein abundance and mRNA expression data.
EZ Yu, AE Burba, M Gerstein (2007). BMC Bioinformatics 8: 309.
website
preprint
medline

Divergence of transcription factor binding sites across related yeast species.
AR Borneman, TA Gianoulis, ZD Zhang, H Yu, J Rozowsky, MR Seringhaus, LY Wang, M Gerstein, M Snyder (2007). Science 317: 815-9.
website
preprint
medline

The minimum information required for reporting a molecular interaction experiment (MIMIx).
S Orchard, L Salwinski, S Kerrien, L Montecchi-Palazzi, M Oesterheld, V Stumpflen, A Ceol, A Chatr-aryamontri, J Armstrong, P Woollard, JJ Salama, S Moore, J Wojcik, GD Bader, M Vidal, ME Cusick, M Gerstein, AC Gavin, G Superti-Furga, J Greenblatt, J Bader, P Uetz, M Tyers, P Legrain, S Fields, N Mulder, M Gilson, M Niepmann, L Burgoon, J De Las Rivas, C Prieto, VM Perreau, C Hogue, HW Mewes, R Apweiler, I Xenarios, D Eisenberg, G Cesareni, H Hermjakob (2007). Nat Biotechnol 25: 894-8.
 
preprint
medline

Toward a universal microarray: prediction of gene expression through nearest-neighbor probe sequence identification.
TE Royce, JS Rozowsky, MB Gerstein (2007). Nucleic Acids Res 35: e99.
 
preprint
medline

Transcription factor binding site identification in yeast: a comparison of high-density oligonucleotide and PCR-based microarray platforms.
AR Borneman, ZD Zhang, J Rozowsky, MR Seringhaus, M Gerstein, M Snyder (2007). Funct Integr Genomics 7: 335-45.
 
preprint
medline

FlexOracle: predicting flexible hinges by identification of stable domains.
SC Flores, MB Gerstein (2007). BMC Bioinformatics 8: 215.
website
preprint
medline

Comparing classical pathways and modern networks: towards the development of an edge ontology.
LJ Lu, A Sboner, YJ Huang, HX Lu, TA Gianoulis, KY Yip, PM Kim, GT Montelione, MB Gerstein (2007). Trends Biochem Sci 32: 320-31.
 
preprint
medline

Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project.
The ENCODE Project Consortium (2007). Nature 447: 799-816.
website
preprint
medline

Mapping of transcription factor binding regions in mammalian cells by ChIP: comparison of array- and sequencing-based technologies.
GM Euskirchen, JS Rozowsky, CL Wei, WH Lee, ZD Zhang, S Hartman, O Emanuelsson, V Stolc, S Weissman, MB Gerstein, Y Ruan, M Snyder (2007). Genome Res 17: 898-909.
 
preprint
medline

Structured RNAs in the ENCODE selected regions of the human genome.
S Washietl, JS Pedersen, JO Korbel, C Stocsits, AR Gruber, J Hackermuller, J Hertel, M Lindemeyer, K Reiche, A Tanzer, C Ucla, C Wyss, SE Antonarakis, F Denoeud, J Lagarde, J Drenkow, P Kapranov, TR Gingeras, R Guigo, M Snyder, MB Gerstein, A Reymond, IL Hofacker, PF Stadler (2007). Genome Res 17: 852-64.
 
preprint
medline

Pseudogenes in the ENCODE regions: consensus annotation, analysis of transcription, and evolution.
D Zheng, A Frankish, R Baertsch, P Kapranov, A Reymond, SW Choo, Y Lu, F Denoeud, SE Antonarakis, M Snyder, Y Ruan, CL Wei, TR Gingeras, R Guigo, J Harrow, MB Gerstein (2007). Genome Res 17: 839-51.
website
preprint
medline

Statistical analysis of the genomic distribution and correlation of regulatory elements in the ENCODE regions.
ZD Zhang, A Paccanaro, Y Fu, S Weissman, Z Weng, J Chang, M Snyder, MB Gerstein (2007). Genome Res 17: 787-97.
website
preprint
medline

The DART classification of unannotated transcription within the ENCODE regions: associating transcription with known and novel loci.
JS Rozowsky, D Newburger, F Sayward, J Wu, G Jordan, JO Korbel, U Nagalakshmi, J Yang, D Zheng, R Guigo, TR Gingeras, S Weissman, P Miller, M Snyder, MB Gerstein (2007). Genome Res 17: 732-45.
website
preprint
medline

Integrated analysis of experimental data sets reveals many novel promoters in 1% of the human genome.
ND Trinklein, U Karaoz, J Wu, A Halees, S Force Aldred, PJ Collins, D Zheng, ZD Zhang, MB Gerstein, M Snyder, RM Myers, Z Weng (2007). Genome Res 17: 720-31.
 
preprint
medline

What is a gene, post-ENCODE? History and updated definition.
MB Gerstein, C Bruce, JS Rozowsky, D Zheng, J Du, JO Korbel, O Emanuelsson, ZD Zhang, S Weissman, M Snyder (2007). Genome Res 17: 669-81.
website
preprint
medline

An efficient pseudomedian filter for tiling microrrays.
TE Royce, NJ Carriero, MB Gerstein (2007). BMC Bioinformatics 8: 186.
website
preprint
medline

Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome.
JO Korbel, AE Urban, F Grubert, J Du, TE Royce, P Starr, G Zhong, BS Emanuel, SM Weissman, M Snyder, MB Gerstein (2007). Proc Natl Acad Sci U S A 104: 10110-5.
website
preprint
medline

Total ancestry measure: quantifying the similarity in tree-like classification, with genomic applications.
H Yu, R Jansen, G Stolovitzky, M Gerstein (2007). Bioinformatics 23: 2163-73.
website
preprint
medline

Global survey of human T leukemic cells by integrating proteomics and transcriptomics profiling.
L Wu, SI Hwang, K Rezaul, LJ Lu, V Mayya, M Gerstein, JK Eng, DH Lundgren, DK Han (2007). Mol Cell Proteomics 6: 1343-53.
 
preprint
medline

Hinge Atlas: relating protein sequence to sites of structural flexibility.
SC Flores, LJ Lu, J Yang, N Carriero, MB Gerstein (2007). BMC Bioinformatics 8: 167.
website
preprint
medline

Tilescope: online analysis pipeline for high-density tiling microarray data.
ZD Zhang, J Rozowsky, HY Lam, J Du, M Snyder, M Gerstein (2007). Genome Biol 8: R81.
website
preprint
medline

Structured digital abstract makes text mining easy.
M Gerstein, M Seringhaus, S Fields (2007). Nature 447: 142.
 
preprint
medline

LinkHub: a Semantic Web system that facilitates cross-database queries and information retrieval in proteomics.
AK Smith, KH Cheung, KY Yip, M Schultz, MB Gerstein (2007). BMC Bioinformatics 8 Suppl 3: S5.
website
preprint
medline

Getting connected: analysis and principles of biological networks.
X Zhu, M Gerstein, M Snyder (2007). Genes Dev 21: 1010-24.
 
preprint
medline

RNAi development.
M Gerstein, SM Douglas (2007). PLoS Comput Biol 3: e80.
website
preprint
medline

The importance of bottlenecks in protein networks: correlation with gene essentiality and expression dynamics.
H Yu, PM Kim, E Sprecher, V Trifonov, M Gerstein (2007). PLoS Comput Biol 3: e59.
website
preprint
medline

Assessing the need for sequence-based normalization in tiling microarray experiments.
TE Royce, JS Rozowsky, MB Gerstein (2007). Bioinformatics 23: 988-97.
website
preprint
medline

The ambiguous boundary between genes and pseudogenes: the dead rise up, or do they?
D Zheng, MB Gerstein (2007). Trends Genet 23: 219-24.
website
preprint
medline

Global identification and characterization of transcriptionally active regions in the rice genome.
L Li, X Wang, R Sasidharan, V Stolc, W Deng, H He, J Korbel, X Chen, W Tongprasit, P Ronald, R Chen, M Gerstein, XW Deng (2007). PLoS One 2: e294.
website
preprint
medline

Differential binding of calmodulin-related proteins to their targets revealed through high-density Arabidopsis protein microarrays.
SC Popescu, GV Popescu, S Bachan, Z Zhang, M Seay, M Gerstein, M Snyder, SP Dinesh-Kumar (2007). Proc Natl Acad Sci U S A 104: 4730-5.
website
preprint
medline

New insights into Acinetobacter baumannii pathogenesis revealed by high-density pyrosequencing and transposon mutagenesis.
MG Smith, TA Gianoulis, S Pukatzki, JJ Mekalanos, LN Ornston, M Gerstein, M Snyder (2007). Genes Dev 21: 601-14.
 
preprint
medline

Comparative analysis of genome tiling array data reveals many novel primate-specific functional RNAs in human.
Z Zhang, AW Pang, M Gerstein (2007). BMC Evol Biol 7 Suppl 1: S14.
 
preprint
medline

Publishing perishing? Towards tomorrow's information architecture.
MR Seringhaus, MB Gerstein (2007). BMC Bioinformatics 8: 17.
 
preprint
medline

Chemistry Nobel rich in structure.
M Seringhaus, M Gerstein (2007). Science 315: 40-1.
 
preprint
medline

Positional artifacts in microarrays: experimental verification and construction of COP, an automated detection tool.
H Yu, K Nguyen, T Royce, J Qian, K Nelson, M Snyder, M Gerstein (2007). Nucleic Acids Res 35: e8.
website
preprint
medline

Assessing the performance of different high-density tiling microarray strategies for mapping transcribed regions of the human genome.
O Emanuelsson, U Nagalakshmi, D Zheng, JS Rozowsky, AE Urban, J Du, Z Lian, V Stolc, S Weissman, M Snyder, MB Gerstein (2007). Genome Res 17: 886-97.
website
preprint
medline

Pseudogene.org: a comprehensive database and comparison platform for pseudogene annotation.
JE Karro, Y Yan, D Zheng, Z Zhang, N Carriero, P Cayting, P Harrison, M Gerstein (2007). Nucleic Acids Res 35: D55-60.
website
preprint
medline

An interdepartmental Ph.D. program in computational biology and bioinformatics: the Yale perspective.
M Gerstein, D Greenbaum, K Cheung, PL Miller (2007). J Biomed Inform 40: 73-9.
 
preprint
medline
