In 2007, we made significant progress toward our research goals. We continue to focus on genome annotation, network analysis, and macromolecular motions, and we have developed new tools to process the varied data arising from functional genomics efforts. We briefly describe some of our results below.

As part of the ENCODE consortium, we worked on several computational aspects of identifying functional elements in the ENCODE regions, a representative 1% of the human genome (Birney et al, 2007). One example is the identification of STAT1 binding regions using both ChIP-chip and ChIP-PET technologies, a study that provides information about the scoring and validation of ChIP-based technologies (Euskirchen et al, 2007). We also statistically analyzed the genomic distribution of transcriptional regulatory sites, drawing on the wealth of data generated by the ENCODE projects (Zhang et al, 2007a). We developed a tool for classifying transcriptionally active regions (TARs) using features such as expression level, genomic location relative to genes, sequence composition, and phylogenetic conservation (Rozowsky et al, 2007). This tool will aid in the design of further experiments to understand the function of the many RNAs transcribed in human cells.

In genome annotation more broadly, we have worked on several aspects of genome analysis. We developed a high-throughput paired-end mapping method based on massive sequencing, together with a computational pipeline, to identify large-scale structural variations (~3 kb or larger) in the human genome (Korbel et al, 2007a). We characterized interspecies diversity in gene regulation through a ChIP-chip analysis of the transcription factors Ste12 and Tec1 in three yeast species, with an evolutionary comparison of the binding sites of these conserved transcription factors across the three species (Borneman et al, 2007a). Despite the close sequence similarity of these organisms, the analysis showed extensive binding-site differences, which probably explain the differences that adapt each organism to its distinct ecological niche. Both papers were published in Science.
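
The core idea behind paired-end mapping can be sketched in a few lines: the two ends of a fragment of roughly known length are mapped to the reference, and pairs whose mapped span departs strongly from the expected span hint at a structural variant. The sketch below is illustrative only and is not the published pipeline; the `PairedEnd` record, the `classify_pair` function, and the numeric expected span and cutoff are all assumptions made for the example. It ignores read orientation, so inversions are not handled.

```python
from dataclasses import dataclass


@dataclass
class PairedEnd:
    """Hypothetical record for one mapped pair of fragment ends."""
    chrom: str
    left: int   # leftmost mapped reference coordinate of the pair
    right: int  # rightmost mapped reference coordinate of the pair


def classify_pair(pair: PairedEnd, expected_span: int, cutoff: int) -> str:
    """Label a pair by comparing its span on the reference to the expected span."""
    span = pair.right - pair.left
    if span > expected_span + cutoff:
        # Ends map farther apart than the fragment: sequence is missing in the sample.
        return "deletion"
    if span < expected_span - cutoff:
        # Ends map closer together than the fragment: the sample carries extra sequence.
        return "insertion"
    return "concordant"


# Illustrative values only; the expected span and cutoff are assumptions.
pairs = [
    PairedEnd("chr1", 10_000, 13_000),
    PairedEnd("chr1", 50_000, 58_500),
    PairedEnd("chr1", 90_000, 90_400),
]
labels = [classify_pair(p, expected_span=3_000, cutoff=1_500) for p in pairs]
print(labels)
```

In practice a single discordant pair is weak evidence; a real pipeline would presumably require several independent pairs supporting the same event before calling a variant.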

On the pseudogene front, we have updated our pseudogene database (Karro et al, 2007). It provides tools for comparing sets and creating layered set unions, enabling researchers to derive a current 'consensus' set of pseudogenes. The database now includes pseudogene data for 64 prokaryotes and 11 eukaryotes, and new genomes are added as data become available. We have also carefully annotated pseudogenes in the ENCODE regions and studied the prevalence of pseudogene transcription using carefully designed RACE experiments. These results, in conjunction with tiling array data and high-throughput sequencing, indicate that a fifth of the pseudogenes are transcriptionally active (Zheng et al, 2007a).
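
One possible reading of a "layered set union" is sketched below: annotation sets are ranked by priority, and each lower-priority set contributes only the pseudogenes that do not overlap anything already accepted. This is an interpretation for illustration, not the database's actual implementation; the function names, the `(chrom, start, end)` half-open interval format, and the layer names are all hypothetical.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def layered_union(layers):
    """Merge annotation sets in priority order.

    `layers` is a list of (set_name, intervals) pairs, highest priority first.
    An interval is kept only if it overlaps nothing kept from an earlier layer
    (or earlier in the same layer). Returns (interval, source_set) pairs.
    """
    consensus = []
    for name, intervals in layers:
        for interval in intervals:
            if not any(overlaps(interval, kept) for kept, _ in consensus):
                consensus.append((interval, name))
    return consensus


# Hypothetical annotation sets, highest priority first.
layers = [
    ("manual", [("chr1", 100, 200)]),
    ("pipelineA", [("chr1", 150, 250), ("chr1", 300, 400)]),
    ("pipelineB", [("chr2", 100, 200), ("chr1", 350, 450)]),
]
consensus = layered_union(layers)
for interval, source in consensus:
    print(source, interval)
```

Recording the source set alongside each interval keeps the provenance of every entry in the consensus, which is useful when the sets disagree.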

We published several papers on network analysis and highlight a few here. Based on a network analysis of Hsp90 and its interacting proteins, we characterized proteins that need help in folding relative to the background (McClellan et al, 2007). Graph-theoretic analysis of Hsp90 targets in relation to the yeast target network, combined with GO functional annotation, suggested a role for Hsp90 in cellular trafficking and transport as well as in cell cycle regulation, and allowed us to identify potential targets for experimental studies. In Lu et al (2007), we compared classical pathways with networks and found that current edge definitions do not convey all of the information in pathways; we propose a prototype edge ontology to overcome this deficiency. Finally, in Kim et al (2007), we showed that positively selected proteins tend to lie on the network periphery, suggesting how human variation is arranged with respect to the interactome.

Using the Molecular Motions Database, we created the Hinge Atlas, a set of manually annotated hinges, together with a statistical formalism for calculating the enrichment of various residue types in these hinges (Flores et al, 2007). We also developed FlexOracle, a new method for predicting hinges in proteins (Flores and Gerstein, 2007). It is based on the idea that energetic interactions are stronger within structural domains than between them, so that the fragments generated by cleaving the protein at the hinge site are independently stable.
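
As a rough illustration of what testing for residue enrichment in hinges can look like, the sketch below computes a one-sided hypergeometric p-value: the probability of seeing at least the observed number of residues of a given type among the hinge residues if hinge positions were drawn at random. This is a generic over-representation test chosen for the example and may differ from the formalism in the paper; the function name, parameter names, and counts are assumptions.

```python
from math import comb


def enrichment_p_value(total, total_of_type, hinge_total, hinge_of_type):
    """One-sided hypergeometric p-value, P(X >= hinge_of_type).

    total          residues in the whole data set
    total_of_type  residues of the type of interest in the whole data set
    hinge_total    residues annotated as hinge residues
    hinge_of_type  hinge residues that are of the type of interest
    """
    upper = min(total_of_type, hinge_total)
    favourable = sum(
        comb(total_of_type, k) * comb(total - total_of_type, hinge_total - k)
        for k in range(hinge_of_type, upper + 1)
    )
    return favourable / comb(total, hinge_total)


# Toy counts, not data from the Hinge Atlas.
p = enrichment_p_value(total=20, total_of_type=5, hinge_total=4, hinge_of_type=3)
print(f"p = {p:.4f}")
```

A small p-value suggests the residue type occurs in hinges more often than chance would predict; when many residue types are tested, a multiple-testing correction would also be needed.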

Finally, we published several papers on more efficient ways of storing, retrieving, and disseminating data. We developed LinkHub, a Semantic Web-based system that facilitates cross-database queries and information retrieval in proteomics (Smith et al, 2007b). Given the deluge of data from genome-scale experiments, the traditional way of disseminating information, the published article, is neither efficient nor complete. We suggest improvements in data dissemination and propose a new information architecture with an expansive central index (Seringhaus and Gerstein, 2007). Articles suitable for digital parsing, together with structured digital abstracts, will improve text mining and bridge the gap between traditional data sharing through published papers and huge data dumps in databases.

Semantic Web Approach to Database Integration in the Life Sciences
KH Cheung, AK Smith, KYL Yip, CJO Baker, MB Gerstein (2007). in Semantic Web: Revolutionizing Knowledge Discovery in the Life Sciences (eds. C Baker and K Cheung, Springer, NY), pp. 11-30

Semantic Web Standards: Legal and Social Issues and Implications
D Greenbaum, M Gerstein (2007). in Semantic Web: Revolutionizing Knowledge Discovery in the Life Sciences (eds. C Baker and K Cheung, Springer, NY), pp. 413-433

Positive selection at the protein network periphery: evaluation in terms of structural constraints and cellular context.
PM Kim, JO Korbel, MB Gerstein (2007). Proc Natl Acad Sci U S A 104: 20274-9.

Integrative microarray analysis of pathways dysregulated in metastatic prostate cancer.
SR Setlur, TE Royce, A Sboner, JM Mosquera, F Demichelis, MD Hofer, KD Mertz, M Gerstein, MA Rubin (2007). Cancer Res 67: 10296-303.

Leveraging the structure of the Semantic Web to enhance information retrieval for proteomics.
A Smith, K Cheung, M Krauthammer, M Schultz, M Gerstein (2007). Bioinformatics 23: 3073-9.

Diverse cellular functions of the Hsp90 molecular chaperone uncovered using systems approaches.
AJ McClellan, Y Xia, AM Deutschbauer, RW Davis, M Gerstein, J Frydman (2007). Cell 131: 121-35.

Paired-end mapping reveals extensive structural variation in the human genome.
JO Korbel, AE Urban, JP Affourtit, B Godwin, F Grubert, JF Simons, PM Kim, D Palejev, NJ Carriero, L Du, BE Taillon, Z Chen, A Tanzer, AC Saunders, J Chi, F Yang, NP Carter, ME Hurles, SM Weissman, TT Harkins, MB Gerstein, M Egholm, M Snyder (2007). Science 318: 420-6.

PARE: a tool for comparing protein abundance and mRNA expression data.
EZ Yu, AE Burba, M Gerstein (2007). BMC Bioinformatics 8: 309.

Divergence of transcription factor binding sites across related yeast species.
AR Borneman, TA Gianoulis, ZD Zhang, H Yu, J Rozowsky, MR Seringhaus, LY Wang, M Gerstein, M Snyder (2007). Science 317: 815-9.

The minimum information required for reporting a molecular interaction experiment (MIMIx).
S Orchard, L Salwinski, S Kerrien, L Montecchi-Palazzi, M Oesterheld, V Stumpflen, A Ceol, A Chatr-aryamontri, J Armstrong, P Woollard, JJ Salama, S Moore, J Wojcik, GD Bader, M Vidal, ME Cusick, M Gerstein, AC Gavin, G Superti-Furga, J Greenblatt, J Bader, P Uetz, M Tyers, P Legrain, S Fields, N Mulder, M Gilson, M Niepmann, L Burgoon, J De Las Rivas, C Prieto, VM Perreau, C Hogue, HW Mewes, R Apweiler, I Xenarios, D Eisenberg, G Cesareni, H Hermjakob (2007). Nat Biotechnol 25: 894-8.

Toward a universal microarray: prediction of gene expression through nearest-neighbor probe sequence identification.
TE Royce, JS Rozowsky, MB Gerstein (2007). Nucleic Acids Res 35: e99.

Transcription factor binding site identification in yeast: a comparison of high-density oligonucleotide and PCR-based microarray platforms.
AR Borneman, ZD Zhang, J Rozowsky, MR Seringhaus, M Gerstein, M Snyder (2007). Funct Integr Genomics 7: 335-45.

FlexOracle: predicting flexible hinges by identification of stable domains.
SC Flores, MB Gerstein (2007). BMC Bioinformatics 8: 215.

Comparing classical pathways and modern networks: towards the development of an edge ontology.
LJ Lu, A Sboner, YJ Huang, HX Lu, TA Gianoulis, KY Yip, PM Kim, GT Montelione, MB Gerstein (2007). Trends Biochem Sci 32: 320-31.

Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project.
The ENCODE Project Consortium (2007). Nature 447: 799-816.

Mapping of transcription factor binding regions in mammalian cells by ChIP: comparison of array- and sequencing-based technologies.
GM Euskirchen, JS Rozowsky, CL Wei, WH Lee, ZD Zhang, S Hartman, O Emanuelsson, V Stolc, S Weissman, MB Gerstein, Y Ruan, M Snyder (2007). Genome Res 17: 898-909.

Structured RNAs in the ENCODE selected regions of the human genome.
S Washietl, JS Pedersen, JO Korbel, C Stocsits, AR Gruber, J Hackermuller, J Hertel, M Lindemeyer, K Reiche, A Tanzer, C Ucla, C Wyss, SE Antonarakis, F Denoeud, J Lagarde, J Drenkow, P Kapranov, TR Gingeras, R Guigo, M Snyder, MB Gerstein, A Reymond, IL Hofacker, PF Stadler (2007). Genome Res 17: 852-64.

Pseudogenes in the ENCODE regions: consensus annotation, analysis of transcription, and evolution.
D Zheng, A Frankish, R Baertsch, P Kapranov, A Reymond, SW Choo, Y Lu, F Denoeud, SE Antonarakis, M Snyder, Y Ruan, CL Wei, TR Gingeras, R Guigo, J Harrow, MB Gerstein (2007). Genome Res 17: 839-51.

Statistical analysis of the genomic distribution and correlation of regulatory elements in the ENCODE regions.
ZD Zhang, A Paccanaro, Y Fu, S Weissman, Z Weng, J Chang, M Snyder, MB Gerstein (2007). Genome Res 17: 787-97.

The DART classification of unannotated transcription within the ENCODE regions: associating transcription with known and novel loci.
JS Rozowsky, D Newburger, F Sayward, J Wu, G Jordan, JO Korbel, U Nagalakshmi, J Yang, D Zheng, R Guigo, TR Gingeras, S Weissman, P Miller, M Snyder, MB Gerstein (2007). Genome Res 17: 732-45.

Integrated analysis of experimental data sets reveals many novel promoters in 1% of the human genome.
ND Trinklein, U Karaoz, J Wu, A Halees, S Force Aldred, PJ Collins, D Zheng, ZD Zhang, MB Gerstein, M Snyder, RM Myers, Z Weng (2007). Genome Res 17: 720-31.

What is a gene, post-ENCODE? History and updated definition.
MB Gerstein, C Bruce, JS Rozowsky, D Zheng, J Du, JO Korbel, O Emanuelsson, ZD Zhang, S Weissman, M Snyder (2007). Genome Res 17: 669-81.

An efficient pseudomedian filter for tiling microrrays.
TE Royce, NJ Carriero, MB Gerstein (2007). BMC Bioinformatics 8: 186.

Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome.
JO Korbel, AE Urban, F Grubert, J Du, TE Royce, P Starr, G Zhong, BS Emanuel, SM Weissman, M Snyder, MB Gerstein (2007). Proc Natl Acad Sci U S A 104: 10110-5.

Total ancestry measure: quantifying the similarity in tree-like classification, with genomic applications.
H Yu, R Jansen, G Stolovitzky, M Gerstein (2007). Bioinformatics 23: 2163-73.

Global survey of human T leukemic cells by integrating proteomics and transcriptomics profiling.
L Wu, SI Hwang, K Rezaul, LJ Lu, V Mayya, M Gerstein, JK Eng, DH Lundgren, DK Han (2007). Mol Cell Proteomics 6: 1343-53.

Hinge Atlas: relating protein sequence to sites of structural flexibility.
SC Flores, LJ Lu, J Yang, N Carriero, MB Gerstein (2007). BMC Bioinformatics 8: 167.

Tilescope: online analysis pipeline for high-density tiling microarray data.
ZD Zhang, J Rozowsky, HY Lam, J Du, M Snyder, M Gerstein (2007). Genome Biol 8: R81.

Structured digital abstract makes text mining easy.
M Gerstein, M Seringhaus, S Fields (2007). Nature 447: 142.

LinkHub: a Semantic Web system that facilitates cross-database queries and information retrieval in proteomics.
AK Smith, KH Cheung, KY Yip, M Schultz, MK Gerstein (2007). BMC Bioinformatics 8 Suppl 3: S5.

Getting connected: analysis and principles of biological networks.
X Zhu, M Gerstein, M Snyder (2007). Genes Dev 21: 1010-24.

RNAi development.
M Gerstein, SM Douglas (2007). PLoS Comput Biol 3: e80.

The importance of bottlenecks in protein networks: correlation with gene essentiality and expression dynamics.
H Yu, PM Kim, E Sprecher, V Trifonov, M Gerstein (2007). PLoS Comput Biol 3: e59.

Assessing the need for sequence-based normalization in tiling microarray experiments.
TE Royce, JS Rozowsky, MB Gerstein (2007). Bioinformatics 23: 988-97.

The ambiguous boundary between genes and pseudogenes: the dead rise up, or do they?
D Zheng, MB Gerstein (2007). Trends Genet 23: 219-24.

Global identification and characterization of transcriptionally active regions in the rice genome.
L Li, X Wang, R Sasidharan, V Stolc, W Deng, H He, J Korbel, X Chen, W Tongprasit, P Ronald, R Chen, M Gerstein, XW Deng (2007). PLoS One 2: e294.

Differential binding of calmodulin-related proteins to their targets revealed through high-density Arabidopsis protein microarrays.
SC Popescu, GV Popescu, S Bachan, Z Zhang, M Seay, M Gerstein, M Snyder, SP Dinesh-Kumar (2007). Proc Natl Acad Sci U S A 104: 4730-5.

New insights into Acinetobacter baumannii pathogenesis revealed by high-density pyrosequencing and transposon mutagenesis.
MG Smith, TA Gianoulis, S Pukatzki, JJ Mekalanos, LN Ornston, M Gerstein, M Snyder (2007). Genes Dev 21: 601-14.

Comparative analysis of genome tiling array data reveals many novel primate-specific functional RNAs in human.
Z Zhang, AW Pang, M Gerstein (2007). BMC Evol Biol 7 Suppl 1: S14.

Publishing perishing? Towards tomorrow's information architecture.
MR Seringhaus, MB Gerstein (2007). BMC Bioinformatics 8: 17.

Chemistry Nobel rich in structure.
M Seringhaus, M Gerstein (2007). Science 315: 40-1.

Positional artifacts in microarrays: experimental verification and construction of COP, an automated detection tool.
H Yu, K Nguyen, T Royce, J Qian, K Nelson, M Snyder, M Gerstein (2007). Nucleic Acids Res 35: e8.

Assessing the performance of different high-density tiling microarray strategies for mapping transcribed regions of the human genome.
O Emanuelsson, U Nagalakshmi, D Zheng, JS Rozowsky, AE Urban, J Du, Z Lian, V Stolc, S Weissman, M Snyder, MB Gerstein (2007). Genome Res 17: 886-97.

Pseudogene.org: a comprehensive database and comparison platform for pseudogene annotation.
JE Karro, Y Yan, D Zheng, Z Zhang, N Carriero, P Cayting, P Harrison, M Gerstein (2007). Nucleic Acids Res 35: D55-60.

An interdepartmental Ph.D. program in computational biology and bioinformatics: the Yale perspective.
M Gerstein, D Greenbaum, K Cheung, PL Miller (2007). J Biomed Inform 40: 73-9.
