The rapid progress in next-generation sequencing technology has led to a data deluge. Genomics research has switched from a hypothesis-driven science to data-driven analyses of high-throughput data. This has provided us with tremendous opportunities to characterize the human genome at multiple levels: DNA (genomic), RNA (transcriptomic) and protein (proteomic). For the first time, biological research is limited by the rate of data analysis rather than the rate of data generation. In order to bridge this gap and extract useful information from this wealth of data, we are involved in large consortium efforts to characterize the genomes of human and model organisms. We continue to develop new methods for assembly, annotation and functional characterization of the genome.


Comprehensive annotation is the cornerstone of genomic analyses. A high-quality annotation set is essential for correct interpretation of functional genomics studies. We have been involved in systematic genome annotation for several years now. Coding regions have historically been the subject of intensive annotation efforts. Nonetheless, the function of some protein-coding genes is still unknown. A commonly used method to infer the function of such genes is to look for functionally well-characterized homologs in other organisms. However, identification of orthologs is not straightforward. We have elaborated on this aspect in a PLoS Computational Biology review (Fang et al., 2010). In this article, we discuss the general procedures used to identify orthologs, focus on the functional analysis of orthologs, and make suggestions for constructing better ortholog groups. Construction of ortholog groups is fundamental to many objectives, such as transferring annotation to newly sequenced genomes and comparing pathways across species.

Most of the human genome consists of DNA that does not code for proteins. The role of such noncoding regions was long neglected, but the vast amounts of functional genomics data now available clearly indicate that noncoding regions have important regulatory roles. We are working on various kinds of noncoding genome annotation. Previously, we focused on the identification and annotation of pseudogenes and developed computational methods such as PeakSeq to identify transcription factor binding sites from ChIP-Seq data. In addition to pseudogene annotation, we are involved in the annotation and functional interpretation of other noncoding regions of the genome. We briefly highlight our various research directions in genome annotation below.

On the pseudogene front, we have identified 76 human-specific pseudogenes, called unitary pseudogenes, that have genic counterparts in other organisms (Zhang et al., 2010a). These represent human-specific gene inactivation events. Based on an analysis of the interrelationship of segmental duplications and pseudogenes, we have also identified 140 novel duplicated pseudogenes, leading to an improved annotation of the 3172 pseudogenes located in segmental duplications (Khurana et al., 2010).

Annotating functional regions in the noncoding genome involves two complementary analysis techniques: comparative analysis, which involves examining DNA sequences, and functional analysis, which involves examining the output of functional genomics experiments. We have comprehensively discussed the annotation of noncoding regions of the genome in a Nature Reviews Genetics article (Alexander et al., 2010).

We have developed FusionSeq, a modular computational framework that identifies fusion transcripts from paired-end RNA-Seq data (Sboner et al., 2010). It includes filters to remove spurious candidate fusions arising from artifacts, such as misalignment or random pairing of transcript fragments, and it ranks candidates according to several statistics. In addition, it has a module to identify exact sequences at breakpoint junctions. We used FusionSeq to detect known and novel fusions in a specially sequenced calibration data set, including eight cancers with and without known rearrangements.
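The core idea of finding fusion candidates from paired-end data can be illustrated with a minimal Python sketch. This is not FusionSeq's implementation: the function name, the `min_support` threshold, and the placeholder gene names are assumptions made for illustration, and the real framework applies many additional filters and ranking statistics.

```python
from collections import Counter

def candidate_fusions(read_pairs, min_support=2):
    """Count read pairs whose two mates map to different genes.

    read_pairs: iterable of (gene_of_mate1, gene_of_mate2); None marks a
    mate that did not map to any annotated gene.
    Returns {(gene_a, gene_b): count} for gene pairs supported by at
    least min_support read pairs (hypothetical threshold).
    """
    counts = Counter()
    for g1, g2 in read_pairs:
        # Skip unannotated mates and pairs that fall within one gene.
        if g1 is None or g2 is None or g1 == g2:
            continue
        # Sort so that (A, B) and (B, A) count as the same candidate.
        counts[tuple(sorted((g1, g2)))] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Toy input with placeholder gene names.
pairs = [
    ("GENE_A", "GENE_B"),
    ("GENE_B", "GENE_A"),
    ("GENE_A", "GENE_A"),
    ("GENE_C", "GENE_D"),
    (None, "GENE_B"),
]
fusions = candidate_fusions(pairs)
```

Requiring more than one supporting pair is a crude stand-in for the artifact filters described above: a single discordant pair is more likely to result from misalignment or random pairing of fragments.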

We are involved in several large-scale annotation projects and describe below the main activities in the year 2010.

1. The 1000 Genomes Project (http://1000genomes.org/)

This is essentially NIH's marquee effort on personal genomics, the sequencing of individual people's genomes. In October 2010, we completed our pilot project on ~200 people, which was published in Nature (1000 Genomes Project Consortium, 2010). We are involved in developing pipelines for analyzing the massive amounts of sequence information as part of the 1000 Genomes consortium. We are extensively involved in three sub-groups within 1000 Genomes: (a) structural variation (SV), (b) loss of function (LOF), and (c) indel calling methods and analyses.

We have pioneered the development of methods for detection and precise mapping of SVs. Given the flood of personal genome sequencing efforts, defining the precise location of SVs at single-nucleotide breakpoint resolution is essential for reconstructing personal genome sequences and inferring the functional impact of SVs. To this end, we assembled a library of breakpoints at nucleotide resolution by collating and standardizing ~2,000 published SVs. We characterized breakpoint sequences with respect to genomic landmarks, chromosomal location, sequence motifs and physical properties. We then developed BreakSeq, a method for detecting SVs by aligning raw reads directly onto the SV breakpoint junctions of the alternative, nonreference alleles contained in our library (Lam et al., 2010a). Applied on a large scale, BreakSeq can be used for genotyping and for determining SV allele frequencies. In addition, we have recently developed several algorithms for SV discovery. These include a Bayesian statistical algorithm for the detection of CNVs from array-CGH data and from whole-genome sequencing data produced by next-generation technologies (Zhang et al., 2010b). Using these different approaches, we built a high-quality SV dataset for the 1000 Genomes Project, which has recently been published in Nature (Mills et al., 2011).
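To make the junction-alignment idea concrete, here is a small Python sketch for the simplest case, a deletion. It is an illustration under stated assumptions rather than BreakSeq itself: exact substring matching stands in for a real read aligner, and the function names, the `flank` length and the `min_overhang` parameter are all hypothetical.

```python
def deletion_junction(reference, del_start, del_end, flank=10):
    """Build the junction sequence of the allele lacking reference[del_start:del_end].

    Returns (junction_sequence, breakpoint_offset), where breakpoint_offset
    is the index in the junction at which the right flank begins.
    """
    left = reference[max(0, del_start - flank):del_start]
    right = reference[del_end:del_end + flank]
    return left + right, len(left)

def supports_junction(read, junction, breakpoint, min_overhang=3):
    """True if the read matches the junction exactly and crosses the
    breakpoint with at least min_overhang bases on each side.

    Only the first occurrence of the read is considered (a simplification).
    """
    pos = junction.find(read)
    if pos < 0:
        return False
    return pos + min_overhang <= breakpoint <= pos + len(read) - min_overhang

# Toy reference: left flank + deleted segment + right flank.
reference = "AACCGTTAGC" + "GGGGGGGG" + "TCATGCAATG"
junction, bp = deletion_junction(reference, del_start=10, del_end=18)

spanning_read = "TTAGCTCATG"    # crosses the breakpoint of the deletion allele
flank_only_read = "AACCGTTAG"   # lies entirely in the left flank
reference_read = "TTAGCGGGGG"   # comes from the undeleted reference allele

calls = [supports_junction(r, junction, bp)
         for r in (spanning_read, flank_only_read, reference_read)]
```

The overhang requirement reflects why junction libraries are useful for genotyping: only reads that cross the breakpoint distinguish the nonreference allele from the reference.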

2. modENCODE (http://www.modencode.org/)

modENCODE is an NIH initiative to identify all functional elements in model organisms. We spearheaded this multi-institute project for the model organism Caenorhabditis elegans. Using a variety of tools and methods, we have annotated and elucidated the functional elements of most of the conserved genome of C. elegans. This work was recently published in Science (Gerstein et al., 2010).


We continue to work on understanding the myriad interactions between proteins and other molecules in the cell and are attempting to decipher the interplay between the different entities to better understand cellular function. To this end, we reconstructed regulatory networks by combining complementary data: deletion data and time series data of gene expression after some initial perturbation (Yip et al., 2010). Comparison between our predicted networks and the actual networks shows that integrating different types of data is an effective way to predict regulatory networks.

We also compared the transcriptional regulatory network of Escherichia coli to the call graph of a canonical operating system (Linux), a network depiction of its code, in terms of topology and evolution (Yan et al., 2010). We showed that even though both networks have a hierarchical layout, the transcriptional regulatory network possesses a few global regulators at the top and many targets at the bottom, whereas the call graph has many regulators controlling a small set of generic functions. This leads to highly overlapping functional modules in the call graph, in contrast to the relatively independent modules in the regulatory network. The process of biological evolution via random mutation and subsequent selection tightly constrains the evolution of regulatory network hubs. The call graph, however, exhibits rapid evolution of its highly connected generic components, made possible by designers' continual fine-tuning. These findings arise from the design principles of the two systems: robustness for biological systems and cost effectiveness (reuse) for software systems.

We analyzed combinatorial regulation based on co-transcription and co-phosphorylation networks across five species ranging from E. coli to human. We found that the number of co-regulatory partnerships follows an exponential saturation curve in relation to the number of targets (Bhardwaj et al., 2010a). We also analyzed the effects of network rewiring events on transcriptional regulatory hierarchies in two species: E. coli and S. cerevisiae (Bhardwaj et al., 2010b). We found that rewiring events that affected upper levels had a more marked effect on cell proliferation rate and survival than those involving lower levels. We showed that the hierarchical level and type of change better reflected the phenotypic effect of rewiring than did the number of changes.
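For readers unfamiliar with the term, an exponential saturation curve rises quickly at first and then levels off at a plateau. The Python sketch below uses the common textbook form plateau * (1 - exp(-rate * x)); this functional form, the function name and the parameter values are assumptions for illustration and are not taken from the paper.

```python
from math import exp

def saturating_partnerships(n_targets, plateau, rate):
    """Generic exponential saturation: plateau * (1 - exp(-rate * n_targets)).

    plateau and rate are hypothetical parameters; in practice they would
    be fitted to the observed partnership counts.
    """
    return plateau * (1.0 - exp(-rate * n_targets))

# Evaluate the curve at a few target counts with made-up parameters.
curve = [saturating_partnerships(n, plateau=50.0, rate=0.1)
         for n in (0, 10, 20, 40)]
```

Under this form, each additional target adds fewer new partnerships than the previous one, and the count never exceeds the plateau.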

Macromolecular structure and function

We developed a novel method, RigidFinder, for identification of rigid blocks from different conformations across many scales, from large complexes to small loops (Abyzov et al., 2010). RigidFinder defines rigidity in terms of blocks in which inter-residue distances are conserved across conformations. Unlike most other methods, which use averaged values such as RMSD to identify motions, our method is based on distance conservation, which allows for sensitive identification of motions.
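The distance-conservation criterion can be sketched in a few lines of Python. This is a simplified illustration, not the RigidFinder algorithm: the greedy grouping, the function names and the `tol` threshold are assumptions, and each residue is represented by a single 3D point.

```python
from itertools import combinations
from math import dist

def conserved_pairs(conf_a, conf_b, tol=0.5):
    """Index pairs (i, j), i < j, whose inter-residue distance differs by
    at most tol between the two conformations."""
    n = len(conf_a)
    return {
        (i, j)
        for i, j in combinations(range(n), 2)
        if abs(dist(conf_a[i], conf_a[j]) - dist(conf_b[i], conf_b[j])) <= tol
    }

def greedy_rigid_blocks(conf_a, conf_b, tol=0.5):
    """Greedily group residues so that every pair inside a block has a
    conserved distance (a heuristic, not an optimal partition)."""
    ok = conserved_pairs(conf_a, conf_b, tol)
    blocks = []
    for i in range(len(conf_a)):
        for block in blocks:
            # Members of a block always have smaller indices than i.
            if all((j, i) in ok for j in block):
                block.append(i)
                break
        else:
            blocks.append([i])
    return blocks

# Toy example: residues 2 and 3 move together relative to residues 0 and 1.
conf_a = [(0, 0, 0), (1, 0, 0), (5, 0, 0), (6, 0, 0)]
conf_b = [(0, 0, 0), (1, 0, 0), (5, 3, 0), (6, 3, 0)]
blocks = greedy_rigid_blocks(conf_a, conf_b)
```

Because the test compares internal distances rather than superimposed coordinates, no structural alignment of the two conformations is needed, and a small moving loop is not averaged away as it can be in a global RMSD.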

Using global ocean sampling (GOS) data, we found approximately 900,000 membrane proteins in large-scale metagenomic sequence (Patel et al., 2010). About a fifth of these membrane proteins are novel proteins. We show that there is widespread variation in membrane protein content across marine sites, which is correlated with changes in both oceanographic variables and human factors. For instance, we find that the occurrence of iron transporters is connected to the amount of shipping, pollution, and iron-containing dust.

Many protein interactions involve short linear motifs consisting of 5-10 amino acid residues that interact with modular protein domains such as the SH3 binding domains and the kinase catalytic domains (Lam et al., 2010b). We have developed an efficient search algorithm to scan the target proteome for potential domain targets and to increase the accuracy of each hit by integrating a variety of pre-computed features, such as conservation, surface propensity, and disorder.

Reproducible Research: Addressing the need for data and code sharing in computational science.
Yale Law School Roundtable on Data and Code Sharing (2010). Computing in Science & Engineering 12(5): 8-13 (Sept/Oct).

Integrative analysis of the Caenorhabditis elegans genome by the modENCODE project.
MB Gerstein, ZJ Lu, EL Van Nostrand, C Cheng, BI Arshinoff, T Liu, KY Yip, R Robilotto, A Rechtsteiner, K Ikegami, P Alves, A Chateigner, M Perry, M Morris, RK Auerbach, X Feng, J Leng, A Vielle, W Niu, K Rhrissorrakrai, A Agarwal, RP Alexander, G Barber, CM Brdlik, J Brennan, JJ Brouillet, A Carr, MS Cheung, H Clawson, S Contrino, LO Dannenberg, AF Dernburg, A Desai, L Dick, AC Dose, J Du, T Egelhofer, S Ercan, G Euskirchen, B Ewing, EA Feingold, R Gassmann, PJ Good, P Green, F Gullier, M Gutwein, MS Guyer, L Habegger, T Han, JG Henikoff, SR Henz, A Hinrichs, H Holster, T Hyman, AL Iniguez, J Janette, M Jensen, M Kato, WJ Kent, E Kephart, V Khivansara, E Khurana, JK Kim, P Kolasinska-Zwierz, EC Lai, I Latorre, A Leahey, S Lewis, P Lloyd, L Lochovsky, RF Lowdon, Y Lubling, R Lyne, M MacCoss, SD Mackowiak, M Mangone, S McKay, D Mecenas, G Merrihew, DM Miller, A Muroyama, JI Murray, SL Ooi, H Pham, T Phippen, EA Preston, N Rajewsky, G Ratsch, H Rosenbaum, J Rozowsky, K Rutherford, P Ruzanov, M Sarov, R Sasidharan, A Sboner, P Scheid, E Segal, H Shin, C Shou, FJ Slack, C Slightam, R Smith, WC Spencer, EO Stinson, S Taing, T Takasaki, D Vafeados, K Voronina, G Wang, NL Washington, CM Whittle, B Wu, KK Yan, G Zeller, Z Zha, M Zhong, X Zhou, modENCODE Consortium, J Ahringer, S Strome, KC Gunsalus, G Micklem, XS Liu, V Reinke, SK Kim, LW Hillier, S Henikoff, F Piano, M Snyder, L Stein, JD Lieb, RH Waterston (2010). Science 330: 1775-87.

Rewiring of transcriptional regulatory networks: hierarchy, rather than connectivity, better reflects the importance of regulators.
N Bhardwaj, PM Kim, MB Gerstein (2010). Sci Signal 3: ra79.

Extensive in vivo metabolite-protein interactions revealed by large-scale systematic analyses.
X Li, TA Gianoulis, KY Yip, M Gerstein, M Snyder (2010). Cell 143: 639-50.

Detection of copy number variation from array intensity and sequencing read depth using a stepwise Bayesian model.
ZD Zhang, MB Gerstein (2010). BMC Bioinformatics 11: 539.

A map of human genome variation from population-scale sequencing.
1000 Genomes Project Consortium, GR Abecasis, D Altshuler, A Auton, LD Brooks, RM Durbin, RA Gibbs, ME Hurles, GA McVean (2010). Nature 467: 1061-73.

FusionSeq: a modular framework for finding gene fusions by analyzing paired-end RNA-sequencing data.
A Sboner, L Habegger, D Pflueger, S Terry, DZ Chen, JS Rozowsky, AK Tewari, N Kitabayashi, BJ Moss, MS Chee, F Demichelis, MA Rubin, MB Gerstein (2010). Genome Biol 11: R104.

Structured digital tables on the Semantic Web: toward a structured digital literature.
KH Cheung, M Samwald, RK Auerbach, MB Gerstein (2010). Mol Syst Biol 6: 403.

Annotating non-coding regions of the genome.
RP Alexander, G Fang, J Rozowsky, M Snyder, MB Gerstein (2010). Nat Rev Genet 11: 559-71.

Segmental duplications in the human genome reveal details of pseudogene formation.
E Khurana, HY Lam, C Cheng, N Carriero, P Cayting, MB Gerstein (2010). Nucleic Acids Res 38: 6997-7007.

Comparison and calibration of transcriptome data from RNA-Seq and tiling arrays.
A Agarwal, D Koppstein, J Rozowsky, A Sboner, L Habegger, LW Hillier, R Sasidharan, V Reinke, RH Waterston, M Gerstein (2010). BMC Genomics 11: 383.

Using semantic web rules to reason on an ontology of pseudogenes.
ME Holford, E Khurana, KH Cheung, M Gerstein (2010). Bioinformatics 26: i71-8.

Analysis of combinatorial regulation: scaling of partnerships between regulators with the number of governed targets.
N Bhardwaj, MB Carson, A Abyzov, KK Yan, H Lu, MB Gerstein (2010). PLoS Comput Biol 6: e1000755.

3V: cavity, channel and cleft volume calculator and extractor.
NR Voss, M Gerstein (2010). Nucleic Acids Res 38: W555-62.

MOTIPS: automated motif analysis for predicting targets of modular protein domains.
HY Lam, PM Kim, J Mok, R Tonikian, SS Sidhu, BE Turk, M Snyder, MB Gerstein (2010). BMC Bioinformatics 11: 243.

Comparing genomes to computer operating systems in terms of the topology and evolution of their regulatory control networks.
KK Yan, G Fang, N Bhardwaj, RP Alexander, M Gerstein (2010). Proc Natl Acad Sci U S A 107: 9186-91.

Analysis of membrane proteins in metagenomics: networks of correlated environmental features and protein families.
PV Patel, TA Gianoulis, RD Bjornson, KY Yip, DM Engelman, MB Gerstein (2010). Genome Res 20: 960-71.

Network modeling identifies molecular functions targeted by miR-204 to suppress head and neck tumor metastasis.
Y Lee, X Yang, Y Huang, H Fan, Q Zhang, Y Wu, J Li, R Hasina, C Cheng, MW Lingen, MB Gerstein, RR Weichselbaum, HR Xing, YA Lussier (2010). PLoS Comput Biol 6: e1000730.

Getting started in gene orthology and functional analysis.
G Fang, N Bhardwaj, R Robilotto, MB Gerstein (2010). PLoS Comput Biol 6: e1000703.

Analysis of diverse regulatory networks in a hierarchical context shows consistent tendencies for collaboration in the middle levels.
N Bhardwaj, KK Yan, MB Gerstein (2010). Proc Natl Acad Sci U S A 107: 6841-6.

Variation in transcription factor binding among humans.
M Kasowski, F Grubert, C Heffelfinger, M Hariharan, A Asabere, SM Waszak, L Habegger, J Rozowsky, M Shi, AE Urban, MY Hong, KJ Karczewski, W Huber, SM Weissman, MB Gerstein, JO Korbel, M Snyder (2010). Science 328: 232-5.

Molecular sampling of prostate cancer: a dilemma for predicting disease progression.
A Sboner, F Demichelis, S Calza, Y Pawitan, SR Setlur, Y Hoshida, S Perner, HO Adami, K Fall, LA Mucci, PW Kantoff, M Stampfer, SO Andersson, E Varenhorst, JE Johansson, MB Gerstein, TR Golub, MA Rubin, O Andren (2010). BMC Med Genomics 3: 8.

Identification and analysis of unitary pseudogenes: historic and contemporary gene losses in humans and other primates.
ZD Zhang, A Frankish, T Hunt, J Harrow, M Gerstein (2010). Genome Biol 11: R26.

Genome-wide sequence-based prediction of peripheral proteins using a novel semi-supervised learning technique.
N Bhardwaj, M Gerstein, H Lu (2010). BMC Bioinformatics 11 Suppl 1: S6.

Dynamic transcriptomes during neural differentiation of human embryonic stem cells revealed by short, long, and paired-end sequencing.
JQ Wu, L Habegger, P Noisa, A Szekely, C Qiu, S Hutchison, D Raha, M Egholm, H Lin, S Weissman, W Cui, M Gerstein, M Snyder (2010). Proc Natl Acad Sci U S A 107: 5254-9.

Personal genome sequencing: current approaches and challenges.
M Snyder, J Du, M Gerstein (2010). Genes Dev 24: 423-31.

Genome-wide identification of binding sites defines distinct functions for Caenorhabditis elegans PHA-4/FOXA in development and environmental response.
M Zhong, W Niu, ZJ Lu, M Sarov, JI Murray, J Janette, D Raha, KL Sheaffer, HY Lam, E Preston, C Slightham, LW Hillier, T Brock, A Agarwal, R Auerbach, AA Hyman, M Gerstein, SE Mango, SK Kim, RH Waterston, V Reinke, M Snyder (2010). PLoS Genet 6: e1000848.

Deciphering protein kinase specificity through large-scale analysis of yeast phosphorylation site motifs.
J Mok, PM Kim, HY Lam, S Piccirillo, X Zhou, GR Jeschke, DL Sheridan, SA Parker, V Desai, M Jwa, E Cameroni, H Niu, M Good, A Remenyi, JL Ma, YJ Sheu, HE Sassi, R Sopko, CS Chan, C De Virgilio, NM Hollingsworth, WA Lim, DF Stern, B Stillman, BJ Andrews, MB Gerstein, M Snyder, BE Turk (2010). Sci Signal 3: ra12.

Close association of RNA polymerase II and many transcription factors with Pol III genes.
D Raha, Z Wang, Z Moqtaderi, L Wu, G Zhong, M Gerstein, K Struhl, M Snyder (2010). Proc Natl Acad Sci U S A 107: 3639-44.

Improved reconstruction of in silico gene regulatory networks by integrating knockout and perturbation data.
KY Yip, RP Alexander, KK Yan, M Gerstein (2010). PLoS One 5: e8121.

Nucleotide-resolution analysis of structural variants using BreakSeq and a breakpoint library.
HY Lam, XJ Mu, AM Stutz, A Tanzer, PD Cayting, M Snyder, PM Kim, JO Korbel, MB Gerstein (2010). Nat Biotechnol 28: 47-55.

RigidFinder: a fast and sensitive method to detect rigid blocks in large macromolecular complexes.
A Abyzov, R Bjornson, M Felipe, M Gerstein (2010). Proteins 78: 309-24.

Total ancestry measure: quantifying the similarity in tree-like classification, with genomic applications.
H Yu, R Jansen, G Stolovitzky, M Gerstein (2007). Bioinformatics 23: 2163-73.
