Summary of Gerstein Lab Research in 2011


Plummeting sequencing costs continue to open up exciting new avenues of research for probing the human genome in a high-throughput fashion. In addition to whole-genome sequencing, RNA-Seq, ChIP-Seq and other new methods are adding to the data deluge triggered by advances in next-generation sequencing technologies. While the relentless pace of data accumulation gives us exciting opportunities to better understand the human genome, it also requires new tools and algorithms to analyze and interpret the data. The current 'big data' era also brings challenges in data storage, secure and responsible use of data, and privacy. We are developing new tools to analyze personal genomes, focusing on genome annotation, functional annotation and interpretation of the genome, data integration, network analyses, and protein structure, function and dynamics. We also address issues related to the cost of sequencing and privacy concerns.


We have developed several tools and pipelines for data analyses. Here is a brief synopsis:

1. RSEQtools: Analysis of RNA-Seq experiments. These tools consist of a set of modules that perform common tasks, such as calculating gene and exon expression values, generating signal tracks of mapped reads, and segmenting that signal into actively transcribed regions (Habegger et al., 2011).

2. CNVnator: A tool for discovery and genotyping of CNVs (Abyzov et al., 2011a). It is based on read-depth analysis of personal genome sequencing data.

3. AGE, Alignment with Gap Excision: A tool that implements an algorithm for optimal alignment of sequences with SVs (Abyzov et al., 2011b). This allows us to precisely define breakpoints and thus the personal genome sequence.

4. SRIC utilizes split-read identification for calling SVs and identifying indels (Zhang et al., 2011).

5. ACT: An aggregation and correlation tool for high-throughput genomic data (Jee et al., 2011).

6. AlleleSeq: A pipeline to elucidate differences in expression and binding between the paternal and maternal alleles (Rozowsky et al., 2011).
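
As a rough illustration of the read-depth idea mentioned for CNVnator above, the sketch below counts reads in fixed-size bins and flags bins whose depth deviates strongly from the median. It is a simplified toy and not CNVnator's actual algorithm, which this summary does not describe; the function names, bin size and thresholds are assumptions chosen for the example.

```python
from statistics import median


def bin_read_depth(read_starts, genome_length, bin_size):
    """Count reads whose start position falls in each fixed-size bin."""
    n_bins = (genome_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for pos in read_starts:
        if 0 <= pos < genome_length:
            counts[pos // bin_size] += 1
    return counts


def flag_cnv_bins(counts, loss_ratio=0.5, gain_ratio=1.5):
    """Label each bin 'loss', 'gain' or 'normal' relative to the median depth.

    The ratio thresholds are illustrative assumptions, not published values.
    """
    typical = median(counts)
    labels = []
    for c in counts:
        if typical == 0:
            labels.append("normal")
        elif c <= loss_ratio * typical:
            labels.append("loss")
        elif c >= gain_ratio * typical:
            labels.append("gain")
        else:
            labels.append("normal")
    return labels


if __name__ == "__main__":
    # Hypothetical read start positions on a 40 bp toy genome.
    reads = [0, 1, 2, 3, 10, 11, 12, 13, 20, 30, 31, 32, 33, 34, 35, 36, 37]
    depth = bin_read_depth(reads, genome_length=40, bin_size=10)
    print(depth, flag_cnv_bins(depth))
```

A real caller would also need to correct for GC content and mappability and to merge adjacent bins into segments, which this sketch leaves out.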

Annotating personal genomes

The concept of 'a human reference' is questionable in the light of inter-individual variations. Gene annotations based on one reference genome are thus inadequate and sometimes erroneous. We highlight the problem associated with using a reference genome and propose a solution to overcome this problem by using ancestral alleles to define the reference sequence (Balasubramanian et al., 2011).

Based on whole-genome sequencing data from 185 individuals (pilot phase of the 1000 Genomes Project), we characterized SVs that include 22,025 deletions and 6,000 additional events, including insertions and tandem duplications. This represents the first large-scale resource of SVs and can be used for sequencing-based association studies (Mills et al., 2011).

While it is known that noncoding elements such as transcription-factor binding sites (TFBSes) are under selection because of their functional importance, selection in noncoding regions has not been studied extensively. We developed a framework to annotate and identify noncoding elements under selection using the wealth of variation data (SNPs, indels and SVs) from the 1000 Genomes Project pilot phase. Using various population-based metrics, we compared different classes of noncoding DNA. We found several classes of elements, such as TFBSes, microRNAs and their target sites, to be under selection relative to adjoining DNA (Mu et al., 2011).

We found widespread allele-specific expression and binding in one European individual (NA12878) by combining variation data for NA12878 from the 1000 Genomes Project with RNA-Seq and ChIP-Seq data (Rozowsky et al., 2011).
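
To make the notion of allele-specific expression concrete, the sketch below tests whether reads covering a heterozygous site are split evenly between the two alleles. An exact binomial test is a common choice for this kind of question and is assumed here for illustration; the summary does not spell out the statistics of the pipeline, and the function name and read counts are hypothetical.

```python
from math import comb


def binomial_two_sided_p(ref_reads, alt_reads, p=0.5):
    """Exact two-sided binomial p-value for the observed allele counts.

    Null hypothesis: each read comes from the reference allele with
    probability p (0.5 means no allelic bias). Intended for modest read
    counts; very large counts would overflow the float conversion.
    """
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_reads]
    # Sum the probabilities of all outcomes at least as unlikely as the
    # observed one (small tolerance guards against float round-off).
    total = sum(x for x in pmf if x <= observed * (1 + 1e-9))
    return min(1.0, total)


if __name__ == "__main__":
    # Hypothetical counts at two heterozygous SNPs.
    print(binomial_two_sided_p(5, 5))   # balanced alleles
    print(binomial_two_sided_p(10, 0))  # all reads from one allele
```

In a genome-wide setting the resulting p-values would need multiple-testing correction, which is outside the scope of this sketch.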


Developing methods to understand networks and to compare different kinds of networks is important for understanding the interplay between various factors. We developed a formalism to quantify differences between networks. We found that different types of biological networks consistently rewire at different rates (Shou et al., 2011).
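
One simple way to picture a rewiring rate is sketched below: take two networks over a shared set of nodes, measure the fraction of edges found in only one of them, and divide by the divergence time. This is an illustrative measure under stated assumptions (undirected edges, hypothetical function names, caller-chosen time units) and should not be read as the formalism of Shou et al.

```python
def rewiring_fraction(edges_a, edges_b):
    """Fraction of edges, among shared nodes, present in only one network."""
    nodes_a = {n for e in edges_a for n in e}
    nodes_b = {n for e in edges_b for n in e}
    shared = nodes_a & nodes_b
    # Treat edges as undirected and keep only those between shared nodes.
    a = {frozenset(e) for e in edges_a if set(e) <= shared}
    b = {frozenset(e) for e in edges_b if set(e) <= shared}
    union = a | b
    if not union:
        return 0.0
    return len(a ^ b) / len(union)


def rewiring_rate(edges_a, edges_b, divergence_time):
    """Rewired fraction per unit of divergence time (units are the caller's)."""
    return rewiring_fraction(edges_a, edges_b) / divergence_time


if __name__ == "__main__":
    # Hypothetical toy networks over the same four nodes.
    net_a = [("A", "B"), ("B", "C"), ("C", "D")]
    net_b = [("A", "B"), ("B", "D"), ("C", "D")]
    print(rewiring_fraction(net_a, net_b))
    print(rewiring_rate(net_a, net_b, divergence_time=100))
```

Comparing this quantity across network types (for example regulatory versus protein-protein interaction networks) is the kind of analysis the paragraph above refers to, though the published measure may differ.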

We overlaid dynamic conformational changes based on 3D protein structures onto a protein-protein interaction network and built a Dynamic Structural Interaction Network (DynaSIN). We show that multi-interface hubs display a greater degree of conformational change than single-interface ones (Bhardwaj et al., 2011b). We also find that transient associations involve smaller conformational changes than permanent ones.

We integrated three different regulatory networks (TF->gene, TF->miRNA and miRNA->gene) to understand multi-level regulation in higher eukaryotes. These networks were based on RNA-Seq and ChIP-Seq data on C. elegans generated as part of the modENCODE project. This analysis showed that transcription factors downstream in the hierarchy are expressed more uniformly across tissues, have more interacting partners, and are more likely to be essential (Cheng et al., 2011a).

Protein structure, function and dynamics

We predict specificity of peptide recognition domains based on coevolution analysis. Essentially, this method identifies residues in peptide binding domains that exhibit evolutionary co-variation with some positions of the bound peptides. Our predictions agree well with published experimental results (Yip et al., 2011).
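
The idea of evolutionary co-variation between a domain position and a peptide position can be illustrated with mutual information between two alignment columns, as sketched below. Mutual information is one common covariation statistic and is used here only as an assumption for illustration; the summary does not say which statistic the method of Yip et al. uses, and the function name and toy columns are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(column_x, column_y):
    """Mutual information (in bits) between two aligned columns of residues.

    column_x[i] and column_y[i] must come from the same domain-peptide pair.
    """
    n = len(column_x)
    if n == 0 or n != len(column_y):
        raise ValueError("columns must be non-empty and of equal length")
    count_x = Counter(column_x)
    count_y = Counter(column_y)
    count_xy = Counter(zip(column_x, column_y))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts.
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


if __name__ == "__main__":
    # Toy columns: a domain residue and the peptide residue it contacts.
    print(mutual_information("KKEE", "DDRR"))  # perfectly co-varying
    print(mutual_information("KKEE", "DRDR"))  # independent
```

High-scoring pairs of positions would be candidates for specificity-determining contacts; in practice such scores need correction for phylogenetic and sampling bias.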

We developed a method to predict ligand-binding motions in proteins with hinge bending domains (Flores et al., 2011). Using MD simulations of protein and ligand, we generate ligand-bound conformations. The method works well in predicting a ligand-bound structure given the structure of the unbound protein.

Analysis and interpretation

We predict yeast transcription factor targets by integrating histone modification profiles with transcription factor binding motif information (Cheng et al., 2011b). We also developed a method called TIP (Target Identification from Profiles), which quantitatively measures the regulatory relationships between TFs and their target genes (Cheng et al., 2011c).

We developed a statistical formalism to study the relationship between gene expression and chromatin features. Using ChIP-Seq and RNA-Seq data on C. elegans from modENCODE, we developed a model to predict gene expression levels based on chromatin features (Cheng et al., 2011d).
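
As a minimal picture of predicting expression from a chromatin feature, the sketch below fits an ordinary least-squares line relating one feature signal to expression and uses it for prediction. The choice of a single feature and a linear model is an assumption made to keep the example short; the actual model in Cheng et al. is not described in this summary, and the function names and numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("feature values are all identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Predicted expression for a gene with chromatin-feature signal x."""
    return slope * x + intercept


if __name__ == "__main__":
    # Hypothetical per-gene values: feature signal and log expression.
    signal = [1.0, 2.0, 3.0, 4.0]
    expression = [3.0, 5.0, 7.0, 9.0]
    slope, intercept = fit_line(signal, expression)
    print(slope, intercept, predict(slope, intercept, 5.0))
```

With many chromatin features, the same idea extends to multiple regression or other supervised models, evaluated on genes held out from training.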

We developed an integrative machine learning method, incRNA, for whole-genome identification of noncoding RNAs (ncRNAs) by combining expression data with RNA secondary-structure stability and evolutionary conservation, at the protein and nucleic-acid level in the worm (Lu et al., 2011). This can be adapted to other organisms.

Perspectives in the 'big-data era'

This is the age of rapid information transfer via the internet. We studied the rate at which information spreads by looking at web statistics from PLoS article-level metrics (Yan et al., 2011). The spread had two distinct phases: one corresponding to the short-term fame of a paper and another related to long-term citation statistics.

With the continual drop in sequencing costs, there is much talk about the $1000 genome. While the cost of sequencing is indeed coming down rapidly, that price tag is misleading: it leaves out the computational infrastructure needed to handle the huge amount of sequencing data and the costs of downstream analyses of this data (Sboner et al., 2011).

The large scale of personal genomic sequencing data makes it difficult to share data without compromising privacy. We discuss potential breaches of privacy in personal genomic data arising from analyses that can reveal an individual's identity. This can also indirectly affect family members, because their genome content can be deconvoluted from one family member's genome. We describe various ways to overcome this problem and suggest that cloud computing may allow access to the data in a more controlled fashion. We also discuss ways to educate the public about these issues (Greenbaum et al., 2011a,b).
