Summary of Gerstein Lab Research in 2014
High-throughput genomics has become the mainstay of genomics research due to the pace of advances in sequencing technology. New and innovative genome-wide screens for probing the human genome are continuously being developed, and large-scale computational analyses go hand in hand with this development. To keep up with the accelerating pace of discovery, in 2014 we were involved in all areas of "big data" analysis: automated pipelines for streamlining big data analysis, algorithm and methods development, and analysis and functional interpretation of genetic variants. Specifically, our research focused on comparative and functional genomics, discovery of genomic variations in human disease, data integration, and regulatory networks. Here we briefly present our recent publications and the new computational genomics tools we developed. We are actively participating in several large consortium projects, such as 1000 Genomes, BrainSpan, Extracellular RNA Communication, PsychENCODE, KBase and ENCODE.
Overall in 2014 we had 12 publications.
* Many of these were related to the comparative ENCODE presentation, which was featured in a Yale press release and the NY Times (see info.gersteinlab.org/cmptxnpress). In particular:
1) A large-scale transcriptome analysis [PMID:25164755 Gerstein et al. 2014] across human, worm, and fly revealed co-expression modules shared across animals and enriched in developmental genes. We introduced a so-called 'universal model' for quantitative prediction of coding and non-coding gene expression levels from chromatin features at the promoter. The model is based on a single set of organism-independent parameters and, in the three organisms, achieved accuracy comparable to that of organism-specific models.
2) Capitalizing on the vast amount of uniformly processed experimental data obtained by the ENCODE and modENCODE consortia, we performed a series of fundamental large-scale comparative genomics studies across distant metazoan phyla. A comparative analysis of basic principles of transcriptional regulatory features in human, worm, and fly cells (at different developmental stages and conditions) revealed remarkable conservation of general structural properties of regulatory networks despite extensive divergence of individual network features. [PMID:25164757 Boyle et al. 2014]
3) Another study centered on a multi-organism comparison of pseudogenes and revealed that pseudogenes are much more lineage-specific than protein-coding genes, reflecting the different genome remodeling processes in each organism's evolution. [PMID:25157146 Sisu et al. 2014]
4) We developed a new comparative genomics tool, OrthoClust, [PMID:25249401 Yan et al. 2014] for simultaneously clustering data across multiple species. OrthoClust integrates the co-association networks of individual species utilizing the orthology relationships of genes between species. It has been used to obtain co-expression modules from worm and fly RNA-Seq expression profiles.
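The idea behind OrthoClust, clustering several species at once while letting orthology links tie the per-species networks together, can be illustrated with a small scoring function. This is a minimal sketch and not the published OrthoClust implementation: the function name `coupled_modularity`, the coupling weight `kappa`, and the toy gene names are all hypothetical, and the score (per-species Newman modularity plus a reward for each orthologous pair that shares a module label) is only an assumed stand-in for the actual objective.

```python
from collections import defaultdict

def coupled_modularity(networks, orthologs, labels, kappa=1.0):
    """Score a joint module assignment across species (illustrative only).

    networks:  {species: [(gene_a, gene_b), ...]} undirected co-association edges
    orthologs: [(gene_x, gene_y), ...] orthologous gene pairs between species
    labels:    {gene: module_id}; gene names are assumed unique across species
    kappa:     hypothetical weight of the cross-species coupling term
    """
    score = 0.0
    for edges in networks.values():
        m = len(edges)
        if m == 0:
            continue
        degree = defaultdict(int)
        for a, b in edges:
            degree[a] += 1
            degree[b] += 1
        # Observed number of edges that fall inside a module.
        within = sum(1 for a, b in edges if labels[a] == labels[b])
        # Expected number under the configuration null model.
        module_degree = defaultdict(int)
        for gene, k in degree.items():
            module_degree[labels[gene]] += k
        expected = sum(d * d for d in module_degree.values()) / (4.0 * m)
        score += (within - expected) / m  # Newman modularity of this species
    # Reward orthologous pairs that were placed in the same module.
    score += kappa * sum(1 for a, b in orthologs if labels[a] == labels[b])
    return score

# Toy data: one triangle and one extra edge per species.
networks = {
    "worm": [("w1", "w2"), ("w2", "w3"), ("w1", "w3"), ("w4", "w5")],
    "fly": [("f1", "f2"), ("f2", "f3"), ("f1", "f3"), ("f4", "f5")],
}
orthologs = [("w1", "f1"), ("w2", "f2"), ("w4", "f4")]

# Assignment A: orthologous modules share labels across species.
joint = {"w1": 0, "w2": 0, "w3": 0, "w4": 1, "w5": 1,
         "f1": 0, "f2": 0, "f3": 0, "f4": 1, "f5": 1}
# Assignment B: same per-species modules, but labelled independently.
separate = {"w1": 0, "w2": 0, "w3": 0, "w4": 1, "w5": 1,
            "f1": 2, "f2": 2, "f3": 2, "f4": 3, "f5": 3}

joint_score = coupled_modularity(networks, orthologs, joint)
separate_score = coupled_modularity(networks, orthologs, separate)
```

Both assignments have the same within-species modularity; only the joint labelling is rewarded by the coupling term, which is what pushes orthologous genes toward shared cross-species modules in this simplified picture.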
* We had additional work in genome annotation, viz:
1) ChIP-Seq is the mainstream experimental method for genome-wide identification of transcription factor binding and chromatin modification sites. Continuing our efforts to advance ChIP-Seq analysis techniques, we introduced a new signal-processing algorithm, MUSIC [PMID:25292436 Harmanci et al. 2014], for identification of enriched regions in ChIP-Seq experiments. MUSIC utilizes multi-scale decomposition of the signal profile and allows identification of broad enrichment domains. It performs favorably compared to other published methods.
2) In a collaborative perspective [PMID:24753594 Kellis et al. 2014] we investigated the advantages and limitations of different approaches for identifying functional DNA segments, an important area that has attracted much attention and effort (such as the ENCODE Project) since the completion of the human genome sequence.
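To illustrate why a multi-scale view of a ChIP-Seq signal profile helps recover broad enrichment domains (as described for MUSIC above), the following is a minimal sketch and not MUSIC itself. It assumes a sliding median filter as the smoother and a fixed threshold for calling enrichment; the function names, window sizes, and toy signal are all hypothetical.

```python
from statistics import median

def median_smooth(signal, half_window):
    """Sliding median with a window of 2 * half_window + 1 (truncated at the ends)."""
    n = len(signal)
    return [
        median(signal[max(0, i - half_window): min(n, i + half_window + 1)])
        for i in range(n)
    ]

def enriched_regions(signal, threshold):
    """Return half-open (start, end) intervals where signal > threshold."""
    regions, start = [], None
    for i, value in enumerate(signal):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(signal)))
    return regions

def multiscale_regions(signal, half_windows, threshold):
    """Call enriched regions on the signal smoothed at each scale."""
    return {
        w: enriched_regions(median_smooth(signal, w), threshold)
        for w in half_windows
    }

# Toy coverage profile: a broad domain interrupted by short dips.
toy_signal = [0, 0, 4, 4, 0, 4, 4, 0, 4, 4, 0, 0, 0, 0]
calls = multiscale_regions(toy_signal, half_windows=[0, 1], threshold=2)
```

At the finest scale (`half_window=0`) the toy domain is called as three fragments, while at the next scale the dips are smoothed out and a single broad region remains.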
* We had a number of papers related to disease and cancer genomics, viz:
1) Advancing the functionality of our recently developed Function-based Prioritization of Sequence Variants (FunSeq) tool [PMID:24092746 Khurana et al. 2013] for identification of candidate drivers in tumor genomes, this year we introduced a more elaborate and flexible framework, FunSeq2, [PMID:25273974 Fu et al. 2014] integrating various genomic and cancer resources to prioritize cancer somatic variants, especially regulatory noncoding mutations.
2) We contributed to the development of another computational framework with potential application to cancer genomics: VarSim [PMID:25524895 Mu et al. 2014] is a pipeline for assessment and validation of alignment and variant calling accuracy in high-throughput genome sequencing using simulated or real data. An enhanced version of vcf2diploid [PMID:21811232 Rozowsky et al. 2011], a program developed previously in our lab, is now used as part of this pipeline to generate a diploid genome with simulated variants.
3) Next-generation DNA sequencing has provided investigators with the opportunity to discover genetic variants in human disease. We participated in formulating recommendations and general guidelines summarizing confidence in variant pathogenicity and focusing on aspects of study design, gene-level and variant-level implications, annotation of causality in public databases, and implications for clinical diagnosis. [PMID:24759409 MacArthur et al. 2014]
* We had a number of collaborative papers, particularly related to neuro-transcriptomics, viz:
1) We were involved in transcriptome analysis of the prenatal human brain, part of a large project, the BrainSpan Atlas of the Developing Human Brain, aimed at understanding human brain development and identifying the roots of neurodevelopmental and psychiatric disorders. [PMID:24695229 Miller et al. 2014]
2) In collaboration with the Turk group at Dept. of Pharmacology, YSM, we looked into phosphoacceptor site specificity of serine/threonine-specific protein kinases. The study [PMID:24374310 Chen et al. 2014] revealed that a single kinase residue determines the preference for either serine or threonine, thus allowing one to predict the specificity of this class of proteins based on amino acid sequence.
3) Finally, we presented a review on neuroproteomics (in collaboration with Prof. Nairn, Division of Molecular Psychiatry, YSM), [PMID:25349915 Kitchen et al. 2014] discussing the benefits and perspectives of integration of proteomic and functional genomic data, as well as experimental strategies to achieve cell-type specificity in transcriptomic and proteomic studies of neural tissues.