The rapid progress in next-generation sequencing technology has led to a data deluge. Genomics research has shifted from a hypothesis-driven science to data-driven analysis of high-throughput data. This has provided us with tremendous opportunities to characterize the human genome at multiple levels: DNA (genomic), RNA (transcriptomic) and protein (proteomic). For the first time, biological research is limited by the rate of data analysis rather than data generation. To bridge this gap and extract useful information from this wealth of data, we are involved in large consortium efforts to characterize the genomes of human and model organisms. We continue to develop new methods for assembly, annotation and functional characterization of the genome.
Comprehensive annotation is the cornerstone of genomic analyses. A high-quality annotation set is essential for correct interpretation of functional genomics studies. We have been involved in systematic genome annotation for several years. Coding regions have historically been the subject of intensive annotation efforts. Nonetheless, the function of some protein-coding genes is still unknown. A commonly used method to infer the function of such genes is to look for functionally well-characterized homologs in other organisms. However, identification of orthologs is not straightforward. We have elaborated on this aspect in a PLoS Computational Biology review (Fang et al., 2010). In this article, we discuss the general procedures used to identify orthologs, focus on the functional analysis of orthologs, and make suggestions for constructing better ortholog groups. Construction of ortholog groups is fundamental to many objectives, such as transferring annotation to newly sequenced genomes and comparing pathways across species.
Most of the human genome consists of DNA that does not code for proteins. The role of such noncoding regions was previously neglected, but the vast amounts of functional genomics data now available clearly indicate that noncoding regions have important regulatory roles. We are working on various kinds of noncoding genome annotation. Previously, we focused on the identification and annotation of pseudogenes and developed computational methods such as PeakSeq to identify transcription factor binding sites from ChIP-Seq data. In addition to pseudogene annotation, we are involved in the annotation and functional interpretation of noncoding regions of the genome. We briefly highlight our various research directions in genome annotation below.
On the pseudogene front, we have identified 76 human-specific pseudogenes, called unitary pseudogenes, that have genic counterparts in other organisms (Zhang et al., 2010a). These represent human-specific gene inactivation events. Based on an analysis of the interrelationship between segmental duplications and pseudogenes, we have also identified 140 novel duplicated pseudogenes, leading to an improved annotation of the 3172 pseudogenes located in segmental duplications (Khurana et al., 2010).
Annotating functional regions in the noncoding genome involves two complementary analysis techniques: comparative analysis, which involves examining DNA sequences, and functional analysis, which involves examining the output of functional genomics experiments. We have comprehensively discussed the annotation of noncoding regions of the genome in a Nature Reviews Genetics review (Alexander et al., 2010).
We have developed FusionSeq, a modular computational framework that identifies fusion transcripts from paired-end RNA-Seq data (Sboner et al., 2010). It includes filters to remove spurious candidate fusions caused by artifacts, such as misalignment or random pairing of transcript fragments, and it ranks candidates according to several statistics. In addition, it has a module to identify the exact sequences at breakpoint junctions. We used FusionSeq to detect known and novel fusions in a specially sequenced calibration data set, including eight cancers with and without known rearrangements.
We are involved in several large-scale annotation projects and describe below the main activities in the year 2010.
1. The 1000 Genomes Project (http://1000genomes.org/)
This is essentially NIH's marquee effort on personal genomics, the sequencing of individual people's genomes. In October 2010, we completed our pilot project on ~200 people, which was published in Nature (1000 Genomes Consortium paper, 2010). As part of the 1000 Genomes Consortium, we are involved in developing pipelines for analyzing the massive amounts of sequence information. We are extensively involved in three sub-groups within 1000 Genomes: (a) structural variation (SV), (b) loss of function (LOF), and (c) indel calling methods and analyses.
We have pioneered the development of methods for detection and precise mapping of SVs. Given the flood of personal genome sequencing efforts, defining the precise location of SVs at single-nucleotide breakpoint resolution is essential for reconstructing personal genome sequences and inferring the functional impact of SVs. To this end, we assembled a library of breakpoints at nucleotide resolution by collating and standardizing ~2,000 published SVs. We characterized breakpoint sequences with respect to genomic landmarks, chromosomal location, sequence motifs and physical properties. We then developed BreakSeq, a method for detecting SVs by aligning raw reads directly onto the SV breakpoint junctions of the alternative, nonreference alleles contained in our library (Lam et al., 2010a). Applied on a large scale, BreakSeq can be used for genotyping and determining SV allele frequencies. In addition, we have recently developed several algorithms for SV discovery. These include a Bayesian statistical analysis algorithm for the detection of CNVs from array-CGH and from recent whole-genome sequencing based on next-generation technologies (Zhang et al., 2010b). Using these different approaches, we built a high-quality SV dataset for the 1000 Genomes Project, which was recently published in Nature (Mills et al., 2011).
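The idea of aligning reads onto breakpoint junctions can be illustrated with a minimal sketch. It is not the BreakSeq implementation: it assumes a simple deletion, uses exact substring matching in place of a real read aligner, and the names build_junction, supports_junction, count_supporting_reads and min_overhang are hypothetical.

```python
def build_junction(left_flank, right_flank, flank_len=30):
    """Join the sequence on either side of a deletion breakpoint.

    Returns the junction sequence of the nonreference allele and the
    index of the breakpoint inside that sequence.
    """
    # Keep at most flank_len bases from each side of the breakpoint.
    left = left_flank[max(0, len(left_flank) - flank_len):]
    right = right_flank[:flank_len]
    return left + right, len(left)


def supports_junction(read, junction, breakpoint, min_overhang=5):
    """True if the read matches the junction exactly and covers at least
    min_overhang bases on each side of the breakpoint."""
    start = junction.find(read)
    while start != -1:
        end = start + len(read)
        if start <= breakpoint - min_overhang and end >= breakpoint + min_overhang:
            return True
        # Try the next occurrence of the read within the junction.
        start = junction.find(read, start + 1)
    return False


def count_supporting_reads(reads, junction, breakpoint, min_overhang=5):
    """Count reads that span the breakpoint (a crude proxy for genotype support)."""
    return sum(
        supports_junction(r, junction, breakpoint, min_overhang) for r in reads
    )
```

Requiring an overhang on both sides is the key design choice in this sketch: a read that lies entirely within one flank also matches the reference allele and so says nothing about the SV.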
2. modENCODE (http://www.modencode.org/)
modENCODE is an NIH initiative to identify all functional elements in model organisms. We spearheaded this multi-institute project for the model organism Caenorhabditis elegans. Using a variety of tools and methods, we have annotated and elucidated the functional elements of most of the conserved genome of C. elegans. This work was recently published in Science (Gerstein et al., 2010).
We continue to work on understanding the myriad interactions between proteins and other molecules in the cell and are attempting to decipher the interplay between the different entities to better understand cellular function. To this end, we reconstructed regulatory networks by combining complementary data: deletion data and time series data of gene expression after some initial perturbation (Yip et al., 2010). Comparison between our predicted networks and the actual networks shows that integrating different types of data is an effective way to predict regulatory networks.
We also compared the transcriptional regulatory network of Escherichia coli to the call graph of a canonical operating system (Linux) in terms of topology and evolution (Yan et al., 2010). We showed that even though both networks have a hierarchical layout, the transcriptional regulatory network possesses a few global regulators at the top and many targets at the bottom, whereas the call graph has many regulators controlling a small set of generic functions. This leads to highly overlapping functional modules in the call graph, in contrast to the relatively independent modules in the regulatory network. The process of biological evolution via random mutation and subsequent selection tightly constrains the evolution of regulatory network hubs. The call graph, however, exhibits rapid evolution of its highly connected generic components, made possible by designers' continual fine-tuning. These findings arise from the design principles of the two systems: robustness for biological systems and cost effectiveness (reuse) for software systems.
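The contrast between the two hierarchies can be made concrete with a small sketch that counts nodes at the top, middle and bottom of a directed network. The layer definition used here (top = no incoming edges, bottom = no outgoing edges) is a simplifying assumption, the function name layer_counts is hypothetical, and the two edge lists are toy data rather than the real E. coli or Linux networks.

```python
def layer_counts(edges):
    """Count top, middle and bottom nodes in a directed network.

    edges is an iterable of (regulator, target) pairs. A node is 'top'
    if nothing points to it, 'bottom' if it points to nothing, and
    'middle' otherwise.
    """
    edges = list(edges)
    sources = {u for u, _ in edges}
    targets = {v for _, v in edges}
    nodes = sources | targets
    top = nodes - targets
    bottom = nodes - sources
    middle = nodes - top - bottom
    return {"top": len(top), "middle": len(middle), "bottom": len(bottom)}


# Toy pyramid: one global regulator controlling many targets.
regulatory_like = [("R", "g1"), ("R", "g2"), ("R", "g3"), ("R", "g4")]

# Toy inverted pyramid: many callers sharing one generic function.
call_graph_like = [("f1", "util"), ("f2", "util"), ("f3", "util"), ("f4", "util")]
```

On these toy inputs the regulatory-like network has one top node and four bottom nodes, and the call-graph-like network has the reverse shape.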
We analyzed combinatorial regulation based on co-transcription and co-phosphorylation networks across five species ranging from E. coli to human. We found that the number of co-regulatory partnerships follows an exponential saturation curve in relation to the number of targets (Bhardwaj et al., 2010a). We also analyzed the effects of network rewiring events on transcriptional regulatory hierarchies in two species: E. coli and S. cerevisiae (Bhardwaj et al., 2010b). We found that rewiring events that affected upper levels had a more marked effect on cell proliferation rate and survival than those involving lower levels. We showed that the hierarchical level and type of change better reflected the phenotypic effect of rewiring than did the number of changes.
Macromolecular structure and function
We developed a novel method, RigidFinder, for identification of rigid blocks from different conformations across many scales, from large complexes to small loops (Abyzov et al., 2010). RigidFinder defines rigidity in terms of blocks, where inter-residue distances are conserved across conformations. Unlike most other methods, which use averaged values such as RMSD to identify motions, our method is based on distance conservation, which allows for sensitive identification of motions.
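A minimal sketch of the distance-conservation idea follows. It is not the RigidFinder algorithm: it assumes one representative coordinate per residue, compares only two conformations, and uses a simple greedy grouping. The names conserved_pairs, rigid_blocks and tolerance are hypothetical.

```python
import math
from itertools import combinations


def conserved_pairs(conf_a, conf_b, tolerance=0.5):
    """Return residue index pairs whose distance differs by at most
    tolerance between two conformations (lists of (x, y, z) tuples)."""
    if len(conf_a) != len(conf_b):
        raise ValueError("conformations must have the same number of residues")
    pairs = set()
    for i, j in combinations(range(len(conf_a)), 2):
        d_a = math.dist(conf_a[i], conf_a[j])
        d_b = math.dist(conf_b[i], conf_b[j])
        if abs(d_a - d_b) <= tolerance:
            pairs.add((i, j))
    return pairs


def rigid_blocks(conf_a, conf_b, tolerance=0.5):
    """Greedily group residues so that every pair inside a block has a
    conserved distance. A residue joins the first block it fits."""
    pairs = conserved_pairs(conf_a, conf_b, tolerance)
    blocks = []
    for i in range(len(conf_a)):
        for block in blocks:
            # Blocks only hold indices smaller than i, so (j, i) is ordered.
            if all((j, i) in pairs for j in block):
                block.append(i)
                break
        else:
            blocks.append([i])
    return blocks
```

Because the test is applied to every pair inside a block, a single residue that moves relative to the rest is excluded, whereas an averaged measure such as RMSD could hide that motion.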
Using global ocean sampling (GOS) data, we found approximately 900,000 membrane proteins in large-scale metagenomic sequence (Patel et al., 2010). About a fifth of these membrane proteins are novel proteins. We show that there is widespread variation in membrane protein content across marine sites, which is correlated with changes in both oceanographic variables and human factors. For instance, we find that the occurrence of iron transporters is connected to the amount of shipping, pollution, and iron-containing dust.
Many protein interactions involve short linear motifs consisting of 5-10 amino acid residues that interact with modular protein domains such as the SH3 binding domains and the kinase catalytic domains (Lam et al., 2010b). We have developed an efficient search algorithm to scan the target proteome for potential domain targets and to increase the accuracy of each hit by integrating a variety of pre-computed features, such as conservation, surface propensity, and disorder.
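The scan-then-score approach can be sketched as follows. This is an illustration under stated assumptions rather than our published algorithm: the motif is written as a regular expression (the P..P pattern in the usage note is the commonly cited SH3-binding core), the per-residue feature tracks and their weights are hypothetical inputs, and the names scan_motif and score_hit are invented for the example.

```python
import re


def scan_motif(sequence, pattern):
    """Return (start, matched_text) for every motif occurrence,
    including overlapping ones, using a lookahead regular expression."""
    regex = re.compile(f"(?=({pattern}))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]


def score_hit(start, length, features, weights):
    """Combine pre-computed per-residue features into one score.

    features maps a feature name to a list of per-residue values;
    weights maps the same names to their relative importance. Each
    feature is averaged over the hit before weighting.
    """
    total = 0.0
    for name, weight in weights.items():
        window = features[name][start:start + length]
        total += weight * sum(window) / len(window)
    return total
```

For example, scan_motif("AAPPLPPRAA", "P..P") finds two overlapping hits, starting at positions 2 and 3, and each hit can then be ranked with score_hit using whichever feature tracks are available.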