Technology for probing the human genome in a high-throughput fashion is advancing rapidly, and resequencing of human genomes is becoming routine. However, the sheer number of DNA variations between people makes it a daunting task to find the proverbial needles in the haystack that point to interesting biology. Identifying the variants that are most likely to impact function is therefore crucial to begin to understand their effects. The first step towards this goal is the functional annotation of variants. Our research continues to focus on genomic annotation, understanding genetic variants, developing methods for processing different kinds of functional genomics data, and integrating different kinds of data to make sense of 'big data'. Here we highlight a few vignettes that tie together annotation, prediction methods and integration of data to understand the effect of genetic variants.
One of the priorities in genome analysis is to identify variants that could lead to disease or explain mechanisms of pathogenesis. Previous studies have achieved this with reasonable success by looking at sequence conservation across species. Other approaches include analysis of protein-protein interaction networks to identify genes that are most likely to be disease-related. However, integration of different kinds of data leads to improvements in the identification of potential causal variants. We have developed an improved network-based approach that prioritizes candidate disease-causing genes by combining many different kinds of networks (PMID: 23505346, Khurana et al., 2013a). Most published methods rely on protein-protein interaction networks alone. To improve prediction accuracy and overcome the problem of missing data, we combined many networks into a unified network called Multinet. Specifically, we combined the protein-protein interaction network, the regulatory network derived from large-scale transcription factor binding experiments published by the ENCODE consortium, the signaling network, the phosphorylation network, the metabolic network and the genetic interaction network. We used the integrated network data along with other features to build a classifier that distinguishes between disease-causing genes and genes tolerant to loss-of-function. Our method provides very good discrimination between the two and will be applicable to disease gene discovery projects.
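The idea of pooling several interaction networks into one can be illustrated with a minimal sketch. It is not the Multinet implementation: the gene names and edges are made up, the function names are hypothetical, and every edge is treated as undirected, which is a simplification (regulatory and phosphorylation interactions are directional in practice). The sketch records which source networks support each edge and derives a simple per-gene degree that could serve as one classifier feature among others.

```python
from collections import defaultdict


def build_unified_network(networks):
    """Merge named edge lists into one graph.

    Returns a dict mapping each undirected edge (a sorted gene pair)
    to the set of network names that contain it.
    """
    unified = defaultdict(set)
    for name, edges in networks.items():
        for gene_a, gene_b in edges:
            if gene_a == gene_b:
                continue  # skip self-loops in this toy example
            edge = tuple(sorted((gene_a, gene_b)))  # assumption: undirected
            unified[edge].add(name)
    return dict(unified)


def degree_by_gene(unified):
    """Count distinct neighbours of each gene in the unified network."""
    degree = defaultdict(int)
    for gene_a, gene_b in unified:
        degree[gene_a] += 1
        degree[gene_b] += 1
    return dict(degree)


# Toy input: gene identifiers and edges are invented for illustration.
toy_networks = {
    "ppi": [("G1", "G2"), ("G2", "G3")],
    "regulatory": [("G1", "G2"), ("G3", "G4")],
    "phosphorylation": [("G2", "G1")],
}

unified = build_unified_network(toy_networks)
degree = degree_by_gene(unified)

for edge, sources in sorted(unified.items()):
    print(edge, sorted(sources))
print(degree)
```

An edge missing from one network can still be present in the merged graph if another network reports it, which is the sense in which combining networks mitigates missing data.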
Integrating heterogeneous datasets can lead to a better understanding of the human genome. Machine learning approaches can be used to extract meaning from multi-dimensional and disparate genomic datasets. A prerequisite to understanding genome-scale datasets is the systematic annotation of the genome under consideration. We have therefore developed a primer on machine learning methods and shown how they can be used for genome annotation. This primer, published in Genome Biology, focuses on the identification and classification of genomic elements and is intended as an introduction to machine learning (PMID: 23731483, Yip et al., 2013).
We developed a method to predict which genes in yeast are regulated by the cell cycle (PMID: 23895232, Cheng et al., 2013). Using a penalized logistic regression model, we were able to classify genes as either cell cycle regulated or cell cycle independent. We used genomic features for the training dataset. Specifically, we used 203 transcription factor binding profiles and 537 motif-matching profiles for each gene as features for the prediction. This method can be extended to study the human cell cycle and other processes regulated by transcription.
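As an illustration of the modelling approach, the following is a minimal sketch of penalized logistic regression fitted by gradient descent. Several things here are assumptions rather than details of the published method: the penalty is taken to be L2 (ridge), the function names and hyperparameters are hypothetical, and the six toy genes with two features each stand in for the real feature matrix of binding and motif-matching profiles.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_penalized_logistic(X, y, l2=0.1, lr=0.5, epochs=2000):
    """Fit logistic regression with an L2 penalty on the weights.

    The intercept is left unpenalized. Returns (weights, intercept).
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * p
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this gene under the current model.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(p):
                grad_w[j] += err * xi[j]
            grad_b += err
        for j in range(p):
            # Mean log-loss gradient plus the gradient of (l2 / 2) * w_j^2.
            w[j] -= lr * (grad_w[j] / n + l2 * w[j])
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Probability that a gene with features x belongs to the positive class."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Toy data (invented): columns are a binding signal and a motif score;
# label 1 = cell cycle regulated, 0 = not cell cycle dependent.
X = [[1.0, 0.9], [0.8, 1.0], [0.9, 0.7], [0.1, 0.2], [0.0, 0.1], [0.2, 0.0]]
y = [1, 1, 1, 0, 0, 0]

w, b = fit_penalized_logistic(X, y)
for features, label in zip(X, y):
    print(label, round(predict_proba(w, b, features), 3))
```

The penalty shrinks the weights towards zero, which matters when the number of features (several hundred here) is large relative to the number of labelled genes.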
It is clear from the deluge of human resequencing data that the current human reference genome is not representative of all humans. In particular, long stretches of DNA can vary between people by being either present or absent. The study of such structural variants and copy number variants is a relatively young area of genomic research. We developed a method to identify retroduplications (RDs) that are variable between people (PMID: 24026178, Abyzov et al., 2013). Retroduplications are essentially retrotransposed mRNA segments that are not present in the reference genome and are variable amongst humans. RDs can be either retrogenes or processed pseudogenes. We found that RDs are derived from genes that tend to be expressed at the M-to-G1 transition in the cell cycle, suggesting that cell division is perhaps coupled to retrotransposition.
We have been working on understanding patterns of natural variation in human genomes in order to identify deviations from these patterns in disease cases (PMID: 24092746, Khurana et al., 2013b). Using variation data from 1092 individuals (Phase 1 of the 1000 Genomes Project), we identified regions of the genome that are under purifying selection, as indicated by their enrichment for rare alleles. These regions, which we dubbed sensitive and ultrasensitive regions, are under strong constraint. Identifying variants in such regions in disease genomes allows us to pick out candidate disease-causing variants from the large background of neutral and benign variation. We have leveraged functional genomics data from ENCODE to functionally annotate noncoding regions and used our method to identify potential noncoding driver mutations in cancer genomes.
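The enrichment-for-rare-alleles signal can be sketched as a comparison between the fraction of rare variants in an annotation category and the fraction in a background set. This is only an illustration under stated assumptions: the 0.5% frequency cutoff for "rare" is an assumed parameter, the function and category names are hypothetical, the allele frequencies are invented, and a real analysis would also need a statistical test and correction for confounders.

```python
def rare_fraction(allele_freqs, rare_threshold=0.005):
    """Fraction of variants whose allele frequency is below the threshold."""
    if not allele_freqs:
        raise ValueError("need at least one variant")
    rare = sum(1 for freq in allele_freqs if freq < rare_threshold)
    return rare / len(allele_freqs)


def rank_categories(freqs_by_category, background_freqs, rare_threshold=0.005):
    """Rank annotation categories by their fraction of rare variants.

    Returns (name, rare fraction, excess over background) tuples,
    sorted from most to least enriched for rare alleles.
    """
    background = rare_fraction(background_freqs, rare_threshold)
    rows = []
    for name, freqs in freqs_by_category.items():
        fraction = rare_fraction(freqs, rare_threshold)
        rows.append((name, fraction, fraction - background))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Toy allele frequencies (invented) for a background set and two categories.
background_freqs = [0.001, 0.2, 0.4, 0.003, 0.05, 0.3, 0.1, 0.002, 0.25, 0.15]
freqs_by_category = {
    "category_A": [0.001, 0.002, 0.004, 0.3],
    "category_B": [0.2, 0.001, 0.4, 0.1],
}

ranked = rank_categories(freqs_by_category, background_freqs)
for name, fraction, excess in ranked:
    print(name, round(fraction, 2), round(excess, 2))
```

Categories with a large positive excess are those where variants are kept at low frequency in the population, which is the pattern expected under purifying selection.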