2008-summary

As in previous years, we continue to focus on genome annotation, network analysis and macromolecular motions. Innovations in sequencing technology have led to next-generation sequencing methods that are providing genomic data at breakneck speed. This has also ushered in a new era of personal genomics, in which we anticipate large amounts of sequence information for thousands of individual humans. We are developing methods to analyze large amounts of raw genomic sequence data as well as experimental data obtained from several large-scale genomic projects. We briefly highlight some of the results from our research below.

The main aim of genome annotation projects is to better understand the function of various genomic features. To this end, we continue to work on pseudogenes, the analysis of transcription factor binding sites and transcriptionally active regions (TARs). In previous years, we developed methods to identify pseudogenes "in silico". We have refined this method and developed an automated pipeline for the identification of pseudogenes. This year, we have developed another database called Pseudofam (http://pseudofam.pseudogene.org), a database of pseudogene families based on protein families from the Pfam database (Lam et al, 2008). It provides resources for analyzing the family structure of pseudogenes, including query tools, statistical summaries and sequence alignments. While it is widely believed that pseudogenes are non-functional, several recent studies suggest otherwise: processed pseudogenes have been shown to regulate gene expression by means of the RNA interference pathway in mouse oocytes and in Drosophila. We have speculated on the function of pseudogenes as protein fossils that live as RNA and regulate gene expression. This appeared in Nature as a "News and Views" piece (Sasidharan and Gerstein, 2008b). We have also investigated the pseudogenes of the nuclear receptor (NR) family in eight vertebrate species (Zhang et al, 2008c). In contrast to previous large-scale pseudogene analyses, all but one of the pseudogenes identified in this study are retropseudogenes, and no duplicated NR pseudogenes were found. We also identified a couple of mouse NR pseudogenes containing remnant intronic sequence, an unusual example of a pseudogene derived from a semi-processed RNA transcript.

ChIP sequencing (ChIP-seq) is a new method for genome-wide mapping of protein binding sites on DNA. We developed an "in silico" ChIP-seq, a computational method that simulates the experimental outcome by placing tags onto the genome according to particular assumed distributions for the actual binding sites and for the background genomic sequence (Zhang et al, 2008a). Our results show that both the background and the binding sites need to have a markedly nonuniform distribution in order to correctly model the observed ChIP-seq data. Based on these results, we refined an existing scoring approach to identify transcription-factor binding sites in ChIP-seq data in a statistically rigorous fashion.
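The tag-placement idea can be sketched roughly as follows. This is only an illustrative toy, not the simulation published in Zhang et al (2008a): the function name, the gamma-distributed background weights, the Pareto-distributed site strengths and all parameter values are assumptions chosen to show what "nonuniform" background and binding-site distributions could look like.

```python
import random


def simulate_chip_seq(genome_length, site_positions, n_tags,
                      signal_fraction=0.3, bin_size=1000, spread=50, seed=0):
    """Place n_tags tag positions on a linear genome (hypothetical sketch).

    A fraction of tags comes from binding sites, the rest from background.
    Both sources are deliberately nonuniform (assumed distributions).
    """
    rng = random.Random(seed)

    # Nonuniform background: each genomic bin gets a random gamma weight.
    n_bins = (genome_length + bin_size - 1) // bin_size
    bin_weights = [rng.gammavariate(1.0, 1.0) for _ in range(n_bins)]

    # Nonuniform binding: each site gets a heavy-tailed (Pareto) strength.
    site_weights = [rng.paretovariate(1.5) for _ in site_positions]

    tags = []
    for _ in range(n_tags):
        if site_positions and rng.random() < signal_fraction:
            # Signal tag: pick a site by strength, scatter around it.
            site = rng.choices(site_positions, weights=site_weights)[0]
            pos = int(round(rng.gauss(site, spread)))
        else:
            # Background tag: pick a bin by weight, uniform within the bin.
            b = rng.choices(range(n_bins), weights=bin_weights)[0]
            pos = b * bin_size + rng.randrange(bin_size)
        # Clamp to valid genome coordinates.
        tags.append(min(max(pos, 0), genome_length - 1))
    return tags


if __name__ == "__main__":
    tags = simulate_chip_seq(100_000, [20_000, 70_000], n_tags=2_000, seed=1)
    near = sum(1 for t in tags if min(abs(t - 20_000), abs(t - 70_000)) <= 200)
    print(f"{near} of {len(tags)} tags fall within 200 bp of a simulated site")
```

Simulated tags of this kind can then be scored with the same procedure applied to real data, which makes it possible to check how well a scoring approach recovers sites whose positions are known by construction.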

To better understand widespread transcription in the human genome, we investigated the transcribed loci in 420 selected ENCyclopedia Of DNA Elements (ENCODE) regions using rapid amplification of cDNA ends (RACE) sequencing (Wu et al, 2008). We were able to detect low levels of transcripts in specific cell types that were not detectable by microarrays. Our results show that the majority of the novel TARs analyzed (60%) are connected to other novel TARs or known exons. Based on this method, much of the genome appears to be represented in polyA+ RNA. Moreover, a fraction of the novel RNAs can encode protein and are likely to be functional.

As part of genome annotation, we also focus on the definition of a "gene". While it is widely used, the word "gene" does not have a clear definition. Often, the use of the word "gene" implies protein-coding DNA. Sometimes, a region of transcription is equated to a gene locus. But with new genomic data, it is clear that transcription abounds in the human genome, and not just in protein-coding regions. Therefore, we bring attention to the fuzzy definition of a "gene" in an article in American Scientist (Seringhaus and Gerstein, 2008a).

In the realm of "personal genomics", we are doing extensive work on copy number variations (CNVs) and structural variations. In addition to SNPs, CNVs are known to contribute substantially to genetic diversity among individuals. Some CNVs have been shown to be associated with diseases. We have analyzed CNVs from several angles. While non-allelic homologous recombination is the main driver of genome rearrangements, we have shown that recent events are derived through a different mechanism associated with repeats other than Alu and from non-homologous end-joining (Kim et al, 2008a). We have investigated CNVs in olfactory receptors (ORs) in 25 individuals using high-resolution oligonucleotide arrays (Hasin et al, 2008). We identified 93 OR gene loci and 151 pseudogene loci affected by CNVs, generating a mosaic of OR dosages across individuals. Our results show an enrichment of CNVs among ORs with a close human paralog or lacking a one-to-one ortholog in chimpanzee. Moreover, we observed an enrichment of CNV losses over gains, a finding potentially related to the known diminution of the human OR repertoire. We studied CNVs in relation to protein families and found that CNVs tend to affect specific gene functional categories, such as those associated with environmental response, and occur more often at the periphery of the protein interaction network (Korbel et al, 2008). In contrast, protein families associated with successful and unsuccessful duplicates are associated with similar functional categories but are differentially placed in the interaction network.

An important problem in systems biology is reconstructing complete networks of interactions between biological objects by extrapolating from a few known interactions as examples. Based on the concept of training set expansion, we have developed two methods to improve the prediction of network structure from limited interactions (Yip et al, 2008). Both methods are based on semi-supervised learning, in which a limited number of gold-standard training interactions are augmented with carefully chosen high-confidence interactions. We have shown that these methods work better than most state-of-the-art prediction algorithms. In another study, we integrated information from the 3D structure of proteins into the topological analysis of protein networks (Kim et al, 2008b). Specifically, we investigated the relationship between protein disorder and network structure. We find that hub proteins tend to be more disordered when they involve only one or two binding interfaces. But in the case of multi-interface hubs, there is no enrichment of disordered proteins.
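The general idea of training set expansion can be illustrated with a minimal sketch. It is not either of the two methods described in Yip et al (2008): the function name, the confidence scores (assumed to come from some base predictor), the threshold and the cap on added pairs are all hypothetical.

```python
def expand_training_set(gold_pairs, scored_pairs, threshold=0.9, max_added=None):
    """Augment gold-standard interactions with high-confidence predictions.

    gold_pairs   : iterable of (a, b) known interactions
    scored_pairs : dict mapping (a, b) -> confidence from a base predictor
                   (hypothetical input; any scoring method could supply it)
    Returns a set of frozenset pairs, so (a, b) and (b, a) are equivalent.
    """
    gold = {frozenset(pair) for pair in gold_pairs}

    # Keep the best score per unordered pair, skipping pairs already known.
    best = {}
    for pair, score in scored_pairs.items():
        key = frozenset(pair)
        if key not in gold and score >= threshold:
            best[key] = max(score, best.get(key, score))

    # Add the most confident candidates first, up to the optional cap.
    ranked = sorted(best, key=best.get, reverse=True)
    if max_added is not None:
        ranked = ranked[:max_added]
    return gold | set(ranked)


if __name__ == "__main__":
    gold = [("A", "B")]
    scores = {("B", "A"): 0.99, ("A", "C"): 0.95, ("B", "D"): 0.92, ("C", "D"): 0.50}
    expanded = expand_training_set(gold, scores, threshold=0.9, max_added=1)
    print(sorted(sorted(pair) for pair in expanded))
```

The expanded set would then be used to retrain the predictor. Capping the number of added pairs is one simple way to limit the risk of reinforcing early mistakes.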

In the area of macromolecular motions, we have developed HingeMaster, a tool to predict the location of hinges in proteins (Flores et al, 2008). We integrated three existing structure-based hinge predictors and one sequence-based predictor into one combined predictor using a weighted voting scheme. We encapsulated all our results in a web tool that can be used to run all the predictors on submitted proteins and visualize the results.
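A weighted voting scheme of this general kind can be sketched as follows. This is an illustration only and does not reproduce HingeMaster: the predictor names, the weights, the per-residue 0/1 votes and the decision threshold are all made-up placeholders.

```python
def combine_hinge_predictions(predictions, weights, threshold=0.5):
    """Combine per-residue hinge votes from several predictors.

    predictions : dict name -> list of per-residue votes (0/1 or scores in [0, 1])
    weights     : dict name -> non-negative weight for that predictor
    Returns (combined_scores, hinge_indices), where combined scores are the
    weight-normalized sum of votes and hinge_indices are residues whose
    combined score reaches the threshold.
    """
    lengths = {len(votes) for votes in predictions.values()}
    if len(lengths) != 1:
        raise ValueError("all predictors must score the same number of residues")
    n_residues = lengths.pop()

    total_weight = sum(weights[name] for name in predictions)
    combined = []
    for i in range(n_residues):
        weighted_sum = sum(weights[name] * votes[i]
                           for name, votes in predictions.items())
        combined.append(weighted_sum / total_weight)

    hinges = [i for i, score in enumerate(combined) if score >= threshold]
    return combined, hinges


if __name__ == "__main__":
    # Hypothetical predictors: three structure-based, one sequence-based.
    predictions = {
        "struct_a": [0, 1, 1, 0],
        "struct_b": [0, 1, 0, 0],
        "struct_c": [0, 0, 1, 0],
        "seq":      [1, 1, 0, 0],
    }
    weights = {"struct_a": 2.0, "struct_b": 1.0, "struct_c": 1.0, "seq": 1.0}
    scores, hinges = combine_hinge_predictions(predictions, weights)
    print(scores, hinges)
```

In practice the weights would be fitted against proteins with annotated hinges, so that more reliable predictors contribute more to the combined call.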

We continue to investigate different ways to integrate the large amount of data generated and optimal ways to mine such resources. To this end, we propose the use of manually structured digital abstracts for efficient automatic text mining (Seringhaus and Gerstein, 2008b). We have also attempted to organize strange and notable gene names (Seringhaus and Gerstein, 2008c). We hope this analysis provides clues to better steer gene naming in the future.

Genomics Confounds Gene Classification

M Seringhaus, M Gerstein (2008). American Scientist 96:466-473 (Nov-Dec)