2007-summary

In 2007, we made significant progress toward our research goals. We continue to focus on genome annotation, network analysis and macromolecular motions, and we have developed new tools to process data arising from functional genomics efforts. We briefly describe some of our results below.

As part of the ENCODE consortium, we have worked on several computational aspects of identifying functional elements in the ENCODE regions, a representative 1% of the human genome (Birney et al, 2007). One example is the identification of STAT1 binding regions using both ChIP-chip and ChIP-PET technologies; this study provides information about the scoring and validation of ChIP-based technologies (Euskirchen et al, 2007). We have also statistically analyzed the genomic distribution of transcriptional regulatory sites, utilizing the wealth of data generated by the ENCODE project (Zhang et al, 2007a). We developed a tool for classifying transcriptionally active regions (TARs) using features such as expression levels, genomic location relative to genes, sequence composition and phylogenetic conservation (Rozowsky et al, 2007). This tool will aid in the design of further experiments to understand the function of the many RNAs transcribed in human cells.

We have also worked extensively on other aspects of genome annotation and analysis. We developed a high-throughput, massive paired-end mapping sequencing method and a computational pipeline to identify large-scale structural variations in the human genome, those of ~3kb or larger (Korbel et al, 2007a). We characterized interspecies diversity in gene regulation by performing a ChIP-chip analysis of the transcription factors Ste12 and Tec1 in three yeast species and an evolutionary comparison of their binding sites across the three species (Borneman et al, 2007a). This analysis showed that, despite close sequence similarity, binding sites differ extensively, which probably explains the differences between these organisms, each adapted to its distinct ecological niche. Both of these papers were published in Science.
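The general idea behind paired-end mapping can be illustrated with a minimal sketch: when the two ends of a sequenced fragment map to the reference genome at a distance that deviates strongly from the expected fragment span, the pair hints at a structural variation. This is not the published pipeline; the `MappedPair` type, the `classify_pair` function and the numeric thresholds are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class MappedPair:
    """One paired-end read mapped to the reference (hypothetical record type)."""
    chrom: str
    left_start: int  # leftmost reference coordinate of the first end
    right_end: int   # rightmost reference coordinate of the second end


def classify_pair(pair: MappedPair, expected_span: int, tolerance: int) -> str:
    """Label a pair by comparing its mapped span with the expected fragment span.

    expected_span and tolerance are illustrative assumptions, not values
    taken from the published method.
    """
    span = pair.right_end - pair.left_start
    if span > expected_span + tolerance:
        # Ends lie farther apart in the reference than the fragment allows:
        # the sample may lack sequence that the reference contains.
        return "candidate deletion"
    if span < expected_span - tolerance:
        # Ends lie closer together than expected: the sample may carry
        # sequence that is absent from the reference.
        return "candidate insertion"
    return "concordant"


pairs = [
    MappedPair("chr1", 1000, 9000),
    MappedPair("chr1", 1000, 2500),
    MappedPair("chr1", 1000, 4200),
]
labels = [classify_pair(p, expected_span=3000, tolerance=1000) for p in pairs]
```

In practice one would require several independent pairs supporting the same event before calling a variation; the sketch labels single pairs only.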

On the pseudogene front, we have updated the pseudogene.org database (Karro et al, 2007). We provide tools for the comparison of sets and the creation of layered set unions, enabling researchers to derive a current 'consensus' set of pseudogenes. The database now includes pseudogene data for 64 prokaryotes and 11 eukaryotes, and new genomes are added as new data are obtained. We have also carefully annotated pseudogenes in the ENCODE regions and studied the prevalence of pseudogene transcription using carefully designed RACE experiments. These results, in conjunction with tiling array data and high-throughput sequencing, indicate that a fifth of the pseudogenes are transcriptionally active (Zheng et al, 2007a).
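One way to picture a 'consensus' derived from several pseudogene sets is a support count: an entry belongs to layer k if at least k of the input sets contain it. The sketch below is only an interpretation of that idea, not the pseudogene.org implementation; the function name `layered_unions` and the identifiers in the example are hypothetical, and real pseudogene comparison would match entries by genomic coordinates instead of by exact identifier.

```python
from collections import Counter


def layered_unions(sets):
    """Return {k: entries found in at least k of the input sets}.

    Layer 1 is the plain union; the top layer is the intersection.
    """
    # set(s) guards against an input listing the same entry twice.
    support = Counter(item for s in sets for item in set(s))
    return {
        k: {item for item, n in support.items() if n >= k}
        for k in range(1, len(sets) + 1)
    }


# Hypothetical identifiers from three annotation sources.
source_a = {"pg1", "pg2", "pg3"}
source_b = {"pg2", "pg3", "pg4"}
source_c = {"pg3", "pg5"}

layers = layered_unions([source_a, source_b, source_c])
consensus = layers[2]  # entries supported by at least two sources
```

Choosing the layer is a trade-off: lower layers are more inclusive, higher layers keep only entries that independent annotations agree on.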

We have published several papers on network analysis and highlight a few here. Based on a network analysis of Hsp90 and its interacting proteins, we characterized proteins that need help in folding relative to the background (McClellan et al, 2007). Graph-theoretic analysis of Hsp90 targets in relation to the yeast target network, in combination with GO functional annotation, suggested a role for Hsp90 in cellular trafficking and transport as well as in cell cycle regulation, and allowed us to identify potential targets for experimental studies. In Lu et al (2007), we compare classical pathways with networks and find that current edge definitions cannot convey all of the information in pathways; we propose a prototype edge ontology to overcome this deficiency. Finally, in Kim et al (2007), we showed that positively selected proteins tend to lie on the network periphery, which suggests how human variation is arranged with respect to the interactome.

Using the Molecular Motions Database, we have created a Hinge Atlas of manually annotated hinges and a statistical formalism for calculating the enrichment of various types of residues in these hinges (Flores et al, 2007). We have also developed FlexOracle, a new hinge prediction method for studying hinge motions in proteins (Flores and Gerstein, 2007). FlexOracle is based on the idea that energetic interactions are stronger within structural domains than between them, and that fragments generated by cleaving the protein at the hinge site are independently stable.
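As an illustration of what an enrichment calculation for hinge residues can look like, the sketch below computes a fold enrichment and a one-sided hypergeometric p-value for one residue type. The hypergeometric test is a generic choice assumed here for illustration; it is not necessarily the formalism used in the Hinge Atlas work, and the function names and counts are hypothetical.

```python
from math import comb


def fold_enrichment(total, total_of_type, hinge, hinge_of_type):
    """Frequency of the residue type in hinges divided by its overall frequency."""
    return (hinge_of_type / hinge) / (total_of_type / total)


def hypergeom_enrichment_p(total, total_of_type, hinge, hinge_of_type):
    """P(X >= hinge_of_type) when `hinge` residues are drawn without replacement
    from `total` residues, of which `total_of_type` are of the given type."""
    upper = min(hinge, total_of_type)
    tail = sum(
        comb(total_of_type, k) * comb(total - total_of_type, hinge - k)
        for k in range(hinge_of_type, upper + 1)
    )
    return tail / comb(total, hinge)


# Hypothetical counts: 10 residues overall, 5 of the type of interest,
# 2 hinge residues, both of that type.
fold = fold_enrichment(10, 5, 2, 2)
p_value = hypergeom_enrichment_p(10, 5, 2, 2)
```

With real data, testing all twenty residue types would call for a multiple-testing correction on the resulting p-values.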

Finally, we have published several papers on more efficient ways of storing, retrieving and disseminating data. We have developed LinkHub, a semantic web-based system that facilitates cross-database queries and information retrieval in proteomics (Smith et al, 2007b). Given the deluge of data from genome-scale experiments, it is evident that the traditional way of disseminating information, the published article, is neither efficient nor complete. We suggest improvements in data dissemination and propose a new information architecture with an expansive central index (Seringhaus and Gerstein, 2007). Articles suitable for digital parsing, together with structured digital abstracts, will improve text mining capabilities and bridge the gap between the traditional route of data sharing through published papers and huge data dumps in databases.

Semantic Web Approach to Database Integration in the Life Sciences

KH Cheung, AK Smith, KYL Yip, CJO Baker, MB Gerstein (2007). in Semantic Web: Revolutionizing Knowledge Discovery in the Life Sciences (eds. C Baker and K Cheung, Springer, NY), pp. 11-30