Best2006

Genomics: Pseudogenes

We have performed comprehensive surveys of pseudogenes, in terms of protein families, for the worm, yeast, and human genomes. Using these analyses, we have addressed important evolutionary questions about the types of proteins that existed in an organism's past. In Harrison et al. (2001), we described >2,000 pseudogenes in the worm genome, which we were able to associate with environmental-response protein families. In Harrison et al. (2002), we continued this work, focusing on the yeast genome. We found that the yeast pseudogenes were also associated with environmental-response proteins, and we proposed a possible mechanism whereby the pseudogenes could be resurrected, providing a reservoir of diversity for the organism. In Zhang et al. (2003), we comprehensively assigned pseudogenes in the human genome, finding more than 8,000 high-confidence processed pseudogenes. Processed pseudogenes, unlike the duplicated variety found in the yeast and the worm, tend to be associated with highly expressed proteins, such as ribosomal proteins. This study has allowed us to use pseudogenes to study the natural rate of mutation and variation in the human genome in subsequent work.

Genomics: Tiling Array Analysis of Intergenic Regions

We have analyzed the activity of intergenic regions using tiling array technology. In Luscombe et al. (2003), we presented a tool that we use for efficiently processing large tiling array data sets over the web. In Bertone et al. (2004), we reported the first analysis of a whole human genome tiling array, finding large amounts of transcription in intergenic regions. In Zheng et al. (2005), we further analyzed the transcription in these intergenic regions by evaluating the degree to which the observed transcription intersected with pseudogenes. We discovered hints that many supposedly "dead" pseudogenes may actually have some activity.

Proteomics: Large-scale Prediction of Function and Networks

We have created a systematic approach for integrating different sources of information to predict protein networks and other aspects of protein function. The first paper, Drawid & Gerstein (2000), described a probabilistic system for predicting the subcellular localization of proteins in a Bayesian framework. The second paper, Jansen et al. (2002), analyzed the relationship between expression correlations and protein-protein interactions. We then built upon these two papers in Jansen et al. (2003), where we used data integration of many genomic features to make large-scale predictions of protein-protein interactions in yeast. A number of these predictions were subsequently verified by experiment.
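As a rough illustration of how several evidence sources can be integrated in a Bayesian framework, the sketch below combines likelihood ratios under a naive (conditional independence) assumption. It is not taken from the papers above: the function names, the prior odds, and the likelihood ratios are all hypothetical values chosen for the example.

```python
from math import prod


def posterior_odds(prior_odds, likelihood_ratios):
    """Multiply prior odds by one likelihood ratio per evidence source.

    Assumes the evidence sources are conditionally independent
    (a naive Bayes assumption made for this sketch only).
    """
    return prior_odds * prod(likelihood_ratios)


def odds_to_probability(odds):
    """Convert odds in favor of an interaction into a probability."""
    return odds / (1.0 + odds)


# Hypothetical inputs: prior odds that a random protein pair interacts,
# and likelihood ratios from three made-up genomic features.
prior = 1 / 600
feature_lrs = [20.0, 15.0, 4.0]

odds = posterior_odds(prior, feature_lrs)
probability = odds_to_probability(odds)
print(f"posterior odds = {odds:.2f}, probability = {probability:.2f}")
```

A pair would typically be predicted as interacting when its posterior odds exceed some cutoff, such as 1; the cutoff is a design choice that trades coverage against false positives.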

Proteomics: Analysis of Protein Networks

We have studied the structure of protein networks, both on a large scale in terms of global statistics (e.g. the diameter) and on a small scale in terms of local network motifs, and used these studies to help circumscribe the function of particular genes. In Luscombe et al. (2004), we analyzed the changing structure of the yeast regulatory network in different conditions. Our results showed that the regulatory network changed greatly in its topological properties, such as the usage of motifs and hubs. In Douglas et al. (2005), we introduced a new tool called PubNet, which allows one to build and analyze networks derived both from gene relationships and from literature citations.
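To make the global statistics mentioned above concrete, here is a minimal sketch that computes the diameter of a small network and lists its hubs. The toy graph, the function names, and the degree cutoff are assumptions made for illustration. The sketch also treats the network as undirected and connected, which a real regulatory network generally is not.

```python
from collections import deque


def bfs_distances(graph, start):
    """Shortest-path lengths (in edges) from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def diameter(graph):
    """Longest shortest path between any two nodes of a connected graph."""
    return max(max(bfs_distances(graph, n).values()) for n in graph)


def hubs(graph, min_degree):
    """Nodes with at least min_degree neighbors, in sorted order."""
    return sorted(n for n, nbs in graph.items() if len(nbs) >= min_degree)


# Hypothetical undirected toy network stored as adjacency sets.
toy = {
    "A": {"B", "C", "D"},
    "B": {"A"},
    "C": {"A"},
    "D": {"A", "E"},
    "E": {"D"},
}

print(diameter(toy))   # longest shortest path, here B-A-D-E
print(hubs(toy, 3))    # nodes with three or more neighbors
```

Running one breadth-first search per node is adequate for small networks; larger ones usually call for a dedicated graph library.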

Structural genomics: Fold Usage in Different Organisms

These papers highlight the work we have done in structural genomics, looking at large-scale descriptions of protein structure. In Gerstein (1997), we published one of the first papers looking at the patterns of fold and structural usage across different organisms. We followed this up in Lin & Gerstein (2000), where we used fold and family usage to classify organisms in trees in terms of the number of parts they share. In Qian et al. (2001), we developed a statistical model to explain the power-law occurrence of folds and families in different genomes.

Structural genomics: Mining Outcomes

We developed a widely used database (SPINE) to track the research outcomes from the different labs that make up a large structural genomics collaboration (NESG). Using data mining techniques, we were the first to identify the characteristics of proteins that tend to be most tractable for structural analysis (Bertone et al., 2001).

Structural genomics: Structure-Function Relationships

We have examined the relationship between structure and function on a large scale. In Hegyi & Gerstein (1999), we comprehensively analyzed the structural database to measure the (weak) association between specific folds and specific functions. In Wilson et al. (2000), we looked at the degree to which sequence similarity between proteins within a given fold was predictive of function, as well as the degree to which sequence similarity was proportional to precise 3D-structural similarity (in the sense of RMS deviation).
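For reference, the RMS deviation mentioned above is conventionally defined over N pairs of equivalent atoms after the two structures have been optimally superimposed. The symbols below (N, x_i, y_i) are introduced here for illustration and do not come from the cited paper:

```latex
\mathrm{RMSD} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left\lVert \mathbf{x}_i - \mathbf{y}_i \right\rVert^{2}}
```

Here x_i and y_i are the coordinates of the i-th matched atom in each structure, so a smaller value indicates closer 3D similarity.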

Analysis of Macromolecular Motions in terms of Packing

These two papers are representative of our work analyzing the 3D structures of proteins in detail and relating their motions to packing. Gerstein & Krebs (1998) report a comprehensive database of protein motions, which is associated with a web tool for dynamically simulating possible trajectories. This database has been widely used. Tsai et al. (1999) report a carefully refined set of atomic packing parameters (radii) for proteins and show how the set can be used in a variety of contexts.

Broad Informatics Issues Impacting on Biology

As part of our mission to connect biology with computation, we have also extensively analyzed how a number of larger issues relating to computation in society impact upon biomedicine. In Greenbaum et al. (2004), for instance, we identified an important aspect of how computer security concerns impede database interoperation and, ultimately, limit large-scale genomics collaborations.

Significant Reviews

We have written a number of articles that provide an overview of our research topics. In particular, in Gerstein et al. (2002), we reviewed the area of integrating evidence to predict protein-protein interactions. In Snyder & Gerstein (2003), we discussed how one defines genes and pseudogenes on a large scale in the new genomics era. In Teichmann et al. (1999), we surveyed the potential for structural coverage of a small genome before this was attempted by NIH Structural Genomics Centers.

PubNet: a flexible system for visualizing literature derived networks.

SM Douglas, GT Montelione, M Gerstein (2005). Genome Biol 6: R80.