Genomics: Pseudogenes

We have performed comprehensive surveys of pseudogenes, organized by protein family, for the worm, yeast, and human genomes. Using these analyses, we have addressed important evolutionary questions about the types of proteins that existed in an organism's past. In Harrison et al. (2001), we described more than 2,000 pseudogenes in the worm genome, which we were able to associate with environmental-response protein families. In Harrison et al. (2002), we continued this work, focusing on the yeast genome. We found that these pseudogenes were associated with environmental-response proteins, and we also proposed a possible mechanism whereby the pseudogenes could be resurrected, providing a reservoir of diversity for the organism. In Zhang et al. (2003), we comprehensively annotated pseudogenes in the human genome, finding more than 8,000 high-confidence processed pseudogenes. Processed pseudogenes, unlike the duplicated variety found in the yeast and the worm, tend to be associated with highly expressed proteins, such as ribosomal proteins. In subsequent work, this study has allowed us to use pseudogenes to study the natural rate of mutation and variation in the human genome.

Genomics: Tiling Array Analysis of Intergenic Regions

We have analyzed the activity of intergenic regions using tiling array technology. In Luscombe et al. (2003), we presented a tool that we use for efficiently processing large tiling array data sets over the web. In Bertone et al. (2004), we reported the first analysis of a whole human genome tiling array, finding large amounts of transcription in intergenic regions. In Zheng et al. (2005), we further analyzed the transcription in these intergenic regions by evaluating the degree to which the observed transcription intersected with pseudogenes. We discovered hints that many supposedly "dead" pseudogenes may actually have some activity.
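To make the idea of intersecting observed transcription with pseudogenes concrete, here is a minimal sketch that flags pseudogenes overlapped by at least one transcribed region. The coordinates, the names such as `psi_1`, and the half-open interval convention are all illustrative assumptions; this is not the procedure used in Zheng et al. (2005).

```python
def overlaps(a, b):
    """Return True if half-open intervals a=(start, end) and b=(start, end) overlap."""
    return a[0] < b[1] and b[0] < a[1]


def pseudogenes_with_transcription(pseudogenes, transcribed):
    """List the pseudogenes that overlap at least one transcribed region.

    pseudogenes: dict mapping a name to a (start, end) interval
    transcribed: list of (start, end) intervals on the same chromosome
    """
    return [
        name
        for name, span in pseudogenes.items()
        if any(overlaps(span, region) for region in transcribed)
    ]


# Hypothetical coordinates on a single chromosome (not real data).
pseudogenes = {"psi_1": (100, 200), "psi_2": (500, 650), "psi_3": (900, 1000)}
transcribed = [(150, 180), (640, 700), (1000, 1100)]

hits = pseudogenes_with_transcription(pseudogenes, transcribed)
print(hits)
```

The all-pairs comparison is adequate for a toy example; a genome-scale analysis would more plausibly sort the intervals or use an interval tree.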

Proteomics: Large-scale Prediction of Function and Networks

We have created a systematic approach for integrating different sources of information to predict protein networks and other aspects of protein function. The first paper, Drawid & Gerstein (2000), described a probabilistic system for predicting the subcellular localization of proteins in a Bayesian framework. The second paper, Jansen et al. (2002), analyzed the relationship between expression correlations and protein-protein interactions. We then built upon these two papers in Jansen et al. (2003), where we used data integration of many genomic features to make large-scale predictions of protein-protein interactions in yeast. A number of these predictions were subsequently verified by experiment.
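One simple way to integrate several genomic features in a Bayesian framework is to multiply a prior odds of interaction by a likelihood ratio for each feature, assuming the features are conditionally independent. The sketch below illustrates that naive Bayes style of combination only; the function names and all numbers are hypothetical, and it should not be read as the model published in Jansen et al. (2003).

```python
from math import prod


def posterior_odds(prior_odds, likelihood_ratios):
    """Combine conditionally independent evidence: O_post = O_prior * product of L_i."""
    return prior_odds * prod(likelihood_ratios)


def odds_to_probability(odds):
    """Convert odds in favor of an event into a probability."""
    return odds / (1.0 + odds)


# Hypothetical inputs: a prior odds that a random protein pair interacts,
# and likelihood ratios contributed by three independent genomic features.
prior = 1 / 600
feature_likelihood_ratios = [20.0, 6.0, 10.0]

odds = posterior_odds(prior, feature_likelihood_ratios)
probability = odds_to_probability(odds)
print(odds, probability)
```

A pair would be predicted to interact when its posterior odds exceed a chosen cutoff, for example 1. The independence assumption is the main design limitation: correlated features would be double counted.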

Proteomics: Analysis of Protein Networks

We have studied the structure of protein networks, both on a large scale in terms of global statistics (e.g. the diameter) and on a small scale in terms of local network motifs, and used these studies to help circumscribe the function of particular genes. In Luscombe et al. (2004), we analyzed the changing structure of the yeast regulatory network in different conditions. Our results showed that the regulatory network changed greatly in its topological properties, such as its usage of motifs and hubs. In Douglas et al. (2005), we introduced a new tool called PubNet, which allows one to build and analyze networks derived both from gene relationships and from literature citations.
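As an illustration of the global statistics mentioned above, the following sketch computes the diameter of a small undirected network by breadth-first search and lists its hubs by degree. The toy network, the degree cutoff, and the function names are assumptions made for the example; the code assumes a connected graph stored as an adjacency dictionary.

```python
from collections import deque


def bfs_distances(graph, source):
    """Shortest path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def diameter(graph):
    """Longest shortest path between any two nodes of a connected graph."""
    return max(max(bfs_distances(graph, node).values()) for node in graph)


def hubs(graph, min_degree):
    """Nodes with at least min_degree neighbors, in sorted order."""
    return sorted(n for n, nbrs in graph.items() if len(nbrs) >= min_degree)


# Toy undirected network (hypothetical), stored as node -> set of neighbors.
toy = {
    "A": {"B", "C", "D"},
    "B": {"A"},
    "C": {"A"},
    "D": {"A", "E"},
    "E": {"D"},
}

print(diameter(toy), hubs(toy, min_degree=3))
```

Running one breadth-first search per node is quadratic in network size, which is acceptable for a sketch but not necessarily for genome-scale networks.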

Structural genomics: Fold Usage in Different Organisms

These papers highlight the work we have done in structural genomics, looking at large-scale descriptions of protein structure. In Gerstein (1997), we published one of the first papers looking at the patterns of fold and structural usage across different organisms. We followed this up in Lin & Gerstein (2000), where we used fold and family usage to classify organisms in trees in terms of the number of parts they share. In Qian et al. (2001), we developed a statistical model to explain the power-law occurrence of folds and families in different genomes.
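A power-law occurrence means that the number of folds or families with a given occurrence k falls off roughly as a * k^(-b), which is a straight line on a log-log plot. The sketch below estimates the exponent b by ordinary least squares on log-transformed values. The data are synthetic and the function name is hypothetical; this simple regression stands in for, and does not reproduce, the statistical model of Qian et al. (2001).

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by least squares on log-log values; return (b, a)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Synthetic data generated from an exact power law with b = 2 and a = 1000.
occurrences = [1, 2, 4, 8, 16]
counts = [1000.0 * k ** -2.0 for k in occurrences]

exponent, prefactor = fit_power_law(occurrences, counts)
print(exponent, prefactor)
```

Log-log regression is easy to follow but is known to be a biased estimator on noisy real counts; maximum-likelihood fitting is generally preferred in practice.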

Structural genomics: Mining Outcomes

We developed a widely used database (SPINE) to track the research outcomes from the different labs that make up a large structural genomics collaboration (NESG). Using data mining techniques, we were the first to identify the characteristics of proteins that tend to be most tractable for structural analysis (Bertone et al., 2001).

Structural genomics: Structure-Function Relationships

We have examined the relationship between structure and function on a large scale. In Hegyi & Gerstein (1999), we comprehensively analyzed the structural database to measure the (weak) association between specific folds and specific functions. In Wilson et al. (2000), we looked at the degree to which sequence similarity between proteins within a given fold was predictive of function, as well as the degree to which sequence similarity was proportional to precise 3D-structural similarity (in the sense of RMS deviation).
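For readers unfamiliar with RMS deviation, the sketch below computes it for two equal-length lists of 3D atomic coordinates. It assumes the two structures have already been optimally superimposed and that atoms are paired by position in the lists; the superposition step itself is not shown, and the coordinates are made up for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired 3D points.

    Assumes the structures are already superimposed and that
    coords_a[i] corresponds to coords_b[i].
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Hypothetical coordinates: the second structure is shifted by 1 unit along z.
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]

deviation = rmsd(structure_a, structure_b)
print(deviation)
```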

Analysis of Macromolecular Motions in terms of Packing

These two papers are representative of our work analyzing in detail the 3D structures of proteins and relating their motions to packing. Gerstein & Krebs (1998) report a comprehensive database of protein motions, which is associated with a web tool for dynamically simulating possible trajectories. This database has been heavily used. Tsai et al. (1999) report a carefully refined set of atomic radii and packing parameters for proteins and show how the set can be used in a variety of contexts.

Broad Informatics Issues Impacting on Biology

As part of our mission to connect biology with computation, we have also extensively analyzed how a number of larger issues relating to computation in society impact upon biomedicine. In Greenbaum et al. (2004), for instance, we identified an important aspect of how computer security concerns impede database interoperation and, ultimately, limit large-scale genomics collaborations.

Significant Reviews

We have written a number of articles that provide an overview of our research topics. In particular, in Gerstein et al. (2002), we reviewed the area of integrating evidence to predict protein-protein interactions. In Snyder & Gerstein (2003), we discussed how one defines genes and pseudogenes on a large scale in the new genomics era. In Teichmann et al. (1999), we surveyed the potential for structural coverage of a small genome before this was attempted by NIH Structural Genomics Centers.

PubNet: a flexible system for visualizing literature derived networks.
SM Douglas, GT Montelione, M Gerstein (2005). Genome Biol 6: R80.

Integrated pseudogene annotation for human chromosome 22: evidence for transcription.
D Zheng, Z Zhang, PM Harrison, J Karro, N Carriero, M Gerstein (2005). J Mol Biol 349: 27-45.

Global identification of human transcribed sequences with genome tiling arrays.
P Bertone, V Stolc, TE Royce, JS Rozowsky, AE Urban, X Zhu, JL Rinn, W Tongprasit, M Samanta, S Weissman, M Gerstein, M Snyder (2004). Science 306: 2242-6.

Genomic analysis of regulatory network dynamics reveals large topological changes.
NM Luscombe, MM Babu, H Yu, M Snyder, SA Teichmann, M Gerstein (2004). Nature 431: 308-12.

Computer security in academia - a potential roadblock to distributed annotation of the human genome.
D Greenbaum, SM Douglas, A Smith, J Lim, M Fischer, M Schultz, M Gerstein (2004). Nat Biotechnol 22: 771-2.

Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome.
Z Zhang, PM Harrison, Y Liu, M Gerstein (2003). Genome Res 13: 2541-58.

A Bayesian networks approach for predicting protein-protein interactions from genomic data.
R Jansen, H Yu, D Greenbaum, Y Kluger, NJ Krogan, S Chung, A Emili, M Snyder, JF Greenblatt, M Gerstein (2003). Science 302: 449-53.

ExpressYourself: A modular platform for processing and visualizing microarray data.
NM Luscombe, TE Royce, P Bertone, N Echols, CE Horak, JT Chang, M Snyder, M Gerstein (2003). Nucleic Acids Res 31: 3477-82.

Genomics. Defining genes in the genomics era.
M Snyder, M Gerstein (2003). Science 300: 258-60.

A small reservoir of disabled ORFs in the yeast genome and its implications for the dynamics of proteome evolution.
P Harrison, A Kumar, N Lan, N Echols, M Snyder, M Gerstein (2002). J Mol Biol 316: 409-19.

Relating whole-genome expression data with protein-protein interactions.
R Jansen, D Greenbaum, M Gerstein (2002). Genome Res 12: 37-46.

Proteomics. Integrating interactomes.
M Gerstein, N Lan, R Jansen (2002). Science 295: 284-7.

Protein family and fold occurrence in genomes: power-law behaviour and evolutionary model.
J Qian, NM Luscombe, M Gerstein (2001). J Mol Biol 313: 673-81.

SPINE: an integrated tracking database and data mining approach for identifying feasible targets in high-throughput structural proteomics.
P Bertone, Y Kluger, N Lan, D Zheng, D Christendat, A Yee, AM Edwards, CH Arrowsmith, GT Montelione, M Gerstein (2001). Nucleic Acids Res 29: 2884-98.

Digging for dead genes: an analysis of the characteristics of the pseudogene population in the Caenorhabditis elegans genome.
PM Harrison, N Echols, MB Gerstein (2001). Nucleic Acids Res 29: 818-30.

A Bayesian system integrating expression data with sequence patterns for localizing proteins: comprehensive application to the yeast genome.
A Drawid, M Gerstein (2000). J Mol Biol 301: 1059-75.

Whole-genome trees based on the occurrence of folds and orthologs: implications for comparing genomes on different levels.
J Lin, M Gerstein (2000). Genome Res 10: 808-18.

Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores.
CA Wilson, J Kreychman, M Gerstein (2000). J Mol Biol 297: 233-49.

The packing density in proteins: standard radii and volumes.
J Tsai, R Taylor, C Chothia, M Gerstein (1999). J Mol Biol 290: 253-66.

Advances in structural genomics.
SA Teichmann, C Chothia, M Gerstein (1999). Curr Opin Struct Biol 9: 390-9.

The relationship between protein structure and function: a comprehensive survey with application to the yeast genome.
H Hegyi, M Gerstein (1999). J Mol Biol 288: 147-64.

A database of macromolecular motions.
M Gerstein, W Krebs (1998). Nucleic Acids Res 26: 4280-90.
