Research Program (figures: pages 1, 2, 3 and 4; ppt of figures)
As we move into the 21st century, the biological sciences are being transformed by the advent of large-scale data. The sequencing of the human genome is a most dramatic example of this. Simultaneously, with this increase in biological data, computers and computation have had a transforming effect on the way information is handled, stored, and mined. These computational advances, of course, apply to many facets of life. The goal of my lab is to connect these two developments, harnessing computational advances for the analysis of large-scale biological data, principally by carrying out integrative surveys and systematic data mining.
We see the raw human genome sequence as the natural starting point for biological sciences research in the future. Specifically, we are focused on protein bioinformatics: understanding the structure, function, and evolution of proteins through analyzing populations of them in the databases and in whole-genome experiments. Through such work we can address two central post-genomic challenges: (i) interpreting the intergenic regions between genes and (ii) understanding genes in detail. With regard to the first challenge, we are analyzing protein fossils (pseudogenes) in intergenic regions to determine what they tell us about the molecular history of humans. With regard to the second challenge we are pursuing three inter-related avenues of research. (ii-a) We are trying to connect proteins into networks, circumscribing their function. (ii-b) We are trying to understand how a range of structural and functional diversity can be achieved from a limited repertoire of protein folds and families. And (ii-c) we are trying to interpret protein function in terms of macromolecular motions and understand these in terms of 3D-structural packing.
Overall, these four research topics follow a progression from surveying the overall genomic landscape, through analyzing individual proteins in more detail, to zooming in on the detailed chemical structure of particular molecules. In all our research, we work in a collaborative framework as members of multi-disciplinary teams.
In our genomics work, we particularly focus on interpreting intergenic regions in terms of pseudogenes. We were one of the first groups to perform comprehensive surveys of pseudogenes on a genome-wide scale in terms of protein families, which we did for human, worm, yeast and a number of other organisms (Zhang et al., 2002a, 2002b, 2003, 2004; Harrison et al., 2001, 2002a, 2002c, 2003a, 2003b; Zhang & Gerstein, 2003c, 2003e; Liu et al., 2004a; Pseudogene.org, Fig. 1a). Collectively, these studies enable us to determine the common "pseudofolds" and "pseudofamilies" in various genomes and to address important evolutionary questions about the types of proteins that were present in the past history of an organism (Zhang & Gerstein, 2004; Harrison & Gerstein, 2002; Snyder & Gerstein, 2003; Harrison et al., 2002b, Fig. 1b,c). Overall, we have found the following pattern: processed pseudogenes tend to be associated with highly expressed proteins; duplicated pseudogenes are associated with environmental response proteins (e.g. chemoreceptors); and prokaryotic pseudogenes are associated with horizontal transfer events. Pseudogenes are also important components in our studies of the natural rate of mutation and variation in the genome and how this is coupled with functional shifts (Zhang & Gerstein, 2003b; Echols et al., 2002; Balasubramanian et al., 2002; Balasubramanian et al., 2005; Naylor & Gerstein, 2000).
We are involved in a number of experimental collaborations (e.g. ENCODE) to probe the activity of intergenic regions with tiling array technology (Bertone et al., 2004; Martone et al., 2003; Encode Project Consortium, 2004, Fig. 1f). We have developed tools to design, score and interpret these arrays and to highlight particular array artifacts (Royce et al., 2005, 2006; Bertone et al., 2006; Luscombe et al., 2003; Kluger et al., 2003a; Qian et al., 2003b; Tiling.gersteinlab.org, Fig. 1e). The overall conclusion from this work has been that much of the intergenic region of the human genome appears to be active, both transcriptionally and in terms of protein binding. Coupling this work with that on pseudogenes, we have found hints that some of the supposedly "dead" pseudogenes may actually have some activity (Zheng et al., 2005; Harrison et al., 2005, Fig. 1d).
After the main elements of the human genome are identified, one needs to characterize their function. Because of the size and complexity of the genome, it is not feasible to carry out conventional biochemical experiments for the functional annotation of each protein encoded by the genome. Thus, a central problem in proteomics is how to determine protein function on a large scale. There is a wide range of computational approaches to address this problem, from traditional sequence pattern matching to newer methods dealing with networks. The basic idea of the latter is to circumscribe the function of a gene in terms of all the other genes that it interacts with or that are associated with it in some sense (Lan et al., 2002, 2003). As outlined below, we are pursuing all of these approaches.
As a preliminary, we have carried out a number of studies inter-relating various functional genomics features in yeast, such as essentiality, localization, protein abundance and protein interactions, looking for non-obvious but large-scale correlations (Cheung et al., 2005; Yu et al., 2003; Drawid et al., 2000; Jansen & Gerstein, 2000; YeastHub.gersteinlab.org, Fig. 2a). In particular, we have examined the degree to which gene expression correlations relate to protein interactions and regulatory networks, developed tools for correlating and clustering expression profiles, and measured the extent to which protein abundance is related to gene expression (Greenbaum et al., 2001, 2002, 2003b; Kluger et al., 2003b; Jansen et al., 2002a, 2003b; Qian et al., 2001b).
The culmination of the above work inter-relating genomic features has been a systematic, statistical approach for integrating different sources of biological information. Our goal in this has been predicting protein networks and other aspects of protein function, such as protein sub-cellular localization. We were among the first groups to have carried out these studies on functional genomics data, employing a variety of data-mining techniques, such as Bayesian statistics and support vector machines (Drawid & Gerstein, 2000; Qian et al., 2003a; Jansen et al., 2002b, 2003a; Gerstein et al., 2002; Yu et al., 2004a; Mateos et al., 2002; Xia et al., 2006, Fig. 2b). We have had a number of localization and interaction predictions experimentally verified, and, in more recent work, we have carefully assessed the degree to which the data quality and the specific mining approach employed are associated with the strength of the predictions (Kumar et al., 2002; Jansen & Gerstein, 2004; Lu et al., 2005; Edwards et al., 2002; Lin et al., 2004).
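To give a flavor of what Bayesian integration of heterogeneous evidence can look like, the following is a minimal sketch and not the method of the cited papers. It assumes a naive Bayes setting in which the evidence sources are conditionally independent, so that the posterior odds of an interaction are the prior odds multiplied by one likelihood ratio per feature. The function names, feature names, prior odds and likelihood ratios are all hypothetical values chosen for illustration.

```python
from math import prod


def posterior_odds(prior_odds, likelihood_ratios):
    """Combine evidence sources under a naive Bayes (independence) assumption.

    Each likelihood ratio is P(feature | interacting) / P(feature | not interacting).
    """
    return prior_odds * prod(likelihood_ratios)


def odds_to_probability(odds):
    """Convert odds into a probability between 0 and 1."""
    return odds / (1.0 + odds)


# Hypothetical numbers for a single candidate protein pair (illustrative only).
prior = 1 / 600  # assumed prior odds that a randomly chosen pair interacts
evidence = {
    "coexpression": 10.0,
    "shared_localization": 4.0,
    "both_essential": 3.0,
}

odds = posterior_odds(prior, evidence.values())
probability = odds_to_probability(odds)
print(f"posterior odds = {odds:.3f}, probability = {probability:.3f}")
```

In this toy setting, three individually weak sources raise the odds from 1 in 600 to roughly 1 in 5, which is the basic appeal of integrating many noisy data sets rather than relying on any single one.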
We have also studied the structure of protein networks, both on a large scale in terms of global statistics (e.g. the diameter) and on a small scale in terms of local network motifs. In particular, we have correlated network hubs with gene essentiality (Yu et al., 2004b, Fig. 2c). Recently, we developed a number of tools to build and analyze networks derived from genes and also from literature citations (Douglas et al., 2005; Xia et al., 2004; Yu et al., 2004b, 2006; TopNet.gersteinlab.org, PubNet.gersteinlab.org, Fig. 2e). We have also investigated the dynamics of networks -- i.e., how their topology changes over time. In particular, we have identified changing hubs and systematic patterns of connectivity rewiring in the yeast regulatory network (Luscombe et al., 2004, Fig. 2d).
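The global statistics mentioned above can be made concrete with a small sketch. It is an illustration only, not the implementation behind the tools cited: it assumes an undirected, connected toy network, defines the diameter as the longest shortest path, and takes "hubs" to be the top fraction of nodes by degree (the 20% cutoff is an arbitrary assumption). The node labels and helper names are hypothetical.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from a list of (node, node) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path_lengths(graph, source):
    """Breadth-first search distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def diameter(graph):
    """Longest shortest path in the graph; assumes the graph is connected."""
    return max(max(shortest_path_lengths(graph, n).values()) for n in graph)


def hubs(graph, top_fraction=0.2):
    """Nodes in the top fraction by degree (the cutoff is an assumption)."""
    ranked = sorted(graph, key=lambda n: len(graph[n]), reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    return ranked[:k]


# Hypothetical toy interaction network.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("E", "F"), ("F", "G")]
graph = build_graph(edges)
print("diameter:", diameter(graph))
print("hubs:", hubs(graph))
```

A hub list of this kind could then be compared against a list of essential genes, which is the type of correlation described in the text.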
Another area of research in our lab is structural genomics. Here, we conceptualize proteins not purely as character sequences or as abstract network nodes, but more in terms of their actual molecular structure.
A central tenet in structural genomics is that one can potentially obtain information about the function of an uncharacterized protein through determining its 3D structure and then searching for structural similarities to proteins of known function. Thus, large-scale characterization of protein structures provides another approach, in addition to protein networks, to functionally characterize the proteome. In order to address this issue, we have tried to measure, globally, the degree to which fold is associated with function (Hegyi & Gerstein, 1999, Fig. 3a).
We have also examined the large-scale relationships between sequence, structure and function in order to understand the extent to which structural and functional annotation can reliably be transferred between similar sequences, particularly when similarity is expressed in modern probabilistic language (Chothia & Gerstein, 1997; Levitt & Gerstein, 1998). The key issue here is defining appropriate sequence similarity thresholds for the transfer of annotation (Wilson et al., 2000; Hegyi & Gerstein, 2001, Fig. 3d). To aid in this endeavor we have developed an approach for structurally aligning proteins (Alexandrov & Gerstein, 2004; Gerstein & Levitt, 1998b).
Protein folds and families can also be related to phylogeny and deep evolutionary history; we were among the first laboratories to address questions of this nature (Gerstein & Levitt, 1997; Gerstein, 1997, 1998a, 2000; Gerstein & Hegyi, 1998; Teichmann et al., 1999; Liu et al., 2004b; Hegyi et al., 2002; Qian et al., 2000; PartsList.org). Our studies enabled us to recognize that particular folds are more common in certain organisms than in others. We also found that fold occurrence can be used to build whole-genome trees, with the distances between organisms defined in terms of the presence or absence of specific folds (Gerstein, 1998b; Lin & Gerstein, 2000; Lin et al., 2002; GeneCensus.org, Fig. 3b).
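As a rough illustration of how fold presence or absence can define a distance between organisms, the sketch below computes pairwise Jaccard distances between sets of folds. The choice of the Jaccard measure is an assumption made here for simplicity and may differ from the distance used in the cited work; the genome labels and the fold assignments are hypothetical. A tree-building method (e.g. hierarchical clustering) would then be applied to the resulting distance matrix.

```python
from itertools import combinations


def jaccard_distance(folds_a, folds_b):
    """1 minus (shared folds / all folds seen in either genome)."""
    union = folds_a | folds_b
    if not union:
        return 0.0
    return 1.0 - len(folds_a & folds_b) / len(union)


def distance_matrix(fold_sets):
    """Pairwise distances keyed by alphabetically ordered genome pairs."""
    return {
        (g1, g2): jaccard_distance(fold_sets[g1], fold_sets[g2])
        for g1, g2 in combinations(sorted(fold_sets), 2)
    }


# Hypothetical fold content for three genomes (illustrative only).
fold_sets = {
    "genome_1": {"TIM_barrel", "Rossmann", "ferredoxin_like", "Ig_like"},
    "genome_2": {"TIM_barrel", "Rossmann", "ferredoxin_like"},
    "genome_3": {"TIM_barrel", "Ig_like", "globin_like"},
}

distances = distance_matrix(fold_sets)
closest_pair = min(distances, key=distances.get)
for pair, d in sorted(distances.items()):
    print(pair, round(d, 2))
print("closest pair:", closest_pair)
```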
While we established that the most common folds often differed between genomes, in all cases the occurrence of folds tends to follow power-law statistics, with a few common folds and many rare ones. (Similar conclusions apply for many other aspects of genomic biology.) We have proposed a simple evolutionary model that naturally gives rise to these statistics (Qian et al., 2001a; Luscombe et al., 2002, Fig. 3e). We have also analyzed the association of protein families, particularly those of membrane proteins, with various motifs (Liu et al., 2002a; Senes et al., 2000; Schneider et al., 2002).
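To illustrate how a very simple growth process can yield a few common folds and many rare ones, here is a minimal simulation sketch. It implements a generic duplication-plus-innovation process and is not claimed to be the model of the cited papers: at each step, with an assumed probability a gene with a brand-new fold is added, and otherwise a randomly chosen existing gene is duplicated, so that folds grow in proportion to their current size. The function name, the parameter values and the random seed are all illustrative assumptions.

```python
import random
from collections import Counter


def simulate_fold_occurrence(steps, innovation_prob, seed=0):
    """Grow a genome by gene duplication and occasional fold innovation.

    Returns a Counter mapping fold id -> number of genes with that fold.
    """
    rng = random.Random(seed)
    genes = [0]  # each gene is labeled by its fold id; start with one fold
    next_fold = 1
    for _ in range(steps):
        if rng.random() < innovation_prob:
            # Innovation: a gene with a previously unseen fold appears.
            genes.append(next_fold)
            next_fold += 1
        else:
            # Duplication: copy a uniformly chosen existing gene, so large
            # fold families are proportionally more likely to grow.
            genes.append(rng.choice(genes))
    return Counter(genes)


occurrence = simulate_fold_occurrence(steps=5000, innovation_prob=0.05)
sizes = sorted(occurrence.values(), reverse=True)
print("number of folds:", len(sizes))
print("largest families:", sizes[:5])
print("folds occurring once:", sum(1 for s in sizes if s == 1))
```

Plotting the number of folds against family size on log-log axes is one way to check whether the simulated distribution is approximately a straight line, as expected for power-law behavior.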
As part of our work on structural genomics, we have also begun to relate the properties of proteins to their eventual success at being purified and structurally characterized (Goh et al., 2004a; Bertone et al., 2001; Kimber et al., 2003; Savchenko et al., 2003; Christendat et al., 2000; Balasubramanian et al., 2000). This has been done within a database and decision-tree mining framework that we have built for a structural genomics consortium (SPiNE.NESG.org; Goh et al., 2003; Wunderlich et al., 2004, Fig. 3c).
The final area of focus in the lab is analyzing small populations of structures in terms of their detailed 3D-geometry and physical properties. Here, we try to interpret macromolecular motions in terms of packing.
Our motions classification scheme is motivated by the fact that protein interiors are packed exceedingly tightly, and the tight packing can greatly constrain a protein's mobility. We have developed tools for measuring and comparing the packing efficiency at different interfaces (e.g. inter-domain, protein surface, helix-helix, protein vs. RNA) using specialized geometric constructions (e.g. Voronoi polyhedra) (Voss & Gerstein, 2005; Tsai et al., 1999, 2001; Tsai & Gerstein, 2002; Helix.gersteinlab.org, Fig. 4c). For this we have generated a new packing parameter set, which includes self-consistent VDW radii and standard volumes for each atom type.
In summary, my lab acts as a connector, bringing quantitative approaches from disciplines such as computer science and applied math to bear on real questions and data in molecular biology. In particular, we have extensively applied classical computational approaches involving simulation, machine learning, and database design to biological problems.
We have engaged in quite a number of practical, experimental collaborations, where we have often functioned as part of a multi-disciplinary team. This team participation is a key feature of the lab. Beyond the collaborations already referred to above (e.g. ENCODE and NESG), our collaborative efforts can roughly be arranged under a number of headings:
[i] proteomic studies involving protein chips and large-scale protein interaction sets (Ptacek et al., 2005; Hall et al., 2005; Zhu et al., 2000, 2001; Li et al., 2004; Gelperin et al., 2005);
[ii] medium-scale protein abundance studies (Lian et al., 2001, 2002);
[iii] functional genomics studies focused on profiling mutant yeast phenotypes (Kumar et al., 2002, 2004; Giaever et al., 2002; Ross-Macdonald et al., 1999);
[iv] gene expression analysis in human and plants (Gilad et al., 2005; Rinn et al., 2004; Jiao et al., 2003; Subrahmanyam et al., 2001);
[v] tiling array studies, particularly focused on regulatory network elucidation in human and yeast via ChIP-chip (Hartman et al., 2006; Urban et al., 2006; Stolc et al., 2005; Euskirchen et al., 2005; White et al., 2004; Rinn et al., 2003; Horak et al., 2002a,b); and
[vi] interpretative motif analysis, particularly in relation to membrane protein genetic-selection experiments and microRNA target identification (Huber et al., 2005; Grosshans et al., 2005; Coric et al., 2005; Freeman-Cook et al., 2004).
Moreover, as part of our mission to connect biology with computation, we have also extensively analyzed how a number of larger issues relating to computation in society impact upon biological research. In particular, we have examined how issues associated with e-publishing and digital libraries relate to biomedical databases and how various legal and security concerns significantly impact genomics database interoperation (Smith et al., 2005; Greenbaum et al., 2004; Greenbaum & Gerstein, 2003; Gerstein & Junker, 2002; Gerstein, 1999a,b,c; Gerstein, 2000, Fig. 4b). We envision a future where there will be less distinction between databases and journals. One will be able both to find understandable prose in database entries and to apply computation directly on specially constructed parts of journal articles. Such a scenario will help overcome many of the problems now facing biological databases, including quality control, attribution of credit, and error correction.
This document is closely coupled to my publication list in the following fashion: Most publications since the lab opened in 1/97 up to the present time (Dec. 2005) are referred to. The references are in the Jones et al. (2002) format. However, if there is more than one paper matching this citation, a letter (e.g. a, b, c) is appended to the citation in the order that the reference occurs in the publication list.
Note, to keep things simple:
(i) No attempt has been made to refer to the scientific literature generally, and this document should not be construed as a review of the field.
(ii) Each paper and URL is cited only once in the text, even when it could potentially be referred to at multiple places in the text. This comment is most applicable to the experimental collaborations, particularly the tiling array analysis, which is referred to under the sections headed "Genomics" and "Broader Issues".