October 2000 Volume 7 Number 10 pp 903-909

Structural proteomics of an archaeon

Dinesh Christendat1, 2, Adelinda Yee1, 2, Akil Dharamsi1, 3, Yuval Kluger4, Alexei Savchenko1, John R. Cort5, Valerie Booth1, Cameron D. Mackereth6, Vivian Saridakis1, Irena Ekiel7, Guennadi Kozlov8, Karen L. Maxwell9, Ning Wu1, Lawrence P. McIntosh6, Kalle Gehring8, Michael A. Kennedy5, Alan R. Davidson9, 10, Emil F. Pai1, 9, 10, Mark Gerstein4, Aled M. Edwards1, 11 & Cheryl H. Arrowsmith1

1. Ontario Cancer Institute and Department of Medical Biophysics, University of Toronto, 610 University Ave, Toronto, Ontario, Canada M5G 2M9.

A set of 424 nonmembrane proteins from Methanobacterium thermoautotrophicum was cloned, expressed and purified for structural studies. Of these, 20% were found to be suitable candidates for X-ray crystallographic or NMR spectroscopic analysis without further optimization of conditions, providing an estimate of the number of the most accessible structural targets in the proteome. A retrospective analysis of the experimental behavior of these proteins suggested some simple relations between sequence and solubility, implying that databases of protein properties will be useful in optimizing high throughput strategies. Of the first 10 structures determined, several provided clues to biochemical functions that were not detectable from sequence analysis, and in many cases these putative functions could be readily confirmed by biochemical methods. This demonstrates that structural proteomics is feasible and can play a central role in functional genomics.
The completion and near completion of the sequencing phase of genome projects has ushered in the age of proteomics, the study of all gene products in an organism. This flood of sequence information, coupled with recent advances in molecular and structural biology, has led to the concept of 'structural proteomics' or 'structural genomics', the determination of three-dimensional protein structures on a genome-wide scale. An important use of three-dimensional structural information is to uncover clues to a protein's function that are not detectable from sequence analysis1, 2. This application of structural proteomics is driven by the realization that <30% of all predicted eukaryotic proteins have a known function. A related use of structural proteomics is to determine enough three-dimensional structures to define a 'basic parts list' of protein folds3, 4. Most other structures could then be modeled from this basis set using computational techniques3, 5. The long term goal is to determine experimental structures for all proteins, because it is the subtle differences in protein structure that contribute to the diversity and complexity of life, and current modeling techniques are not yet accurate enough to reveal these subtleties6.

As reported in this manuscript, we initiated a prototype structural proteomics study of 424 nonmembrane proteins from the proteome of Methanobacterium thermoautotrophicum ΔH (M.th.). The primary goals of this research are to evaluate the technical hurdles involved in such a high throughput project, to estimate the percentage of proteins encoded by a genome that are immediately amenable to structure analysis, and to assess the extent to which function can be inferred from structure.

Target selection
Cloning and expression strategy

The M.th. ORFs were divided arbitrarily into two groups, 'large' (>20 kDa monomer size) and 'small' (<20 kDa monomer size). Large proteins were processed for crystallization trials and small proteins for NMR feasibility studies. Most (80%) successfully cloned M.th. proteins could be expressed in E. coli BL21-Gold (DE3) cells (Stratagene), although efficient expression often required the presence of a second plasmid encoding three tRNAs that are frequently used by archaea and eukaryotes but are rare in E. coli. While most proteins could be expressed to reasonable levels, many were not expressed in soluble form (<0.5 mg l⁻¹ soluble protein), especially in the case of the larger proteins (Fig. 2). It may be possible to reduce the attrition rate due to poor solubility by optimizing the expression conditions for each clone. However, in the interests of throughput we used a single set of growth conditions optimized for the majority of M.th. proteins.

Preparation and screening of structural samples
To better understand the contribution of protein stability to sample behavior, the thermal unfolding of 60 folded M.th. proteins was analyzed. Of these, 22 could be unfolded and refolded in a fully reversible manner. However, among the 19 proteins with 'excellent' NMR spectra that were tested in this manner, only nine refolded reversibly. The others precipitated at high temperatures, demonstrating that even among well-folded, small, soluble proteins, reversible thermal unfolding in vitro is not a ubiquitous property. Surprisingly, eight proteins classified as 'aggregated' by NMR were well behaved in thermal unfolding experiments, indicating that these proteins are probably large discrete oligomers rather than nonspecific aggregates.

As expected for proteins from a thermophilic organism, those from M.th. all possessed high thermostability, with transition midpoint temperature (Tm) values between 68 and 98 °C. Due to their low change in heat capacity (ΔCp) upon unfolding, small proteins are generally expected to have higher Tm values than larger proteins8. Here, however, we observed no correlation between the length of the M.th. proteins and their Tm values. The ΔCp values of small M.th. proteins were within the expected range compared to a large number of other proteins that have been investigated9. These data suggest that, except for their high thermal stability, the overall thermodynamic behavior of the M.th. proteins studied here may be representative of proteins from mesophilic organisms.

Retrospective analysis of biophysical data
The full tree classifying the proteins according to their solubility (yes/no) had 35 final nodes and 65% overall accuracy in cross-validated tests. However, a number of the rules encoded within the tree were of better predictive value (these are highlighted in Fig. 3). For example, proteins that fulfill the following sequence of four conditions are likely to be insoluble: (i) they have a hydrophobic stretch, that is, a long region (>20 residues) with average GES-scale hydrophobicity < -0.85 kcal mole⁻¹; (ii) Gln composition <4%; (iii) Asp + Glu composition <17%; and (iv) aromatic composition >7.5%. This rule has a 14% error rate in comparison to the default error rate of 39% for choosing a soluble protein without the aid of the tree. The probability that it could arise by chance is 1%, assuming one randomly chose the 24 insoluble proteins from the initial pool of 143 insoluble and 213 soluble proteins. These calculations are based on a 'pessimistic estimate for errors'10, taking the upper bound of the 95% confidence interval (see Fig. 3 for details). Conversely, proteins that do not have a hydrophobic stretch and have more than 27% of their residues in (hydrophilic) 'low complexity' regions are likely to be soluble. This rule has a 'pessimistic' error rate of 20% in contrast to 39% without the tree and a 1% probability of occurring by chance. We also derived similar trees for expressibility and crystallizability (available from http://bioinfo.mbb.yale.edu/labdb/datamine). The statistics for these were less reliable because the underlying data sets were smaller. However, we did find that Asn composition appeared to be relevant to crystallizability. In particular, an Asn threshold of 3.5% was able to select a set of 18 crystallizable and only one noncrystallizable protein from our initial set of 25 crystallizable and 39 noncrystallizable proteins.
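The four-condition insolubility rule and the 'pessimistic' error estimate can be illustrated with a minimal Python sketch. This is not the authors' code: the class and field names (`SequenceFeatures`, `min_stretch_hydrophobicity`, and so on) are hypothetical, the features are assumed to have been computed beforehand, and the error bound assumes a normal approximation to the binomial with an inflation of 2 standard deviations, which roughly corresponds to the upper bound of a 95% confidence interval.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class SequenceFeatures:
    """Precomputed per-protein values; all field names are illustrative."""
    min_stretch_hydrophobicity: float  # lowest average GES value over a >20-residue region
    gln_pct: float                     # Gln composition, percent
    asp_glu_pct: float                 # Asp + Glu composition, percent
    aromatic_pct: float                # Phe + Trp + Tyr composition, percent


def likely_insoluble(f: SequenceFeatures) -> bool:
    """Apply the four conditions quoted in the text, in order."""
    return (
        f.min_stretch_hydrophobicity < -0.85  # (i) has a hydrophobic stretch
        and f.gln_pct < 4.0                   # (ii) low Gln
        and f.asp_glu_pct < 17.0              # (iii) low Asp + Glu
        and f.aromatic_pct > 7.5              # (iv) high aromatic content
    )


def pessimistic_error(n_errors: int, n_cases: int, z: float = 2.0) -> float:
    """Observed error rate inflated by z binomial standard deviations.

    Assumption: normal approximation to the binomial; capped at 1.0.
    """
    if n_cases <= 0:
        raise ValueError("n_cases must be positive")
    p = n_errors / n_cases
    return min(1.0, p + z * sqrt(p * (1.0 - p) / n_cases))


if __name__ == "__main__":
    # Hypothetical protein, not taken from the study.
    example = SequenceFeatures(-1.2, 2.0, 10.0, 9.0)
    print(likely_insoluble(example))
    # Hypothetical leaf with 1 misclassified protein out of 25.
    print(round(pessimistic_error(1, 25), 3))
```

The counts passed to `pessimistic_error` are invented for illustration. The published 14% and 20% figures depend on the actual node contents shown in Fig. 3.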
Together these data suggest that, given a large enough data set, it may be possible to derive sets of 'rules' from primary sequence that are predictive of a given protein's biophysical properties.

Structural and functional analysis
Five of the 10 structures either contained a bound ligand (providing an immediate function) or a ligand binding site that could be inferred from structural homology. MTH150 was originally annotated as 'conserved', being highly homologous to a family of archaeal proteins of unknown function. This protein copurified and cocrystallized with NAD+, immediately revealing at least one biochemical function. MTH150 has a nucleotide binding fold and structural similarity to a number of nucleotidyltransferases. Furthermore, MTH150 contains an HXGH motif that is positioned similarly to the active site HXGH motif found in these enzymes, suggesting a similar activity. A literature search for adenylyl transferases revealed that, subsequent to our structure determination, a sequence homolog from M. jannaschii was reported to catalyze the condensation of nicotinamide mononucleotide with ATP15, suggesting that the MTH150 cocrystal contains the product-bound form of the M.th. enzyme. Additional biochemical studies have confirmed that MTH150 indeed has nicotinamide mononucleotide adenylyltransferase (NMNATase) activity.

MTH152 also shares sequence homology with several other archaeal proteins of unknown function. The purified protein was yellow, indicative of a flavin-like ligand bound to the protein, and crystallization required the presence of Ni2+. Anomalous dispersion from the Ni2+ ions and MAD phasing were used to solve the structure of this cocrystal. The structure showed that Ni2+ is octahedrally coordinated by both the protein and the phosphate moiety of bound flavin mononucleotide (FMN), suggesting that Ni2+ may play an integral role in cofactor binding and/or participate in a catalytic mechanism.

Although MTH538 is annotated as 'unknown', its NMR solution structure16 uncovered a strong structural similarity with two protein classes, flavodoxin and the CheY family of bacterial response regulator proteins and domains.
NMR chemical shift perturbation studies of MTH538 with either FMN or F420, a related flavin-like compound found in methanogens, showed no evidence for binding of these cofactors. In contrast, titration of MTH538 with Mg2+, a cofactor required for phosphorylation of CheY and related proteins, caused specific chemical shift perturbations for those residues that are predicted to be affected by Mg2+ based on structural homology17. However, MTH538 lacks a critical Asp, which is phosphorylated in CheY, and, unlike CheY, is not affected by treatment with acetyl phosphate. Nevertheless, the substantial structural similarity between these proteins and the phosphorylation independent receiver domain AmiC, together with a small level of sequence similarity between MTH538 and a family of putative ATPases or kinases (COG1618), suggests that MTH538 may have a role in a phosphorylation independent two component system, such as the AmiR-AmiC system18.
MTH129 is a known ortholog of orotidine 5' monophosphate decarboxylase, which catalyzes the exchange of CO2 for a proton at the C6 position of uridine 5' monophosphate (UMP) at a rate 17 orders of magnitude faster than that of the uncatalyzed reaction in aqueous solution. Prior to this project, neither the structure nor the catalytic mechanism of any member of this enzyme family was known. The three-dimensional crystal structure of MTH129 in both the free and inhibitor bound forms allowed a detailed understanding of the remarkable catalytic power of this enzyme19.

The NMR solution structure of MTH40 (ref. 20), which is homologous to the essential RPB10 subunit of RNA polymerase II, revealed a novel Zn binding motif that we term a Zn bundle. The protein folds as a three-helix bundle stabilized by a metal ion coordinated by a highly conserved but atypical CX2CXnCC sequence motif (where X is any naturally occurring amino acid). This represents the first example of two adjacent zinc binding Cys residues within an α-helix, thus expanding the database of known metal binding motifs. Based on the pattern of conserved and charged surface residues, as well as structural similarity to the N-terminal Zn binding domains of HIV-1 and HIV-2 integrases, insights were gained into the potential role of RPB10 as a scaffold protein within the multisubunit polymerase. These insights were confirmed when the NMR derived structure of M.th. RPB10 was used to help interpret the electron density map of the homologous subunit in yeast RNA polymerase II21. This demonstrates that generating high quality structures of individual proteins or domains in the context of structural genomics projects can facilitate the structure determination of larger biomolecular complexes.

Similarly, MTH1048 is homologous to the RPB5 subunit of RNA polymerase II. The role of this subunit within the polymerase is poorly understood.
The solution NMR structure of this protein22 revealed a distinctive 'mushroom'-like shape, with one half of the molecular surface composed of conserved hydrophobic residues suggestive of a site for protein-protein interactions. In contrast, the opposite surface contains fewer conserved residues and a high density of charged residues, suggestive of a region that is either solvent exposed or possibly one that interacts with nucleic acids. This interpretation was also confirmed by examination of the crystal structure of yeast RNA polymerase II, in which the 'stem' of the mushroom is buried within RPB1, and the 'cap' of the mushroom is more exposed21.
The case of MTH1615 is particularly illustrative of both the iterative use of NMR and other biophysical methods in determining protein structures and how knowledge of a structure can lead to simple experiments that provide immediate insights into function. MTH1615 is a member of a gene family whose human ortholog was identified in a screen for genes involved in apoptosis23. The NMR spectrum was promising, but not immediately amenable to rapid structure determination; the initial HSQC of this protein had many peaks in the central random coil region, indicative of an unfolded polypeptide, as well as a large number of dispersed peaks indicative of a folded structure. Using limited proteolysis followed by mass spectrometry and NMR analysis, we found that the N-terminal 31 residues of MTH1615 were unstructured in solution. A smaller construct lacking the first 31 residues was then prepared, and the structure of this protease resistant domain was determined using NMR spectroscopy. From this structure we modeled that of the human protein, revealing a conserved basic cleft. Since the human protein has been shown to localize to the nucleus, we also tested MTH1615 for DNA binding activity. Using electrophoretic mobility shift assays (EMSA), we found that MTH1615 can interact nonspecifically with a randomly chosen 20-mer of double stranded DNA, suggesting that the human protein may be involved in nucleic acid binding or metabolism. It is important to note that the unstructured N-terminal region was difficult to identify from sequence analysis. This region is strongly predicted to be an α-helix using PHD24 and the alignment of seven highly conserved orthologs from M.th. to human.

MTH1699 was identified from sequence analysis as an archaebacterial translation elongation factor 1β (aEF-1β), which acts as a guanine nucleotide exchange factor. It has an α/β sandwich fold typical of many RNA binding proteins.
While this structure determination was underway (ref. 25), the structure of human EF-1β (ref. 26) was published, revealing structural homology to the functionally similar eubacterial EF-Ts protein. These three structures provide a dramatic example of functional and structural conservation that is not evident from sequence. In all three kingdoms nucleotide exchange occurs in an identical manner, via a conserved Phe or Tyr on an exposed loop. However, unlike either hEF-1β or EF-Ts, MTH1699 was found to bind calcium. This novel feature may play a functional role in archaeal protein translation or may simply serve to increase the protein's thermal stability.

MTH1184 is annotated as 'unknown', and consists of a β-sheet region followed by an α-helix and an unstructured C-terminus. The β-sheet region contains a CXCX...XCXC sequence with Cys residues located in two proximal loops and pointing towards each other. While this motif is potentially capable of metal binding, we were unable to detect zinc binding to the protein, suggesting specificity for another metal.

MTH1175 is a member of an uncharacterized COG (COG1433) that is predominantly represented by archaea. Therefore, from sequence homology it is currently impossible to gain insight into its function. SCOP analysis of MTH1175 reveals that it is most similar to structures within the ribonuclease H superfamily. While MTH1175 lacks the key catalytic residues of RNase H, an RNA binding capability is suggested by a Gly and Arg-rich region in the flexible C-terminus. However, the biochemical function of this protein remains to be determined.

Conclusions
Sample preparation. All cloning and initial expression, purification and HSQC/crystal screens were performed at the Ontario Cancer Institute (OCI) over a 12 month period by A.D., D.C. & A.Y. with the help of one FTE technician and (for 4 months) six summer students. In some cases clones were sent to individual labs, where they were expressed with the appropriate isotopes or SeMet labels, data were acquired and structures were solved (MTH129 by N.W.; MTH1184 and MTH1699 by G.K. and I.E.; MTH40 by C.M.). In the remaining cases the NMR sample or SeMet crystal was prepared at the OCI and the NMR or diffraction data were collected and structures solved by individual labs (MTH538 and MTH1175 by J.R.C. and M.A.K.; MTH150 by D.C.; MTH1615 by V.B.; MTH150 by V.S.; MTH1048 by A.Y.). CD data were collected and analyzed by A.S., K.L.M. and A.R.D.

Each target gene was PCR amplified from genomic DNA, with terminal incorporation of unique restriction sites, using high fidelity Pfu DNA polymerase (Stratagene). The PCR products were directionally cloned into the pET15b bacterial expression vector (Novagen). A single PCR protocol and set of cloning conditions were optimized for M.th. based on an initial set of 50 genes. Positive clones were confirmed by colony PCR screening using Taq DNA polymerase. For large proteins (>20 kDa per monomer), three colonies from each transformation were tested for protein expression on a small scale (50 ml). Proteins found to be soluble by SDS-PAGE analysis of the bacterial extract were prepared on a larger scale (2 l). These proteins were purified by a combination of heat treatment (55 °C) and nickel affinity chromatography, followed by thrombin cleavage and removal of the hexa-His tag. The heat treatment caused a significant enrichment of many, but not all, M.th. proteins. Proteins were judged to be 99% pure on an overloaded Coomassie blue stained SDS gel. Occasionally mass spectrometry was used to monitor the integrity of purified proteins.
No evidence of frequent or systematic mutations was found. Proteins that 'survived' the purification process (75%) were concentrated to 10 mg ml⁻¹ and subjected to a sparse-matrix crystallization screen of 48 conditions at room temperature (Matrix screen 1; Hampton Research).
Proteins for initial NMR HSQC screens were expressed five at a time, each in 1 l of 15N-enriched minimal media, and purified in parallel using metal affinity chromatography. The resulting 15N-labeled hexa-His fusion proteins were concentrated by ultrafiltration to 5-20 mg ml⁻¹ and were typically 90-95% pure as judged by Coomassie blue stained SDS-PAGE. 15N,13C-labeled proteins prepared for three-dimensional structure determination were further purified to >98% purity by either gel filtration or ion exchange chromatography after removal of the His tag.

Decision tree analysis. This analysis was carried out by Y.K. and M.B.G. Under each intermediate node, the decision tree algorithm calculates all possible splitting thresholds for each of 53 variables (hydrophobicity, amino acid composition, etc.). It picks the optimal splitting variable and its threshold so that at least one of the two daughter nodes is as homogeneous as possible. When a variable, v, is split, v < threshold is the left branch, and v > threshold is the right branch. The specific parameters used at each node and their thresholds for the right branches shown in Fig. 3 are, in descending order (from top root to bottom leaves): hydrophobe > -0.85 kcal mole⁻¹ (where 'hydrophobe' represents the average GES hydrophobicity of a sequence stretch; the higher this value, the lower the energy transfer); cplx > 0.28 (where 'cplx' is a measure of a local sequence complexity region based on the SEG program10); Q > 4%; DE > 17%; I > 5.6%; FWY > 7.5%; DE > 13.6%; GAVLI > 42%; hydrophobe > 0.01 kcal mole⁻¹; HKR > 12%; W composition > 1.2%; and α-helical secondary structure composition > 58%. In the preceding pathway, Q represents Gln composition; DE, Asp + Glu composition; and other quantities are defined similarly. Note that two of the variables are conditioned on more than once (hydrophobe, Asp + Glu).
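The exhaustive threshold search described above can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the software used in the study: the function names `purity` and `best_split` are hypothetical, node homogeneity is assumed to be measured as the majority-class fraction, candidate thresholds are assumed to be midpoints between consecutive observed values, and ties are broken in favor of the larger homogeneous node.

```python
from typing import Optional, Sequence, Tuple


def purity(labels: Sequence[bool]) -> float:
    """Fraction of the majority class in a node (1.0 means homogeneous)."""
    positives = sum(labels)
    return max(positives, len(labels) - positives) / len(labels)


def best_split(
    rows: Sequence[Sequence[float]],
    labels: Sequence[bool],
    min_leaf: int = 1,
) -> Optional[Tuple[int, float, Tuple[float, int]]]:
    """Try every variable and every candidate threshold.

    Returns (variable index, threshold, (purity, size) of the purest
    daughter node), or None if no split leaves min_leaf cases per side.
    """
    best = None
    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for row, y in zip(rows, labels) if row[j] < t]
            right = [y for row, y in zip(rows, labels) if row[j] >= t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            # Score a split by its most homogeneous daughter; prefer larger nodes.
            score = max((purity(left), len(left)), (purity(right), len(right)))
            if best is None or score > best[2]:
                best = (j, t, score)
    return best


if __name__ == "__main__":
    # Hypothetical toy data: column 0 mimics 'hydrophobe', column 1 a composition.
    rows = [[-1.0, 3.0], [-0.9, 5.0], [-0.95, 2.0], [0.1, 4.0], [0.2, 3.5], [0.3, 2.5]]
    soluble = [False, False, False, True, True, True]
    print(best_split(rows, soluble, min_leaf=2))
```

A minimum leaf size is included because, without one, a split that isolates a single protein always looks perfectly homogeneous, which is the over-fitting risk the text associates with small terminal nodes.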
The shorter the decision pathway and the larger the number of cases in the terminal node, the lower the risk of over-fitting the data. Heterogeneous leaves could be further split (dotted lines in Fig. 3), improving the error rate but risking over-fitting of the training set. The predictive values of the pathways were evaluated using a 'pessimistic estimation' procedure that assumes that the error rate at each node is binomially distributed, and then inflates the rate found on a tree based on all the data (by 2 standard deviations) to arrive at a more realistic estimate10, 29. Further details can be found at http://bioinfo.mbb.yale.edu/labdb/datamine.

Coordinates. The PDB accession codes for each structure (and the BioMagResBank accession numbers where applicable) are as follows: MTH150, 1ej2; MTH152, 1eje; MTH538, 1eiw (4793); MTH129, 1dv7; MTH40, 1ef4 (4571); MTH1048, 1eik (4678); MTH1615, 1eij (4674); MTH1699, 1d5k (4385); MTH1184, 1dw7 (4740); MTH1175, 1eo1 (4796).

Received 20 June 2000; Accepted 2 August 2000.