October 2000 Volume 7 Number 10 pp 903 - 909

Structural proteomics of an archaeon

Dinesh Christendat1, 2, Adelinda Yee1, 2, Akil Dharamsi1, 3, Yuval Kluger4, Alexei Savchenko1, John R. Cort5, Valerie Booth1, Cameron D. Mackereth6, Vivian Saridakis1, Irena Ekiel7, Guennadi Kozlov8, Karen L. Maxwell9, Ning Wu1, Lawrence P. McIntosh6, Kalle Gehring8, Michael A. Kennedy5, Alan R. Davidson9, 10, Emil F. Pai1, 9, 10, Mark Gerstein4, Aled M. Edwards1, 11 & Cheryl H. Arrowsmith1

1. Ontario Cancer Institute and Department of Medical Biophysics, University of Toronto, 610 University Ave, Toronto, Ontario, Canada M5G 2M9.

A set of 424 nonmembrane proteins from Methanobacterium thermoautotrophicum were cloned, expressed and purified for structural studies. Of these,
The completion and near completion of the sequencing phase of genome projects has ushered in the age of proteomics, the study of all gene products in an organism. This flood of sequence information, coupled with recent advances in molecular and structural biology, has led to the concept of 'structural proteomics' or 'structural genomics', the determination of three-dimensional protein structures on a genome-wide scale. An important use of three-dimensional structural information of proteins is to uncover clues as to a protein's function that are not detectable from sequence analysis1, 2. This application of structural proteomics is driven by the realization that <30% of all predicted eukaryotic proteins have a known function. A related use of structural proteomics information is to determine a sufficient number of three-dimensional structures to define a 'basic parts list' of protein folds3, 4. Most other structures could then be modeled from this basis set using computational techniques3, 5. The long term goal is to determine experimental structures for all proteins because it is the subtle differences in protein structure that contribute to the diversity and complexity of life, and current modeling techniques are not yet accurate enough to reveal these subtleties6.

As reported in this manuscript, we initiated a prototype structural proteomics study of 424 nonmembrane proteins from the proteome of Methanobacterium thermoautotrophicum

Target selection
Cloning and expression strategy

The M.th. ORFs were divided arbitrarily into two groups, 'large' (>20 kDa monomer size) and 'small' (<20 kDa monomer size). Large proteins were processed for crystallization trials and small proteins for NMR feasibility studies. Most (

Preparation and screening of structural samples
To better understand the contribution of protein stability to sample behavior, the thermal unfolding of 60 folded M.th. proteins was analyzed. Of these, 22 could be unfolded and refolded in a fully reversible manner. However, among the 19 proteins with 'excellent' NMR spectra that were tested in this manner, only nine refolded reversibly. The others precipitated at high temperatures, demonstrating that even among well-folded, small, soluble proteins, reversible thermal unfolding in vitro is not a ubiquitous property. Surprisingly, eight proteins classified as 'aggregated' by NMR were well behaved in thermal unfolding experiments, indicating that these proteins are probably large discrete oligomers rather than nonspecific aggregates. As expected for proteins from a thermophilic organism, those from M.th. all possessed high thermostability with transition midpoint temperature (Tm) values between 68 and 98 °C. Due to their low change in heat capacity (

Retrospective analysis of biophysical data
The full tree classifying the proteins according to their solubility (yes/no) had 35 final nodes and 65% overall accuracy in cross-validated tests. However, a number of the rules encoded within the tree were of better predictive value (these are highlighted in Fig. 3). For example, proteins that fulfill the following sequence of four conditions are likely to be insoluble: (i) have a hydrophobic stretch, that is, a long region (>20 residues) with average GES-scale hydrophobicity < -0.85 kcal mol⁻¹; (ii) Gln composition <4%; (iii) Asp + Glu composition <17%; and (iv) aromatic composition >7.5%. This rule has a 14% error rate in comparison to the default error rate of 39% for choosing a soluble protein without the aid of the tree. The probability that it could arise by chance is 1%, assuming one randomly chose the 24 insoluble proteins from the initial pool of 143 insoluble and 213 soluble proteins. These calculations are based on a 'pessimistic estimate for errors'10, taking the upper bound of the 95% confidence interval (see Fig. 3 for details). Conversely, proteins that do not have a hydrophobic stretch and have more than 27% of their residues in (hydrophilic) 'low complexity' regions are likely to be soluble. This rule has a 'pessimistic' error rate of 20% in contrast to 39% without the tree and a 1% probability of occurring by chance.

We also derived similar trees for expressibility and crystallizability (available from http://bioinfo.mbb.yale.edu/labdb/datamine). The statistics for these were less reliable because the underlying data sets were smaller. However, we did find that composition of Asn appeared to be relevant to crystallizability. In particular, an Asn threshold of 3.5% was able to select a set of 18 crystallizable and only one noncrystallizable protein from our initial set of 25 crystallizable and 39 noncrystallizable proteins.
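The four-condition insolubility rule quoted above can be written as a simple predicate on a protein sequence. The following is only an illustrative sketch, not the code used in the study: the function names, the 21-residue window (one reading of '>20 residues') and the requirement that the caller supply the hydrophobicity scale as a dictionary are all assumptions made for this example.

```python
from typing import Mapping


def composition(seq: str, residues: str) -> float:
    """Percentage of positions in `seq` occupied by any residue in `residues`."""
    if not seq:
        return 0.0
    return 100.0 * sum(seq.count(r) for r in residues) / len(seq)


def has_hydrophobic_stretch(
    seq: str,
    scale: Mapping[str, float],
    window: int = 21,       # assumption: '>20 residues' read as a 21-residue window
    cutoff: float = -0.85,  # threshold quoted in the text (kcal/mol)
) -> bool:
    """True if any `window`-residue segment has a mean scale value below `cutoff`.

    `scale` is a caller-supplied hydrophobicity table (hypothetical argument);
    a residue missing from it raises KeyError.
    """
    if len(seq) < window:
        return False
    values = [scale[r] for r in seq]
    total = sum(values[:window])
    if total / window < cutoff:
        return True
    # Slide the window one residue at a time, updating the running sum.
    for i in range(window, len(values)):
        total += values[i] - values[i - window]
        if total / window < cutoff:
            return True
    return False


def likely_insoluble(seq: str, scale: Mapping[str, float]) -> bool:
    """Apply the four conditions of the insolubility rule in sequence."""
    return (
        has_hydrophobic_stretch(seq, scale)
        and composition(seq, "Q") < 4.0      # (ii) Gln < 4%
        and composition(seq, "DE") < 17.0    # (iii) Asp + Glu < 17%
        and composition(seq, "FWY") > 7.5    # (iv) aromatic > 7.5%
    )
```

A sequence is flagged only when all four conditions hold, which mirrors a single root-to-leaf path in the tree rather than the full 35-node classifier.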
Together these data suggest that, given a large enough data set, it may be possible to derive sets of 'rules' from primary sequence that are predictive of a given protein's biophysical properties.

Structural and functional analysis
Five of the 10 structures either contained a bound ligand (providing an immediate function) or a ligand binding site that could be inferred from structural homology.

MTH150 was originally annotated as 'conserved', being highly homologous to a family of archaeal proteins of unknown function. This protein copurified and cocrystallized with NAD+, immediately revealing at least one biochemical function. MTH150 has a nucleotide binding fold and structural similarity to a number of nucleotidyltransferases. Furthermore, MTH150 contains an HXGH motif that is similarly positioned to the active site HXGH motif found in these enzymes, suggesting a similar activity. A literature search for adenylyl transferases revealed that subsequent to our structure determination, a sequence homolog from M. jannaschii was reported to catalyze the condensation of nicotinamide mononucleotide with ATP15, suggesting that the MTH150 cocrystal contains the product-bound form of the M.th. enzyme. Additional biochemical studies have confirmed that MTH150 indeed has nicotinamide mononucleotide adenylyltransferase (NMNATase) activity.

MTH152 also shares sequence homology with several other archaeal proteins of unknown function. The purified protein was yellow, indicative of a flavin-like ligand bound to the protein, and crystallization required the presence of Ni2+. Anomalous dispersion from the Ni2+ ions and MAD phasing were used to solve the structure of this cocrystal. The structure showed that Ni2+ is octahedrally coordinated with both the protein and the phosphate moiety of bound flavin mononucleotide (FMN), suggesting that Ni2+ may play an integral role in cofactor binding and/or participate in a catalytic mechanism.

Although MTH538 is annotated as 'unknown', its NMR solution structure16 uncovered a strong structural similarity with two protein classes, flavodoxin and the CheY family of bacterial response regulator proteins and domains.
NMR chemical shift perturbation studies of MTH538 with either FMN or F420, a related flavin-like compound found in methanogens, showed no evidence for binding of these cofactors. In contrast, titration of MTH538 with Mg2+, a cofactor required for phosphorylation of CheY and related proteins, caused specific chemical shift perturbations for those residues that are predicted to be affected by Mg2+ based on structural homology17. However, MTH538 lacks a critical Asp, which is phosphorylated in CheY, and, unlike CheY, is not affected by treatment with acetyl phosphate. Nevertheless, the substantial structural similarity between these proteins and the phosphorylation independent receiver domain AmiC, together with a small level of sequence similarity between MTH538 and a family of putative ATPases or kinases (COG1618), suggests MTH538 may have a role in a phosphorylation independent two component system, such as the AmiR-AmiC system18.
MTH129 is a known ortholog of orotidine 5' monophosphate decarboxylase, which catalyzes the exchange of CO2 for a proton at the C6 position of uridine 5' monophosphate (UMP) at a rate 17 orders of magnitude faster than that of the uncatalyzed reaction in aqueous solution. Prior to this project, neither the structure nor the catalytic mechanism of any member of this enzyme family was known. The three-dimensional crystal structure of MTH129 in both the free and inhibitor bound forms allowed a detailed understanding of the remarkable catalytic power of this enzyme19.

The NMR solution structure of MTH40 (ref. 20), which is homologous to the essential RPB10 subunit of RNA polymerase II, revealed a novel Zn binding motif that we term a Zn bundle. The protein folds as a three-helix bundle stabilized by a metal ion coordinated by a highly conserved but atypical CX₂CXₙCC sequence motif (where X is any naturally occurring amino acid). This represents the first example of two adjacent zinc binding Cys residues within an

Similarly, MTH1048 is homologous to the RPB5 subunit of RNA polymerase II. The role of this subunit within polymerase is poorly understood. The solution NMR structure of this protein22 revealed a distinctive 'mushroom'-like shape with one half of the molecular surface composed of conserved hydrophobic residues suggestive of a site for protein-protein interactions. In contrast, the opposite surface contains fewer conserved residues and a high density of charged residues, suggestive of a region that is either solvent exposed or possibly one that interacts with nucleic acids. This interpretation was also confirmed by examination of the crystal structure of yeast RNA polymerase II, in which the 'stem' of the mushroom is buried within RPB1, and the 'cap' of the mushroom is more exposed21.
The case of MTH1615 is particularly illustrative both of the iterative use of NMR and other biophysical methods in determining protein structures and of how knowledge of a structure can lead to simple experiments that provide immediate insights into function. MTH1615 is a member of a gene family whose human ortholog was identified in a screen for genes involved in apoptosis23. The NMR spectrum was promising, but not immediately amenable to rapid structure determination; the initial HSQC of this protein had many peaks in the central random coil region, indicative of an unfolded polypeptide, as well as a large number of dispersed peaks indicative of a folded structure. Using limited proteolysis followed by mass spectrometry and NMR analysis, we found that the N-terminal 31 residues of MTH1615 were unstructured in solution. A smaller construct lacking the first 31 residues was then prepared, and the structure of this protease resistant domain determined using NMR spectroscopy. From this structure we modeled that of the human protein, revealing a conserved basic cleft. Since the human protein has been shown to localize to the nucleus, we also tested MTH1615 for DNA binding activity. Using electrophoretic mobility shift assays (EMSA), we found that MTH1615 can interact nonspecifically with a randomly chosen 20-mer of double stranded DNA, suggesting that the human protein may be involved in nucleic acid binding or metabolism. It is important to note that it was difficult to identify the unstructured N-terminal region from sequence analysis. This region is strongly predicted to be an

MTH1699 was identified from sequence analysis as an archaebacterial translation elongation factor 1

MTH1184 is annotated as 'unknown', and consists of a

MTH1175 is a member of an uncharacterized COG (COG1433) that is predominantly represented by archaea. Therefore, from sequence homology it is currently impossible to gain insight into its function. SCOP analysis of MTH1175 reveals that it is most similar to structures within the ribonuclease H superfamily. While MTH1175 lacks the key catalytic residues of RNase H, an RNA binding capability is suggested by a Gly and Arg-rich region in the flexible C-terminus. However, the biochemical function of this protein remains to be determined.

Conclusions
Sample preparation. All cloning and initial expression, purification and HSQC/crystal screens were performed at the Ontario Cancer Institute (OCI) over a 12 month period by A.D., D.C. & A.Y. with the help of one FTE technician and (for 4 months) six summer students. In some cases clones were sent to individual labs where they were expressed with the appropriate isotopes or SeMet labels, data acquired and structures solved (MTH129 by N.W.; MTH1184 and MTH1699 by G.K. and I.E.; MTH40 by C.M.). In the remaining cases the NMR sample or SeMet crystal was prepared at the OCI and the NMR or diffraction data collected and structures solved by individual labs (MTH538 and MTH1175 by J.R.C. and M.A.K.; MTH150 by D.C.; MTH1615 by V.B.; MTH150 by V.S.; MTH1048 by A.Y.). CD data were collected and analyzed by A.S., K.L.M. and A.R.D.

Each target gene was PCR amplified from genomic DNA, with terminal incorporation of unique restriction sites, using high fidelity Pfu DNA polymerase (Stratagene). The PCR products were directionally cloned into the pET15b bacterial expression vector (Novagen). A single PCR protocol and set of cloning conditions were optimized for M.th. based on an initial set of 50 genes. Positive clones were confirmed by colony PCR screening using Taq DNA polymerase.

For large proteins (>20 kDa per monomer), three colonies from each transformation were tested for protein expression on a small scale (50 ml). Proteins found to be soluble by SDS-PAGE analysis of the bacterial extract were prepared on a larger scale (2 l). These proteins were purified by a combination of heat treatment (55 °C) and nickel affinity chromatography, followed by thrombin cleavage and removal of the hexa-His tag. The heat treatment caused a significant enrichment of many, but not all, M.th. proteins. Proteins were judged to be 99% pure on an overloaded Coomassie blue-stained SDS gel. Occasionally mass spectrometry was used to monitor the integrity of purified proteins. No evidence of frequent or systematic mutations was found. Proteins that 'survived' the purification process (
Proteins for initial NMR HSQC screens were expressed five at a time, each in 1 l of 15N-enriched minimal media and purified in parallel using metal affinity chromatography. The resulting 15N-labeled hexa-His fusion proteins were concentrated by ultrafiltration to

Decision tree analysis. This analysis was carried out by Y.K. and M.B.G. Under each intermediate node, the decision tree algorithm calculates all possible splitting thresholds for each of 53 variables (hydrophobicity, amino acid composition, etc.). It picks the optimal splitting variable and its threshold, in order for at least one of the two daughter nodes to be as homogeneous as possible. When a variable, v, is split, v < threshold is the left branch, and v > threshold is the right branch. The specific parameters used at each node and their thresholds for the right branches shown in Fig. 3 are in descending order (from top root to bottom leaves): hydrophobe > 0.85 kcal mol⁻¹ (where 'hydrophobe' represents the average GES hydrophobicity of a sequence stretch; the higher this value, the lower the energy transfer); cplx > 0.28 (where 'cplx' is a measure of a local sequence complexity region based on the SEG program10); Q > 4%; DE > 17%; I > 5.6%; FWY > 7.5%; DE > 13.6%; GAVLI > 42%; hydrophobe > 0.01 kcal mol⁻¹; HKR > 12%; W composition > 1.2%; and

Coordinates. The PDB accession codes for each structure (and the BioMagResBank accession numbers where applicable) are as follows: MTH150, 1ej2; MTH152, 1eje; MTH538, 1eiw (4793); MTH129, 1dv7; MTH40, 1ef4 (4571); MTH1048, 1eik (4678); MTH1615, 1eij (4674); MTH1699, 1d5k (4385); MTH1184, 1dw7 (4740); MTH1175, 1eo1 (4796).

Received 20 June 2000; Accepted 2 August 2000.
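The threshold search described under 'Decision tree analysis' can be sketched in a few lines. This is a hedged illustration only: the text does not give the exact homogeneity measure or tie-breaking used, so the majority-class purity score, the `min_leaf` guard against single-protein nodes, the midpoint candidate thresholds and all function names below are assumptions made for this example.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def purity(labels: Sequence[bool]) -> float:
    """Fraction of the majority class in a non-empty list of boolean labels."""
    positives = sum(labels)
    return max(positives, len(labels) - positives) / len(labels)


def best_split(
    rows: List[Dict[str, float]],
    labels: List[bool],
    min_leaf: int = 2,  # assumption: ignore splits that leave a tiny daughter node
) -> Optional[Tuple[str, float, float]]:
    """Scan every variable and candidate threshold at one node.

    Returns (variable, threshold, score), where score is the purity of the
    more homogeneous daughter node, or None if no admissible split exists.
    """
    best: Optional[Tuple[str, float, float]] = None
    for var in sorted(rows[0]):
        values = sorted({row[var] for row in rows})
        # Candidate thresholds: midpoints between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[var] < threshold]
            right = [y for row, y in zip(rows, labels) if row[var] > threshold]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            # At least one daughter node should be as homogeneous as possible.
            score = max(purity(left), purity(right))
            if best is None or score > best[2]:
                best = (var, threshold, score)
    return best
```

Applying `best_split` recursively to each daughter node would grow a tree; scoring on the purer daughter (rather than an average over both) reflects the stated goal that at least one of the two nodes be as homogeneous as possible.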