http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010573421
Marcotte, E. M.; Pellegrini, M.; Thompson, M. J.; Yeates, T. O.; Eisenberg, D.
A combined algorithm for genome-wide prediction of protein function [see comments]
*Algorithms; Colorectal Neoplasms/etiology; Evolution, Molecular; Fungal Proteins/classification/genetics/*physiology; Human; Phylogeny; Prions/classification/physiology; RNA, Messenger/biosynthesis; Saccharomyces cerevisiae; Support, Non-U.S. Gov't; Support, U.S. Gov't, Non-P.H.S.
The availability of over 20 fully sequenced genomes has driven the development of new methods to find protein function and interactions. Here we group proteins by correlated evolution, correlated messenger RNA expression patterns and patterns of domain fusion to determine functional relationships among the 6,217 proteins of the yeast Saccharomyces cerevisiae. Using these methods, we discover over 93,000 pairwise links between functionally related yeast proteins. Links between characterized and uncharacterized proteins allow a general function to be assigned to more than half of the 2,557 previously uncharacterized yeast proteins.
Examples of functional links are given for a protein family of previously unknown function, a protein whose human homologues are implicated in colon cancer and the yeast prion Sup35.
Molecular Biology Institute, UCLA-DOE Laboratory of Structural Biology and Molecular Medicine, University of California, Los Angeles 90095, USA.
Comment in: Nature 1999 Nov 4;402(6757):23, 25-6
0010573421
Nature 1999;402(6757):83-6

Martin, A. C.; Orengo, C. A.; Hutchinson, E. G.; Jones, S.; Karmirantzou, M.; Laskowski, R. A.; Mitchell, J. B.; Taroni, C.; Thornton, J. M.
Protein folds and functions
Binding Sites; Carbohydrates/chemistry/metabolism; DNA/metabolism; DNA-Binding Proteins/chemistry/classification/metabolism; Enzymes/chemistry/metabolism; Heme/chemistry/metabolism; Models, Molecular; *Models, Theoretical; Nucleotides/metabolism; Protein Conformation; *Protein Folding; Proteins/chemistry/*classification/*metabolism; Software; Structure-Activity Relationship
BACKGROUND: The recent rapid increase in the number of available three-dimensional protein structures has further highlighted the necessity to understand the relationship between biological function and structure. Using structural classification schemes such as SCOP, CATH and DALI, it is now possible to explore global relationships between protein fold and function, something which was previously impractical. RESULTS: Using a relational database of CATH data we have generated fold distributions for arbitrary selections of proteins automatically. These distributions have been examined in the light of protein function and bound ligand. Different enzyme classes are not clearly reflected in distributions of protein class and architecture, whereas the type of bound ligand has a much more dramatic effect. CONCLUSIONS: The availability of structural classification data has enabled this novel overview analysis.
We conclude that function at the top level of the EC number enzyme classification is not related to fold, as only a very few specific residues are actually responsible for enzyme activity. Conversely, the fold is much more closely related to ligand type.
Department of Biochemistry and Molecular Biology, University College, London, UK.
0009687369
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0009687369
http://www.biomednet.com/article/st6706
Structure 1998;6(7):875-84

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010547299
Muller, A.; MacCallum, R. M.; Sternberg, M. J.
Benchmarking PSI-BLAST in genome annotation
Algorithms; Bacterial Proteins/*chemistry/classification/genetics/*metabolism; Benchmarking; *Computational Biology; Conserved Sequence; Databases, Factual; False Positive Reactions; *Genome, Bacterial; Internet; Multigene Family; Mycobacterium tuberculosis/chemistry/enzymology/genetics; Mycoplasma/chemistry/enzymology/genetics; Open Reading Frames/genetics; Sensitivity and Specificity; Sequence Alignment; *Sequence Homology, Amino Acid; *Software; Structure-Activity Relationship
The recognition of remote protein homologies is a major aspect of the structural and functional annotation of newly determined genomes. Here we benchmark the coverage and error rate of genome annotation using the widely used homology-searching program PSI-BLAST (position-specific iterated basic local alignment search tool). This study evaluates the one-to-many success rate for recognition, as often there are several homologues in the database and only one needs to be identified for annotating the sequence. In contrast, previous benchmarks considered one-to-one recognition in which a single query was required to find a particular target.
The benchmark constructs a model genome from the full sequences of the structural classification of protein (SCOP) database and searches against a target library of remote homologous domains (<20 % identity). The structural benchmark provides a reliable list of correct and false homology assignments. PSI-BLAST successfully annotated 40 % of the domains in the model genome that had at least one homologue in the target library. This coverage is more than three times that if one-to-one recognition is evaluated (11 % coverage of domains). Although a structural benchmark was used, the results equally apply to just sequence homology searches. Accordingly, structural and sequence assignments were made to the sequences of Mycoplasma genitalium and Mycobacterium tuberculosis (see http://www.bmm.icnet.uk). The extent of missed assignments and of new superfamilies can be estimated for these genomes for both structural and functional annotations. Copyright 1999 Academic Press.
Biomolecular Modelling Laboratory, Imperial Cancer Research Fund, 44 Lincoln's Inn Fields, London, WC2A 3PX, England.
0010547299
J Mol Biol 1999;293(5):1257-71

A Murzin; S E Brenner; T Hubbard; C Chothia
1995
SCOP: A Structural Classification of Proteins for the Investigation of Sequences and Structures
J. Mol. Biol. 247:536-540
URL: http://prosci.biomol.uci.edu/scop/

Galperin, M. Y.; Koonin, E. V.
Searching for drug targets in microbial genomes
Anti-Infective Agents/*pharmacology; Bacteria/*drug effects/genetics/pathogenicity; Carrier Proteins/drug effects; Enzymes/drug effects; *Genome, Bacterial; Membrane Proteins/drug effects; Species Specificity; Virulence/genetics
Comparative analysis of the complete genome sequences of 10 bacterial pathogens available in the public databases offers the first insights into the drug discovery approaches of the near future.
Genes that are conserved in different genomes often turn out to be essential, which makes them attractive targets for new broad-spectrum antibiotics. Subtractive genome analysis reveals the genes that are conserved in all or most of the pathogenic bacteria but not in eukaryotes; these are the most obvious candidates for drug targets. Species-specific genes, on the other hand, may offer the possibility to design drugs against a particular, narrow group of pathogens.
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA. galperin@ncbi.nlm.nih.gov.
0010600691
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010600691
http://www.biomednet.com/article/bta606
Curr Opin Biotechnol 1999;10(6):571-8

Gelbart, W. M.; Crosby, M.; Matthews, B.; Rindone, W. P.; Chillemi, J.; Russo Twombly, S.; Emmert, D.; Ashburner, M.; Drysdale, R. A.; Whitfield, E.; Millburn, G. H.; de Grey, A.; Kaufman, T.; Matthews, K.; Gilbert, D.; Strelets, V.; Tolstoshev, C.
FlyBase: a Drosophila database. The FlyBase consortium
Animal; *Base Sequence; *Databases, Factual; Drosophila melanogaster/*genetics; Genes, Insect; Support, Non-U.S. Gov't; Support, U.S. Gov't, P.H.S.
FlyBase is a database of genetic and molecular data concerning Drosophila. FlyBase is maintained as a relational database (in Sybase) and is made available as html documents and flat files. The scope of FlyBase includes: genes, alleles (and phenotypes), aberrations, transposons, pointers to sequence data, clones, stock lists, Drosophila workers and bibliographic references.
The Encyclopedia of Drosophila is a joint effort between FlyBase and the Berkeley Drosophila Genome Project which integrates FlyBase data with those from the BDGP.
FlyBase, Biological Laboratories, 16 Divinity Avenue, Cambridge, MA 02138, USA.
0009045212
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0009045212
http://www.oup.co.uk/nar/Volume_25/Issue_01/gka012_gml.abs.html
Nucleic Acids Res 1997;25(1):63-6

M Gerstein
1997
A Structural Census of Genomes: Comparing Eukaryotic, Bacterial and Archaeal Genomes in terms of Protein Structure
J. Mol. Biol. 274:562-576

M Gerstein
1998
Patterns of Protein-Fold Usage in Eight Microbial Genomes: A Comprehensive Structural Census
Proteins 33:518-534

...ition-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used ...

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0009342339
Fischer, D.; Eisenberg, D.
Assigning folds to the proteins encoded by the genome of Mycoplasma genitalium
Algorithms; Amino Acid Sequence; Bacterial Proteins/*chemistry; Confidence Intervals; Databases, Factual; Forecasting; *Genome, Bacterial; Hydrolases/chemistry; Membrane Proteins/chemistry; Models, Molecular; Mycoplasma/*genetics; Nucleoside-Phosphate Kinase/chemistry; *Protein Folding; *Protein Structure, Secondary; Sequence Alignment/methods; Sequence Homology, Amino Acid
A crucial step in exploiting the information inherent in genome sequences is to assign to each protein sequence its three-dimensional fold and biological function.
Here we describe fold assignment for the proteins encoded by the small genome of Mycoplasma genitalium. The assignment was carried out by our computer server (http://www.doe-mbi.ucla.edu/people/frsvr/frsvr.html), which assigns folds to amino acid sequences by comparing sequence-derived predictions with known structures. Of the total of 468 protein ORFs, 103 (22%) can be assigned a known protein fold with high confidence, as cross-validated with tests on known structures. Of these sequences, 75 (16%) show enough sequence similarity to proteins of known structure that they can also be detected by traditional sequence-sequence comparison methods. That is, the difference of 28 sequences (6%) are assignable by the sequence-structure method of the server but not by current sequence-sequence methods. Of the remaining 78% of sequences in the genome, 18% belong to membrane proteins and the remaining 60% cannot be assigned either because these sequences correspond to no presently known fold or because of insensitivity of the method. At the current rate of determination of new folds by x-ray and NMR methods, extrapolation suggests that folds will be assigned to most soluble proteins in the next decade.
University of California, Los Angeles-Department of Energy Laboratory of Structural Biology and Molecular Medicine, Molecular Biology Institute, University of California, Los Angeles, Box 951570, Los Angeles, CA 90095-1570, USA.
0009342339
Proc Natl Acad Sci U S A 1997;94(22):11929-34

Fischer, D.; Eisenberg, D.
Predicting structures for genome proteins
Amino Acid Sequence; Crystallization; Genome, Human; Human; Internet; Models, Molecular; *Protein Conformation; Protein Folding; Proteins/*chemistry/*genetics; Sequence Homology, Amino Acid
Assigning three-dimensional protein folds to genome sequences is essential to understanding protein function.
Although experimental three-dimensional structures are currently available for only a very small fraction of these sequences, computational fold assignment is able to assign folds to 20-30% of the sequences in various genomes. This percentage varies depending on the particular organism under analysis, on the sensitivities of the methods used and on the number of experimental structures available at the time the assignment is carried out. The fraction of assignable sequences is currently increasing at an annual rate of roughly 18%. If this rate is sustained throughout the coming years, three-dimensional computational models for more than half of the genome sequences may be available by the year 2003.
Faculty of Natural Science, Department of Math and Computer Science, Beer-Sheva, 84015, Israel. dfischer@cs.bgu.ac.il
0010322219
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010322219
http://www.biomednet.com/article/sb9215
Curr Opin Struct Biol 1999;9(2):208-11

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0009389475
Klenk, H. P.; Clayton, R. A.; Tomb, J. F.; White, O.; Nelson, K. E.; Ketchum, K. A.; Dodson, R. J.; Gwinn, M.; Hickey, E. K.; Peterson, J. D.; Richardson, D. L.; Kerlavage, A. R.; Graham, D. E.; Kyrpides, N. C.; Fleischmann, R. D.; Quackenbush, J.; Lee, N. H.; Sutton, G. G.; Gill, S.; Kirkness, E. F.; Dougherty, B. A.; McKenney, K.; Adams, M. D.; Loftus, B.; Venter, J. C.; et al.
The complete genome sequence of the hyperthermophilic, sulphate-reducing archaeon Archaeoglobus fulgidus [published erratum appears in Nature 1998 Jul 2;394(6688):101]
Archaeoglobus fulgidus/*genetics/metabolism/physiology; Base Sequence; Cell Division; DNA, Bacterial/genetics; Energy Metabolism; Gene Expression Regulation, Bacterial; *Genes, Archaeal; *Genome; Molecular Sequence Data; Support, U.S. Gov't, Non-P.H.S.; Transcription, Genetic; Translation, Genetic
Archaeoglobus fulgidus is the first sulphur-metabolizing organism to have its genome seque...

Kallberg, Y.; Persson, B.
KIND-a non-redundant protein database
Algorithms; *Databases, Factual; Open Reading Frames; Proteins/*genetics; Sequence Alignment; Software; Support, Non-U.S. Gov't
SUMMARY: KIND (Karolinska Institutet Nonredundant Database) is a protein database where identical sequences, both full length and partial, have been removed. The database contains nearly 274 900 sequences, half of which originate from the protein sequence databases Swissprot and PIR, while the other half come from translated open reading frames in GenPept and TrEMBL. AVAILABILITY: KIND is downloadable from ftp://ftp.mbb.ki.se/pub/KIND.
Department of Medical Biochemistry and Biophysics, Karolinska Institutet, S-171 77 Stockholm, Sweden. yvonne.kallberg@mbb.ki.se
0010222415
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010222415
http://www.oup.co.uk/bioinformatics/hdb/Volume_15/Issue_03/btc038_gml.abs.html
Bioinformatics 1999;15(3):260-1

Versatile gene uptake system found in cholera bacterium [news]
Cloning, Molecular; Drug Resistance, Microbial/genetics; Escherichia coli/genetics/pathogenicity; *Genes, Bacterial; Integrase/*genetics/metabolism; *Recombination, Genetic; *Repetitive Sequences, Nucleic Acid; Vibrio cholerae/enzymology/*genetics/pathogenicity; Virulence/genetics
0009575097
Science 1998;280(5363):521-2

Elofsson, A.; Sonnhammer, E. L.
A comparison of sequence and structure protein domain families as a basis for structural genomics
Comparative Study; Computational Biology; *Databases, Factual; Genome; Protein Folding; Proteins/*chemistry/classification/*genetics; Sequence Alignment; Sequence Homology, Amino Acid; Support, Non-U.S. Gov't
MOTIVATION: Protein families can be defined based on structure or sequence similarity. We wanted to compare two protein family databases, one based on structural and one on sequence similarity, to investigate to what extent they overlap, the similarity in definition of corresponding families, and to create a list of large protein families with unknown structure as a resource for structural genomics. We also wanted to increase the sensitivity of fold assignment by exploiting protein family HMMs. RESULTS: We compared Pfam, a protein family database based on sequence similarity, to Scop, which is based on structural similarity. We found that 70% of the Scop families exist in Pfam while 57% of the Pfam families exist in Scop. Most families that occur in both databases correspond well to each other, but in some cases they are different. Such cases highlight situations in which structure and sequence approaches differ significantly. The comparison enabled us to compile a list of the largest families that do not occur in Scop; these are suitable targets for structure prediction and determination, and may be useful to guide projects in structural genomics. It can be noted that 13 out of the 20 largest protein families without a known structure are likely transmembrane proteins.
We also exploited Pfam to increase the sensitivity of detecting homologs of proteins with known structure, by comparing query sequences to Pfam HMMs that correspond to Scop families. For SWISSPROT+TREMBL, this yielded an increase in fold assignment from 31% to 42% compared to using FASTA only. This method assigned a structure to 22% of the proteins in Saccharomyces cerevisiae, 24% in Escherichia coli, and 16% in Methanococcus jannaschii.
Department of Biochemistry, Stockholm University, 106 91 Stockholm, Sweden. arne@biokemi.su.se
0010383473
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010383473
http://www.oup.co.uk/bioinformatics/hdb/Volume_15/Issue_06/btc086_gml.abs.html
Bioinformatics 1999;15(6):480-500

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0002124629
Doolittle, R. F.; Feng, D. F.; Anderson, K. L.; Alberro, M. R.
A naturally occurring horizontal gene transfer from a eukaryote to a prokaryote
Amino Acid Sequence; Escherichia coli/enzymology/*genetics; Glyceraldehyde-3-Phosphate Dehydrogenases/*genetics; Methanogens/enzymology/*genetics; Molecular Sequence Data; Phosphoglycerate Kinase/genetics; *Phylogeny; Rhizobium/enzymology/*genetics; Support, U.S. Gov't, P.H.S.; *Transfection
Naturally occurring horizontal gene transfers between nonviral organisms are difficult to prove. Only with the availability of sequence data from a wide variety of organisms can a convincing case be made. In the case of putative gene transfers between prokaryotes and eukaryotes, the minimum requirements for inferring such an event include (1) sequences of the transferred gene or its product from several appropriately divergent eukaryotes and several prokaryotes, and (2) a similar set of sequences from the same (or closely related organisms) for another gene or genes.
Given these criteria, we believe that a strong case can be made for Escherichia coli having acquired a second glyceraldehyde-3-phosphate dehydrogenase gene from some eukaryotic host. Ancillary observations on the general rate of change and the time of the prokaryote-eukaryote divergence support the notion.
Center for Molecular Genetics M-034, University of California, San Diego, La Jolla 92093.
0002124629
J Mol Evol 1990;31(5):383-8

Doolittle, W. F.
Lateral genomics
Evolution, Molecular Genome Phylogeny Prokaryotic Cells/*classification *RNA, Ribosomal/classification Support, Non-U.S. Gov't
More than 20 complete prokaryotic genome sequences are now publicly available, each by itself an unparalleled resource for understanding organismal biology. Collectively, these data are even more powerful: they could force a dramatic reworking of the framework in which we understand biological evolution. It is possible that a single universal phylogenetic tree is not the best way to depict relationships between all living and extinct species. Instead a web- or net-like pattern, reflecting the importance of horizontal or lateral gene transfer between lineages of organisms, might provide a more appropriate visual metaphor. Here, I ask whether this way of thinking is really justified, and explore its implications.
Canadian Institute for Advanced Research, Dept of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Nova Scotia, Canada. ford@is.dal.ca
0010611671
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010611671
http://www.biomednet.com/library/fulltext/TCB.etd00182_09628924_v0009i12_00001664
http://www.biomednet.com/library/abstract/TCB.etd00182_09628924_v0009i12_00001664
Trends Cell Biol 1999;9(12):M5-8

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0009775387
Dubchak, I. Muchnik, I. Kim, S. H.
Assignment of folds for proteins of unknown function in three microbial genomes
Amino Acid Sequence Bacterial Proteins/*genetics Computational Biology/classification Databases, Factual/classification Genome, Bacterial Haemophilus influenzae/genetics Methanococcus/genetics Molecular Sequence Data Mycoplasma/genetics *Protein Folding Support, U.S. Gov't, Non-P.H.S.
Analysis of DNA sequences of several microbial genomes has revealed that a large fraction of predicted coding regions has no known protein function. Information about the three-dimensional folds of these proteins may provide insight into their possible functions. To predict the folds for protein sequences with little or no homology to proteins of known function, we used computational neural networks trained on the database of proteins with known three-dimensional structures. Global descriptions of protein sequences based on physical and structural properties of the constituent amino acids were used as inputs for neural networks. Of the 131, 498, and 868 protein sequences of unknown function from Mycoplasma genitalium, Haemophilus influenzae, and Methanococcus jannaschii (Fleischmann et al. 1995), we have made high-confidence fold assignments for 4, 10, and 19 sequences, respectively.
E. O. Lawrence Berkeley National Laboratory, University of California, Berkeley, USA.
0009775387
Microb Comp Genomics 1998;3(3):171-5
Using Smart Source Parsing

Eisenstein, E. Gilliland, G. L. Herzberg, O. Moult, J. Orban, J. Poljak, R. J. Banerjei, L. Richardson, D. Howard, A. J.
Biological function made crystal clear - annotation of hypothetical proteins via structural genomics
Bacterial Proteins/biosynthesis/*chemistry/genetics/*metabolism Crystallization Crystallography, X-Ray Genes, Essential/genetics/physiology *Genome, Bacterial Haemophilus influenzae/*chemistry/enzymology/*genetics Nuclear Magnetic Resonance, Biomolecular Protein Binding Protein Conformation Recombinant Fusion Proteins/biosynthesis/chemistry/genetics/metabolism Structure-Activity Relationship Support, U.S. Gov't, P.H.S.
Many of the gene products of completely sequenced organisms are 'hypothetical' - they cannot be related to any previously characterized proteins - and so are of completely unknown function. Structural studies provide one means of obtaining functional information in these cases. A 'structural genomics' project has been initiated aimed at determining the structures of 50 hypothetical proteins from Haemophilus influenzae to gain an understanding of their function. Each stage of the project - target selection, protein production, crystallization, structure determination, and structure analysis - makes use of recent advances to streamline procedures. Early results from this and similar projects are encouraging in that some level of functional understanding can be deduced from experimentally solved structures.
Center for Advanced Research in Biotechnology, University of Maryland Biotechnology Institute, National Institute of Standards and Technology, Rockville, MD 20850, USA.
0010679350
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010679350
http://www.biomednet.com/article/btb116i
Curr Opin Biotechnol 2000;11(1):25-30

Holm, L. Sander, C.
Touring protein fold space with Dali/FSSP
Computer Communication Networks *Databases, Factual Information Storage and Retrieval Protein Conformation *Protein Folding Proteins/*chemistry
The FSSP database and its new supplement, the Dali Domain Dictionary, present a continuously updated classification of all known 3D protein structures. The classification is derived using an automatic structure alignment program (Dali) for the all-against-all comparison of structures in the Protein Data Bank. From the resulting enumeration of structural neighbours (which form a surprisingly continuous distribution in fold space) we derive a discrete fold classification in three steps: (i) sequence-related families are covered by a representative set of protein chains; (ii) protein chains are decomposed into structural domains based on the recurrence of structural motifs; (iii) folds are defined as tight clusters of domains in fold space. The fold classification, domain definitions and test sets for sequence-structure alignment (threading) are accessible on the web at www.embl-ebi.ac.uk/dali. The web interface provides a rich network of links between neighbours in fold space, between domains and proteins, and between structures and sequences leading, for example, to a database of explicit multiple alignments of protein families in the twilight zone of sequence similarity. The Dali/FSSP organization of protein structures provides a map of the currently known regions of the protein universe that is useful for the analysis of folding principles, for the evolutionary unification of protein families and for maximizing the information return from experimental structure determination.
Nucleic Acids Res 1998;26(1):316-9

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0009665839
Huynen, M. Doerks, T. Eisenhaber, F. Orengo, C. Sunyaev, S. Yuan, Y. Bork, P.
Homology-based fold predictions for Mycoplasma genitalium proteins
Bacterial Proteins/*chemistry Mycoplasma/*chemistry Protein Conformation *Protein Folding Sequence Homology Support, Non-U.S. Gov't
Homology search techniques based on the iterative PSI-BLAST method in combination with various filters for low sequence complexity are applied to assign folds to all Mycoplasma genitalium proteins. The resulting procedure (implemented as a web server) is able to predict at least one domain in 37% of these proteins automatically, with an estimated accuracy higher than 98%. Taking structural features such as coiled coil or transmembrane regions aside, folds can be assigned to more than half of the globular proteins in a bacterium just by iterative sequence comparison. Copyright 1998 Academic Press.
EMBL, Max-Delbruck-Center for Molecular Medicine, Meyerhoftstr.1, Heidelberg, 69012, Germany.
0009665839
J Mol Biol 1998;280(3):323-6

Huynen, M. A. Dandekar, T. Bork, P.
Variation and evolution of the citric-acid cycle: a genomic perspective
Archaea/genetics/metabolism Bacteria/genetics/metabolism Citric Acid Cycle/*genetics *Evolution, Molecular Support, Non-U.S. Gov't *Variation (Genetics) Yeasts/genetics/metabolism
The presence of genes encoding enzymes involved in the citric-acid cycle has been studied in 19 completely sequenced genomes. In the majority of species, the cycle appears to be incomplete or absent. Several distinct, incomplete cycles reflect adaptations to different environments. Their distribution over the phylogenetic tree hints at precursors in the evolution of the citric-acid cycle.
European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany.
huynen@embl-heidelberg.de
0010390638
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010390638
http://www.biomednet.com/library/fulltext/TIM.timi0799_0966842x_v0007i07_00001539
http://www.biomednet.com/library/abstract/TIM.timi0799_0966842x_v0007i07_00001539
Trends Microbiol 1999;7(7):281-91

R Jansen M Gerstein
Analysis of the Yeast Transcriptome with Broad Structural and Functional Categories: Characterizing Highly Expressed Proteins
Nuc. Acids Res. 2000;28:1481-1488

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010191147
Jones, D. T.
GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences
Algorithms Amino Acid Sequence *Genome Molecular Sequence Data Neural Networks (Computer) Open Reading Frames *Protein Conformation *Protein Folding Reproducibility of Results Sequence Alignment/*methods Sequence Homology, Amino Acid
A new protein fold recognition method is described which is both fast and reliable. The method uses a traditional sequence alignment algorithm to generate alignments which are then evaluated by a method derived from threading techniques. As a final step, each threaded model is evaluated by a neural network in order to produce a single measure of confidence in the proposed prediction. The speed of the method, along with its sensitivity and very low false-positive rate makes it ideal for automatically predicting the structure of all the proteins in a translated bacterial genome (proteome). The method has been applied to the genome of Mycoplasma genitalium, and analysis of the results shows that as many as 46 % of the proteins derived from the predicted protein coding regions have a significant relationship to a protein of known structure. In some cases, however, only one domain of the protein can be predicted, giving a total coverage of 30 % when calculated as a fraction of the number of amino acid residues in the whole proteome.
Copyright 1999 Academic Press.
Department of Biological Sciences, University of Warwick, Coventry, CV4 7AL, UK. jones@globin.bio.warwick.ac.uk
0010191147
J Mol Biol 1999;287(4):797-815

Gerstein, M.
Measurement of the effectiveness of transitive sequence comparison, through a third 'intermediate' sequence
Amino Acid Sequence *Databases, Factual Protein Conformation Proteins/*chemistry *Sequence Alignment Support, U.S. Gov't, Non-P.H.S.
MOTIVATION: Transitive sequence matching expands the scope of sequence comparison by re-running the results of a given query against the databank as a new query. This sometimes results in the initial query sequence (Q) being related to a final match (M) indirectly, through a third, 'intermediate' sequence (Q --> I --> M). This approach has often been suggested as providing greater sensitivity in sequence comparison; however, it has not yet been possible to gauge its improvement precisely. RESULTS: Here, this improvement is comprehensively measured by seeing what fraction of the known structural relationships transitive sequence matching can uncover beyond that found by normal pairwise comparison (i.e. direct linkage). The structural relationships are taken from a well-characterized test set, the scop classification of protein structure. Specifically, 2055 known structural similarities (called 'pairs') between distantly related proteins constitute the basic test set. To make the measurement of transitive matching properly, special data sets, called 'baseline sets', are derived from this. They consist of pairs of sequences that have a clear structural relationship that cannot be found by normal sequence comparison (i.e. they cannot be directly linked).
Specifically, using standard sequence comparison protocols (FASTA with an e-value cut-off of 0.001), it is found that the baseline set consists of 1742 pairs. A third intermediate sequence can link 86 of these indirectly (5%), where this third sequence is drawn from the entire, current universe of protein sequences. The number of false positives is minimal. Furthermore, when one considers only the relationships within the test set that correspond to a close structural alignment, the coverage increases considerably. In particular, 862 of the baseline set pairs fit to better than 2.6 A RMS, and transitive matching can find 62 of these (9%). AVAILABILITY: All the test data, including precise similarity values calculated from structural alignment, are available in tabular format over the Web from http://bioinfo.mbb.yale.edu/align. CONTACT: Mark.Gerstein@yale.edu
Department of Molecular Biophysics and Biochemistry, 266 Whitney Avenue, Yale University, PO Box 208114, New Haven, CT 06520, USA. Mark.Gerstein@yale.ed
0009789096
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0009789096
http://www.oup.co.uk/bioinformatics/hdb/Volume_14/Issue_08/btb111_gml.abs.html
Bioinformatics 1998;14(8):707-14
Using Smart Source Parsing

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0008849441
Goffeau, A. Barrell, B. G. Bussey, H. Davis, R. W. Dujon, B. Feldmann, H. Galibert, F. Hoheisel, J. D. Jacq, C. Johnston, M. Louis, E. J. Mewes, H. W. Murakami, Y. Philippsen, P. Tettelin, H. Oliver, S. G.
Life with 6000 genes [see comments]
Amino Acid Sequence Base Sequence *Chromosome Mapping Chromosomes, Fungal/genetics Computer Communication Networks DNA, Fungal/genetics Evolution, Molecular Fungal Proteins/chemistry/genetics/physiology Gene Library *Genes, Fungal *Genome, Fungal International Cooperation Multigene Family Open Reading Frames RNA, Fungal/genetics Saccharomyces cerevisiae/*genetics Sequence Analysis, DNA
The genome of the yeast Saccharomyces cerevisiae has been completely sequenced through a worldwide collaboration. The sequence of 12,068 kilobases defines 5885 potential protein-encoding genes, approximately 140 genes specifying ribosomal RNA, 40 genes for small nuclear RNA molecules, and 275 transfer RNA genes. In addition, the complete sequence provides information about the higher order organization of yeast's 16 chromosomes and allows some insight into their evolutionary history. The genome shows a considerable amount of apparent genetic redundancy, and one of the major problems to be tackled during the next stage of the yeast genome project is to elucidate the biological functions of all of these genes.
Universite Catholique de Louvain, Unite de Biochimie Physiologique, Place Croix du Sud, 2/20, 1348 Louvain-la-Neuve, Belgium.
Comment in: Science 1997 Feb 21;275(5303):1051-2
0008849441
Science 1996;274(5287):546, 563-7

Gygi, S. P. Rochon, Y. Franza, B. R. Aebersold, R.
Correlation between protein and mRNA abundance in yeast
Codon Fungal Proteins/*analysis *Gene Expression Regulation, Fungal RNA, Fungal/*analysis RNA, Messenger/*analysis Saccharomyces cerevisiae/*genetics/metabolism Support, U.S. Gov't, Non-P.H.S. Support, U.S. Gov't, P.H.S.
We have determined the relationship between mRNA and protein expression levels for selected genes expressed in the yeast Saccharomyces cerevisiae growing at mid-log phase. The proteins contained in total yeast cell lysate were separated by high-resolution two-dimensional (2D) gel electrophoresis.
Over 150 protein spots were excised and identified by capillary liquid chromatography-tandem mass spectrometry (LC-MS/MS). Protein spots were quantified by metabolic labeling and scintillation counting. Corresponding mRNA levels were calculated from serial analysis of gene expression (SAGE) frequency tables (V. E. Velculescu, L. Zhang, W. Zhou, J. Vogelstein, M. A. Basrai, D. E. Bassett, Jr., P. Hieter, B. Vogelstein, and K. W. Kinzler, Cell 88:243-251, 1997). We found that the correlation between mRNA and protein levels was insufficient to predict protein expression levels from quantitative mRNA data. Indeed, for some genes, while the mRNA levels were of the same value the protein levels varied by more than 20-fold. Conversely, invariant steady-state levels of certain proteins were observed with respective mRNA transcript levels that varied by as much as 30-fold. Another interesting observation is that codon bias is not a predictor of either protein or mRNA levels. Our results clearly delineate the technical boundaries of current approaches for quantitative analysis of protein expression and reveal that simple deduction from mRNA transcript analysis is insufficient.
Department of Molecular Biotechnology, University of Washington, Seattle, Washington 98195-7730, USA.
0010022859
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010022859
http://mcb.asm.org/cgi/content/full/19/3/1720
Mol Cell Biol 1999;19(3):1720-30

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0009106201
Hacker, J. Blum-Oehler, G. Muhldorfer, I. Tschape, H.
Pathogenicity islands of virulent bacteria: structure, function and impact on microbial evolution
Chromosomes/*genetics Evolution Genes, Bacterial Gram-Negative Bacteria/*genetics/pathogenicity Gram-Positive Bacteria/*genetics/pathogenicity Support, Non-U.S. Gov't Virulence/genetics/physiology
Virulence genes of pathogenic bacteria, which code for toxins, adhesins, invasins or other virulence factors, may be located on transmissible genetic elements such as transposons, plasmids or bacteriophages. In addition, such genes may be part of particular regions on the bacterial chromosomes, termed 'pathogenicity islands' (Pais). Pathogenicity islands are found in Gram-negative as well as in Gram-positive bacteria. They are present in the genome of pathogenic strains of a given species but absent or only rarely present in those of non-pathogenic variants of the same or related species. They comprise large DNA regions (up to 200 kb of DNA) and often carry more than one virulence gene, the G + C contents of which often differ from those of the remaining bacterial genome. In most cases, Pais are flanked by specific DNA sequences, such as direct repeats or insertion sequence (IS) elements. In addition, Pais of certain bacteria (e.g. uropathogenic Escherichia coli, Yersinia spp., Helicobacter pylori) have the tendency to delete with high frequencies or may undergo duplications and amplifications. Pais are often associated with tRNA loci, which may represent target sites for the chromosomal integration of these elements. Bacteriophage attachment sites and cryptic genes on Pais, which are homologous to phage integrase genes, plasmid origins of replication or IS elements, indicate that these particular genetic elements were previously able to spread among bacterial populations by horizontal gene transfer, a process known to contribute to microbial evolution.
Institut fur Molekulare Infektionsbiologie, Rontgenring, Wurzburg, Germany.
j.hacker@rzbox.uni-wuerzburg.de
0009106201
Mol Microbiol 1997;23(6):1089-97

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0009521122
Gerstein, M. Levitt, M.
Comprehensive assessment of automatic structural alignment against a manual standard, the scop classification of proteins
Amino Acid Sequence Automation Molecular Sequence Data Protein Conformation Proteins/*chemistry/classification *Reference Standards Sequence Alignment/methods/*standards Support, U.S. Gov't, Non-P.H.S.
We apply a simple method for aligning protein sequences on the basis of a 3D structure, on a large scale, to the proteins in the scop classification of fold families. This allows us to assess, understand, and improve our automatic method against an objective, manually derived standard, a type of comprehensive evaluation that has not yet been possible for other structural alignment algorithms. Our basic approach directly matches the backbones of two structures, using repeated cycles of dynamic programming and least-squares fitting to determine an alignment minimizing coordinate difference. Because of simplicity, our method can be readily modified to take into account additional features of protein structure such as the orientation of side chains or the location-dependent cost of opening a gap. Our basic method, augmented by such modifications, can find reasonable alignments for all but 1.5% of the known structural similarities in scop, i.e., all but 32 of the 2,107 superfamily pairs.
We discuss the specific protein structural features that make these 32 pairs so difficult to align and show how our procedure effectively partitions the relationships in scop into different categories, depending on what aspects of protein structure are involved (e.g., depending on whether or not consideration of side-chain orientation is necessary for proper alignment). We also show how our pairwise alignment procedure can be extended to generate a multiple alignment for a group of related structures. We have compared these alignments in detail with corresponding manual ones culled from the literature. We find good agreement (to within 95% for the core regions), and detailed comparison highlights how particular protein structural features (such as certain strands) are problematical to align, giving somewhat ambiguous results. With these improvements and systematic tests, our procedure should be useful for the development of scop and the future classification of protein folds.
Molecular Biophysics & Biochemistry Department, Yale University, New Haven, Connecticut 06520-8114, USA. Mark.Gerstein@yale.edu
0009521122
Protein Sci 1998;7(2):445-56

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0007990952
Orengo, C. A. Jones, D. T. Thornton, J. M.
Protein superfamilies and domain superfolds
Algorithms Databases, Factual Protein Conformation *Protein Folding Proteins/*chemistry/*classification Sequence Alignment Structure-Activity Relationship
As the protein sequence and structure databases expand rapidly a better understanding of the relationships between proteins is required. A classification is considered that extends the sequence-based superfamilies to include proteins with similar function and three-dimensional structures but no sequence similarity. So far there are only nine protein folds known to recur in proteins having neither sequence nor functional similarity.
These folds dominate the structure database, representing more than 30 per cent of all determined structures. This observation has implications for protein-fold recognition.
Biochemistry and Molecular Biology Department, University College London, UK.
0007990952
Nature 1994;372(6507):631-4

Orengo, C. A. Michie, A. D. Jones, S. Jones, D. T. Swindells, M. B. Thornton, J. M.
CATH--a hierarchic classification of protein domain structures
Databases, Factual Models, Molecular Protein Folding Protein Structure, Secondary *Protein Structure, Tertiary Proteins/*chemistry/*classification Sequence Homology, Amino Acid Support, Non-U.S. Gov't
BACKGROUND: Protein evolution gives rise to families of structurally related proteins, within which sequence identities can be extremely low. As a result, structure-based classifications can be effective at identifying unanticipated relationships in known structures and in optimal cases function can also be assigned. The ever increasing number of known protein structures is too large to classify all proteins manually, therefore, automatic methods are needed for fast evaluation of protein structures. RESULTS: We present a semi-automatic procedure for deriving a novel hierarchical classification of protein domain structures (CATH). The four main levels of our classification are protein class (C), architecture (A), topology (T) and homologous superfamily (H). Class is the simplest level, and it essentially describes the secondary structure composition of each domain. In contrast, architecture summarises the shape revealed by the orientations of the secondary structure units, such as barrels and sandwiches. At the topology level, sequential connectivity is considered, such that members of the same architecture might have quite different topologies.
When structures belonging to the same T-level have suitably high similarities combined with similar functions, the proteins are assumed to be evolutionarily related and put into the same homologous superfamily. CONCLUSIONS: Analysis of the structural families generated by CATH reveals the prominent features of protein structure space. We find that nearly a third of the homologous superfamilies (H-levels) belong to ten major T-levels, which we call superfolds, and furthermore that nearly two-thirds of these H-levels cluster into nine simple architectures. A database of well-characterised protein structure families, such as CATH, will facilitate the assignment of structure-function/evolution relationships to both known and newly determined protein structures.
Structure 1997;5(8):1093-108
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0009403685

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010573422
Enright, A. J. Iliopoulos, I. Kyrpides, N. C. Ouzounis, C. A.
Protein interaction maps for complete genomes based on gene fusion events [see comments]
Bacterial Proteins/*genetics/metabolism/physiology Escherichia coli/genetics/physiology *Gene Fusion *Genome, Bacterial Haemophilus influenzae/genetics/physiology Methanococcus/genetics/physiology Protein Binding Protein Hybridization Support, Non-U.S. Gov't
A large-scale effort to measure, detect and analyse protein-protein interactions using experimental methods is under way. These include biochemistry such as co-immunoprecipitation or crosslinking, molecular biology such as the two-hybrid system or phage display, and genetics such as unlinked noncomplementing mutant detection. Using the two-hybrid system, an international effort to analyse the complete yeast genome is in progress. Evidently, all these approaches are tedious, labour intensive and inaccurate.
From a computational perspective, the question is how can we predict that two proteins interact from structure or sequence alone. Here we present a method that identifies gene-fusion events in complete genomes, solely based on sequence comparison. Because there must be selective pressure for certain genes to be fused over the course of evolution, we are able to predict functional associations of proteins. We show that 215 genes or proteins in the complete genomes of Escherichia coli, Haemophilus influenzae and Methanococcus jannaschii are involved in 64 unique fusion events. The approach is general, and can be applied even to genes of unknown function.
Computational Genomics Group, Research Programme, The European Bioinformatics Institute, EMBL Cambridge Outstation, UK.
Comment in: Nature 1999 Nov 4;402(6757):23, 25-6
0010573422
Nature 1999;402(6757):86-90

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0011104008
Thornton, J. M. Todd, A. E. Milburn, D. Borkakoti, N. Orengo, C. A.
From structure to function: approaches and limitations [In Process Citation]
This review presents a summary of current approaches to extract functional information from structural data on proteins and their complexes. While structural homologs may reveal possible biochemical functions (which may be hidden at the sequence level), elucidating the exact biological role of a protein in vivo will only be possible by including other results, such as data on expression and localization.
Biochemistry & Molecular Biology Dept, University College, London, UK. thornton@biochem.ucl.ac.uk
0011104008
Nat Struct Biol 2000;7 Suppl:991-4

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0009252185
Tomb, J. F. White, O. Kerlavage, A. R. Clayton, R. A. Sutton, G. G. Fleischmann, R. D. Ketchum, K. A. Klenk, H. P. Gill, S. Dougherty, B. A. Nelson, K. Quackenbush, J. Zhou, L. Kirkness, E. F. Peterson, S. Loftus, B. Richardson, D. Dodson, R. Khalak, H. G. Glodek, A. McKenney, K. Fitzegerald, L. M. Lee, N. Adams, M. D. Venter, J. C. et al.
The complete genome sequence of the gastric pathogen Helicobacter pylori [see comments] [published erratum appears in Nature 1997 Sep 25;389(6649):412]
Antigenic Variation Bacterial Adhesion Bacterial Proteins/secretion Base Sequence Cell Division DNA Repair DNA, Bacterial/genetics Evolution Gene Expression Regulation, Bacterial *Genome, Bacterial Helicobacter pylori/*genetics/metabolism/pathogenicity Hydrogen-Ion Concentration Molecular Sequence Data Recombination, Genetic Support, U.S. Gov't, P.H.S. Transcription, Genetic Translation, Genetic Virulence
Helicobacter pylori, strain 26695, has a circular genome of 1,667,867 base pairs and 1,590 predicted coding sequences. Sequence analysis indicates that H. pylori has well-developed systems for motility, for scavenging iron, and for DNA restriction and modification. Many putative adhesins, lipoproteins and other outer membrane proteins were identified, underscoring the potential complexity of host-pathogen interaction. Based on the large number of sequence-related genes encoding outer membrane proteins and the presence of homopolymeric tracts and dinucleotide repeats in coding sequences, H. pylori, like several other mucosal pathogens, probably uses recombination and slipped-strand mispairing within repeats as mechanisms for antigenic variation and adaptive evolution. Consistent with its restricted niche, H. pylori has a few regulatory networks, and a limited metabolic repertoire and biosynthetic capacity. Its survival in acid conditions depends, in part, on its ability to establish a positive inside-membrane potential in low pH.
The Institute for Genomic Research, Rockville, Maryland 20850, USA.
ghp@tigr.org
Comment in: Nature 1997 Aug 7;388(6642):515-6
0009252185
Nature 1997;388(6642):539-47

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010357579
Gerstein, M. Hegyi, H.
Comparing genomes in terms of protein structure: surveys of a finite parts list
Amino Acid Sequence Animal Bacterial Proteins/*chemistry/*genetics Comparative Study Databases, Factual Fungal Proteins/*chemistry/*genetics *Genome, Bacterial *Genome, Fungal Genome, Human Human Molecular Sequence Data Sequence Alignment Support, Non-U.S. Gov't
We give an overview of the emerging field of structural genomics, describing how genomes can be compared in terms of protein structure. As the number of genes in a genome and the total number of protein folds are both quite limited, these comparisons take the form of surveys of a finite parts list, similar in respects to demographic censuses. Fold surveys have many similarities with other whole-genome characterizations, e.g., analyses of motifs or pathways. However, structure has a number of aspects that make it particularly suitable for comparing genomes, namely the way it allows for the precise definition of a basic protein module and the fact that it has a better defined relationship to sequence similarity than does protein function. An essential requirement for a structure survey is a library of folds, which groups the known structures into 'fold families.' This library can be built up automatically using a structure comparison program, and we described how important objective statistical measures are for assessing similarities within the library and between the library and genome sequences. After building the library, one can use it to count the number of folds in genomes, expressing the results in the form of Venn diagrams and 'top-10' statistics for shared and common folds.
Depending on the counting methodology employed, these statistics can reflect different aspects of the genome, such as the amount of internal duplication or gene expression. Previous analyses have shown that the common folds shared between very different microorganisms, i.e., in different kingdoms, have a remarkably similar structure, being comprised of repeated strand-helix-strand super-secondary structure units. A major difficulty with this sort of 'fold-counting' is that only a small subset of the structures in a complete genome are currently known and this subset is prone to sampling bias. One way of overcoming biases is through structure prediction, which can be applied uniformly and comprehensively to a whole genome. Various investigators have, in fact, already applied many of the existing techniques for predicting secondary structure and transmembrane (TM) helices to the recently sequenced genomes. The results have been consistent: microbial genomes have similar fractions of strands and helices even though they have significantly different amino acid composition. The fraction of membrane proteins with a given number of TM helices falls off rapidly with more TM elements, approximately according to a Zipf law. This latter finding indicates that there is no preference for the highly studied 7-TM proteins in microbial genomes. Continuously updated tables and further information pertinent to this review are available over the web at http://bioinfo.mbb.yale.edu/genome.'|vDepartment of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA. mark.gerstein@yale.edu 0010357579FEMS Microbiol Rev 1998224277-304f`B \Uhttp://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010329133rHegyi, H. 
Gerstein, M.
The relationship between protein structure and function: a comprehensive survey with application to the yeast genome
Comparative Study Enzymes/chemistry Evolution, Molecular Fungal Proteins/*chemistry/genetics Genome, Fungal Models, Molecular *Protein Conformation Protein Folding Proteins/classification Saccharomyces cerevisiae/chemistry/genetics Sequence Homology, Amino Acid *Structure-Activity Relationship Support, Non-U.S. Gov't Support, U.S. Gov't, Non-P.H.S.
For most proteins in the genome databases, function is predicted via sequence comparison. In spite of the popularity of this approach, the extent to which it can be reliably applied is unknown. We address this issue by systematically investigating the relationship between protein function and structure. We focus initially on enzymes functionally classified by the Enzyme Commission (EC) and relate these to domains structurally classified by the SCOP database. We find that the major SCOP fold classes have different propensities to carry out certain broad categories of functions. For instance, alpha/beta folds are disproportionately associated with enzymes, especially transferases and hydrolases, and all-alpha and small folds with non-enzymes, while alpha+beta folds have an equal tendency either way. These observations for the database overall are largely true for specific genomes. We focus, in particular, on yeast, analyzing it with many classifications in addition to SCOP and EC (i.e. COGs, CATH, MIPS), and find clear tendencies for fold-function association, across a broad spectrum of functions. Analysis with the COGs scheme also suggests that the functions of the most ancient proteins are more evenly distributed among different structural classes than those of more modern ones. For the database overall, we identify the most versatile functions, i.e. those that are associated with the most folds, and the most versatile folds, associated with the most functions.
The two most versatile enzymatic functions (hydro-lyases and O-glycosyl glucosidases) are associated with seven folds each. The five most versatile folds (TIM-barrel, Rossmann, ferredoxin, alpha-beta hydrolase, and P-loop NTP hydrolase) are all mixed alpha-beta structures. They stand out as generic scaffolds, accommodating from six to as many as 16 functions (for the exceptional TIM-barrel). At the conclusion of our analysis we are able to construct a graph giving the chance that a functional annotation can be reliably transferred at different degrees of sequence and structural similarity. Supplemental information is available from http://bioinfo.mbb.yale.edu/genome/foldfunc++ +. Copyright 1999 Academic Press.
Department of Molecular Biophysics & Biochemistry Yale University, 266 Whitney Avenue, New Haven, CT 06520, USA.
0010329133
J Mol Biol 1999 288 1 147-64

Himmelreich, R. Hilbert, H. Plagens, H. Pirkl, E. Li, B. C. Herrmann, R.
Complete sequence analysis of the genome of the bacterium Mycoplasma pneumoniae
Base Sequence DNA Repair DNA Replication DNA, Bacterial/biosynthesis/*chemistry Gene Expression Regulation, Bacterial *Genome, Bacterial Molecular Sequence Data Molecular Weight Mycoplasma pneumoniae/*genetics Open Reading Frames Support, Non-U.S. Gov't Transcription, Genetic Translation, Genetic
The entire genome of the bacterium Mycoplasma pneumoniae M129 has been sequenced. It has a size of 816,394 base pairs with an average G+C content of 40.0 mol%. We predict 677 open reading frames (ORFs) and 39 genes coding for various RNA species. Of the predicted ORFs, 75.9% showed significant similarity to genes/proteins of other organisms while only 9.9% did not reveal any significant similarity to gene sequences in databases. This permitted us tentatively to assign a functional classification to a large number of ORFs and to deduce the biochemical and physiological properties of this bacterium. The reduction of the genome size of M.
pneumoniae during its reductive evolution from ancestral bacteria can be explained by the loss of complete anabolic (e.g. no amino acid synthesis) and metabolic pathways. Therefore, M. pneumoniae depends in nature on an obligate parasitic lifestyle which requires the provision of exogenous essential metabolites. All the major classes of cellular processes and metabolic pathways are briefly described. For a number of activities/functions present in M. pneumoniae according to experimental evidence, the corresponding genes could not be identified by similarity search. For instance we failed to identify genes/proteins involved in motility, chemotaxis and management of oxidative stress.
Zentrum fur Molekulare Biologie Heidelberg, Mikrobiologie, Universitat Heidelberg, Germany.
0008948633
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0008948633
http://www.oup.co.uk/nar/Volume_24/Issue_22/6b0244_gml.abs.html
Nucleic Acids Res 1996 24 22 4420-49

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0009521122
Gerstein, M. Levitt, M.
Comprehensive assessment of automatic structural alignment against a manual standard, the scop classification of proteins
Amino Acid Sequence Automation Molecular Sequence Data Protein Conformation Proteins/*chemistry/classification *Reference Standards Sequence Alignment/methods/*standards Support, U.S. Gov't, Non-P

Gerstein, M.
How representative are the known structures of the proteins in a complete genome? A comprehensive structural census
Animal Databases, Factual *Genome Human Molecular Sequence Data Peptide Library *Protein Folding Proteins/*chemistry/*genetics Sequence Analysis Support, Non-U.S. Gov't
BACKGROUND: Determining how representative the known structures are of the proteins encoded by a complete genome is important for assessing to what extent our current picture of protein stability and folding is overly influenced by biases in the structure databank (PDB).
It is also important for improving database-based methods of structure prediction and genome annotation. RESULTS: The known structures are compared to the proteins encoded by eight complete microbial genomes in terms of simple statistics such as sequence length, composition and secondary structure. The known structures are represented by a collection of nonhomologous domains from the PDB and a smaller list of 'biophysical proteins' on which folding experiments have concentrated. The proteins encoded by the genomes are considered as a whole and divided into various regions, such as known-structure homologue, low complexity (nonglobular), transmembrane or linker. Various tests are performed to assess the significance of the reported differences, in both a practical and a statistical sense. CONCLUSIONS: The proteins encoded by the genomes are significantly different from those in the PDB. Their sequence lengths, which follow an extreme value distribution, are longer than the PDB proteins and much longer than the biophysical proteins. Their composition differs from the PDB proteins in having more Lys, Ile, Asn and Gln and less Cys and Trp. This is true overall and especially for the regions corresponding to soluble proteins of as yet unknown fold. Secondary-structure prediction on these uncharacterized regions indicates that they contain on average more helical structure than the PDB; differences about this mean are small, with yeast having slightly more sheet structure and Haemophilus influenzae and Helicobacter pylori more helical structure. Further information is available through the GeneCensus system at http://bioinfo.mbb.yale.edu/genome.
Department of Molecular Biophysics & Biochemistry, Yale University, New Haven, CT 06520, USA. Mark.Gerstein@yale.edu
0009889159
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0009889159
http://www.biomednet.com/article/fd3607
Fold Des 1998 3 6 497-512
Using Smart Source Parsing

Leipe, D. D.
Aravind, L. Koonin, E. V.
Did DNA replication evolve twice independently?
Animal Archaeal Proteins/genetics Bacterial Proteins/genetics Cytoplasm/*genetics Databases, Factual Drosophila/genetics *DNA Replication *Evolution, Molecular Fungal Proteins/genetics Models, Genetic *Sequence Analysis, DNA Sequence Homology, Nucleic Acid
DNA replication is central to all extant cellular organisms. There are substantial functional similarities between the bacterial and the archaeal/eukaryotic replication machineries, including but not limited to defined origins, replication bidirectionality, RNA primers and leading and lagging strand synthesis. However, several core components of the bacterial replication machinery are unrelated or only distantly related to the functionally equivalent components of the archaeal/eukaryotic replication apparatus. This is in sharp contrast to the principal proteins involved in transcription and translation, which are highly conserved in all divisions of life. We performed detailed sequence comparisons of the proteins that fulfill indispensable functions in DNA replication and classified them into four main categories with respect to the conservation in bacteria and archaea/eukaryotes: (i) non-homologous, such as replicative polymerases and primases; (ii) containing homologous domains but apparently non-orthologous and conceivably independently recruited to function in replication, such as the principal replicative helicases or proofreading exonucleases; (iii) apparently orthologous but poorly conserved, such as the sliding clamp proteins or DNA ligases; (iv) orthologous and highly conserved, such as clamp-loader ATPases or 5'-->3' exonucleases (FLAP nucleases).
The universal conservation of some components of the DNA replication machinery and enzymes for DNA precursor biosynthesis but not the principal DNA polymerases suggests that the last common ancestor (LCA) of all modern cellular life forms possessed DNA but did not replicate it the way extant cells do. We propose that the LCA had a genetic system that contained both RNA and DNA, with the latter being produced by reverse transcription. Consequently, the modern-type system for double-stranded DNA replication likely evolved independently in the bacterial and archaeal/eukaryotic lineages.
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, Bethesda, MD 20894, USA.
0010446225
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010446225
http://www.oup.co.uk/nar/Volume_27/Issue_17/gkc515_gml.abs.html
Nucleic Acids Res 1999 27 17 3389-401

Leipe, D. D. Aravind, L. Grishin, N. V. Koonin, E. V.
The bacterial replicative helicase DnaB evolved from a RecA duplication
Amino Acid Sequence Animal Bacterial Proteins/metabolism Computational Biology Conserved Sequence DNA Helicases/*genetics/metabolism DNA Replication/*genetics DNA, Bacterial/*metabolism Evolution, Molecular *Gene Duplication Human Molecular Sequence Data Protein Folding Rec A Protein/*genetics/metabolism Sequence Alignment Sequence Homology, Amino Acid
The RecA/Rad51/DMC1 family of ATP-dependent recombinases plays a crucial role in genetic recombination and double-stranded DNA break repair in Archaea, Bacteria, and Eukaryota. DnaB is the replication fork helicase in all Bacteria. We show here that DnaB shares significant sequence similarity with RecA and Rad51/DMC1 and two other related families of ATPases, Sms and KaiC. The conserved region spans the entire ATP- and DNA-binding domain that consists of about 250 amino acid residues and includes 7 distinct motifs.
Comparison with the three-dimensional structure of Escherichia coli RecA and phage T7 DnaB (gp4) reveals that the area of sequence conservation includes the central parallel beta-sheet and most of the connecting helices and loops as well as a smaller domain that consists of an amino-terminal helix and a carboxy-terminal beta-meander. Additionally, we show that animals, plants, and the malarial Plasmodium but not Saccharomyces cerevisiae encode a previously undetected DnaB homolog that might function in the mitochondria. The DnaB homolog from Arabidopsis also contains a DnaG-primase domain and the DnaB homolog from the nematode seems to contain an inactivated version of the primase. This domain organization is reminiscent of bacteriophage primases-helicases and suggests that DnaB might have been horizontally introduced into the nuclear eukaryotic genome via a phage vector. We hypothesize that DnaB originated from a duplication of a RecA-like ancestor after the divergence of the bacteria from Archaea and eukaryotes, which indicates that the replication fork helicases in Bacteria and Archaea/Eukaryota have evolved independently.
National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, Bethesda Maryland 20894 USA.
0010645945
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010645945
http://www.genome.org/cgi/content/full/10/1/5
http://www.genome.org/cgi/content/abstract/10/1/5
Genome Res 2000 10 1 5-16

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010759560
Stawiski, E. W. Baucom, A. E. Lohr, S. C. Gregoret, L. M.
Predicting protein function from structure: unique structural features of proteases
Endopeptidases/*chemistry/*metabolism Protein Structure, Secondary Structure-Activity Relationship Support, Non-U.S. Gov't Support, U.S. Gov't, P.H.S.
We have noted consistent structural similarities among unrelated proteases.
In comparison with other proteins of similar size, proteases have smaller than average surface areas, smaller radii of gyration, and higher C(alpha) densities. These findings imply that proteases are, as a group, more tightly packed than other proteins. There are also notable differences in secondary structure content between these two groups of proteins: proteases have fewer helices and more loops. We speculate that both high packing density and low alpha-helical content coevolved in proteases to avoid autolysis. By using the structural parameters that seem to show some separation between proteases and nonproteases, a neural network has been trained to predict protease function with over 86% accuracy. Moreover, it is possible to identify proteases whose folds were not represented during training. Similar structural analyses may be useful for identifying other classes of proteins and may be of great utility for categorizing the flood of structures soon to flow from structural genomics initiatives.
Graduate Program in Molecular, Cellular, and Developmental Biology, Department of Biology, University of California, Santa Cruz, CA 95064, USA.
0010759560
Proc Natl Acad Sci U S A 2000 97 8 3954-8

Subramanian, G. Koonin, E. V. Aravind, L.
Comparative genome analysis of the pathogenic spirochetes Borrelia burgdorferi and Treponema pallidum
Amino Acid Sequence Bacterial Proteins/analysis/chemistry Borrelia burgdorferi/*genetics/metabolism Comparative Study Cyclic AMP/physiology Evolution *Genome, Bacterial Molecular Sequence Data Signal Transduction Treponema pallidum/*genetics/metabolism
A comparative analysis of the predicted protein sequences encoded in the complete genomes of Borrelia burgdorferi and Treponema pallidum provides a number of insights into evolutionary trends and adaptive strategies of the two spirochetes. A measure of orthologous relationships between gene sets, termed the orthology coefficient (OC), was developed.
The overall OC value for the gene sets of the two spirochetes is about 0.43, which means that less than one-half of the genes show readily detectable orthologous relationships. This emphasizes significant divergence between the two spirochetes, apparently driven by different biological niches. Different functional categories of proteins as well as different protein families show a broad distribution of OC values, from near 1 (a perfect, one-to-one correspondence) to near 0. The proteins involved in core biological functions, such as genome replication and expression, typically show high OC values. In contrast, marked variability is seen among proteins that are involved in specific processes, such as nutrient transport, metabolism, gene-specific transcription regulation, signal transduction, and host response. Differences in the gene complements encoded in the two spirochete genomes suggest active adaptive evolution for their distinct niches. Comparative analysis of the spirochete genomes produced evidence of gene exchanges with other bacteria, archaea, and eukaryotic hosts that seem to have occurred at different points in the evolution of the spirochetes. Examples are presented of the use of sequence profile analysis to predict proteins that are likely to play a role in pathogenesis, including secreted proteins that contain specific protein-protein interaction domains, such as von Willebrand A, YWTD, TPR, and PR1, some of which hitherto have been reported only in eukaryotes. We tentatively reconstruct the likely evolutionary process that has led to the divergence of the two spirochete lineages; this reconstruction seems to point to an ancestral state resembling the symbiotic spirochetes found in insect guts.
Laboratory of Parasitic Diseases, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland 20894, USA.
0010678983
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010678983
http://iai.asm.org/cgi/content/full/68/3/1633
http://iai.asm.org/cgi/content/abstract/68/3/1633
Infect Immun 2000 68 3 1633-48

R Das M Gerstein
The Stability of Thermophilic Proteins: A Study Based on Comprehensive Genome Comparison
Functional & Integrative Genomics 2000 1 33-45

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0009537320
Deckert, G. Warren, P. V. Gaasterland, T. Young, W. G. Lenox, A. L. Graham, D. E. Overbeek, R. Snead, M. A. Keller, M. Aujay, M. Huber, R. Feldman, R. A. Short, J. M. Olsen, G. J. Swanson, R. V.
The complete genome of the hyperthermophilic bacterium Aquifex aeolicus
Chromosome Mapping Chromosomes, Bacterial Citric Acid Cycle DNA Repair DNA, Bacterial/biosynthesis/genetics *Genome, Bacterial Gram-Negative Aerobic Rods and Cocci/*genetics/metabolism Molecular Sequence Data Oxidative Stress Phylogeny Support, U.S. Gov't, Non-P.H.S.
Temperature Transcription, Genetic Translation, Genetic
Aquifex aeolicus was one of the earliest diverging, and is one of the most thermophilic, bacteria known. It can grow on hydrogen, oxygen, carbon dioxide, and mineral salts. The complex metabolic machinery needed for A. aeolicus to function as a chemolithoautotroph (an organism which uses an inorganic carbon source for biosynthesis and an inorganic chemical energy source) is encoded within a genome that is only one-third the size of the E. coli genome. Metabolic flexibility seems to be reduced as a result of the limited genome size. The use of oxygen (albeit at very low concentrations) as an electron acceptor is allowed by the presence of a complex respiratory apparatus. Although this organism grows at 95 degrees C, the extreme thermal limit of the Bacteria, only a few specific indications of thermophily are apparent from the genome. Here we describe the complete genome sequence of 1,551,335 base pairs of this evolutionarily and physiologically interesting organism.
Diversa Corporation, San Diego, California 92121, USA.
0009537320
Nature 1998 392 6674 353-8

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0009381177
DeRisi, J. L. Iyer, V. R. Brown, P. O.
Exploring the metabolic and genetic control of gene expression on a genomic scale
Citric Acid Cycle Culture Media DNA-Binding Proteins/genetics/metabolism Fermentation Fungal Proteins/genetics/metabolism *Gene Expression Regulation, Fungal Genes, Fungal Genes, Regulator *Genome, Fungal Gluconeogenesis Glucose/metabolism Glyoxylates/metabolism Open Reading Frames Oxygen Consumption Repressor Proteins/genetics/metabolism RNA, Fungal/genetics/metabolism RNA, Messenger/genetics/metabolism Saccharomyces cerevisiae/growth & development/*genetics/*metabolism Support, Non-U.S. Gov't Support, U.S. Gov't, P.H.S.
Transcription Factors/genetics/metabolism
DNA microarrays containing virtually every gene of Saccharomyces cerevisiae were used to carry out a comprehensive investigation of the temporal program of gene expression accompanying the metabolic shift from fermentation to respiration. The expression profiles observed for genes with known metabolic functions pointed to features of the metabolic reprogramming that occur during the diauxic shift, and the expression patterns of many previously uncharacterized genes provided clues to their possible functions. The same DNA microarrays were also used to identify genes whose expression was affected by deletion of the transcriptional co-repressor TUP1 or overexpression of the transcriptional activator YAP1. These results demonstrate the feasibility and utility of this approach to genomewide exploration of gene expression patterns.
Department of Biochemistry, Stanford University School of Medicine, Howard Hughes Medical Institute, Stanford, CA 94305-5428, USA.
0009381177
Science 1997 278 5338 680-6

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010944397
Devos, D. Valencia, A.
Practical limits of function prediction
Amino Acid Sequence Binding Sites Models, Molecular Molecular Sequence Data Protein Conformation Proteins/*chemistry Support, Non-U.S. Gov't
The widening gap between known protein sequences and their functions has led to the practice of assigning a potential function to a protein on the basis of sequence similarity to proteins whose function has been experimentally investigated. We present here a critical view of the theoretical and practical bases for this approach. The results obtained by analyzing a significant number of true sequence similarities, derived directly from structural alignments, point to the complexity of function prediction.
Different aspects of protein function, including (i) enzymatic function classification, (ii) functional annotations in the form of key words, (iii) classes of cellular function, and (iv) conservation of binding sites can only be reliably transferred between similar sequences to a modest degree. The reason for this difficulty is a combination of the unavoidable database inaccuracies and the plasticity of protein function. In addition, analysis of the relationship between sequence and functional descriptions defines an empirical limit for pairwise-based functional annotations, namely, the three first digits of the six numbers used as descriptors of protein folds in the FSSP database can be predicted at an average level as low as 7.5% sequence identity, two of the four EC digits at 15% identity, half of the SWISS-PROT key words related to protein function would require 20% identity, and the prediction of half of the residues in the binding site can be made at the 30% sequence identity level.
Protein Design Group, CNB-CSIC, Madrid, Spain.
0010944397
Proteins 2000 41 1 98-107

Shapiro, L. Harris, T.
Finding function through structural genomics
Amino Acid Sequence Animal Biotechnology *Computational Biology Conserved Sequence *Genome Human Molecular Sequence Data Protein Conformation Proteins/*chemistry/genetics/*metabolism Sequence Alignment Structure-Activity Relationship
The recent availability of whole-genome sequences and large numbers of protein-coding regions from high-throughput cDNA analysis has fundamentally changed experimental biology. These efforts have provided huge databases of protein sequences, many of which are of unknown function.
Deciphering the functions of these myriad proteins presents a major intellectual challenge.
Structural Biology Program, Department of Physiology and Biophysics, Mount Sinai School of Medicine, New York, NY 10029, USA. shapiro@anguilla.physbio.mssm.edu
0010679341
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010679341
http://www.biomednet.com/article/btb107
Curr Opin Biotechnol 2000 11 1 31-5

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0009371463
Smith, D. R. Doucette-Stamm, L. A. Deloughery, C. Lee, H. Dubois, J. Aldredge, T. Bashirzadeh, R. Blakely, D. Cook, R. Gilbert, K. Harrison, D. Hoang, L. Keagle, P. Lumm, W. Pothier, B. Qiu, D. Spadafora, R. Vicaire, R. Wang, Y. Wierzbowski, J. Gibson, R. Jiwani, N. Caruso, A. Bush, D. Reeve, J. N. et al.
Complete genome sequence of Methanobacterium thermoautotrophicum deltaH: functional analysis and comparative genomics
50% identical to M. jannaschii polypeptides, and there is little conservation in the relative locations of orthologous genes. When the M. thermoautotrophicum ORFs are compared to sequences from only the eucaryal and bacterial domains, 786 (42%) are more similar to bacterial sequences and 241 (13%) are more similar to eucaryal sequences. The bacterial domain-like gene products include the majority of those predicted to be involved in cofactor and small molecule biosyntheses, intermediary metabolism, transport, nitrogen fixation, regulatory functions, and interactions with the environment. Most proteins predicted to be involved in DNA metabolism, transcription, and translation are more similar to eucaryal sequences. Gene structure and organization have features that are typical of the Bacteria, including genes that encode polypeptides closely related to eucaryal proteins.
There are 24 polypeptides that could form two-component sensor kinase-response regulator systems and homologs of the bacterial Hsp70-response proteins DnaK and DnaJ, which are notably absent in M. jannaschii. DNA replication initiation and chromosome packaging in M. thermoautotrophicum are predicted to have eucaryal features, based on the presence of two Cdc6 homologs and three histones; however, the presence of an ftsZ gene indicates a bacterial type of cell division initiation. The DNA polymerases include an X-family repair type and an unusual archaeal B type formed by two separate polypeptides. The DNA-dependent RNA polymerase (RNAP) subunits A', A", B', B" and H are encoded in a typical archaeal RNAP operon, although a second A' subunit-encoding gene is present at a remote location. There are two rRNA operons, and 39 tRNA genes are dispersed around the genome, although most of these occur in clusters. Three of the tRNA genes have introns, including the tRNAPro (GGG) gene, which contains a second intron at an unprecedented location. There is no selenocysteinyl-tRNA gene nor evidence for classically organized IS elements, prophages, or plasmids. The genome contains one intein and two extended repeats (3.6 and 8.6 kb) that are members of a family with 18 representatives in the M. jannaschii genome.
Genome Therapeutics Corporation, Collaborative Research Division, Waltham, Massachusetts 02154, USA. doug.smith@genomecorp.com
0009371463
J Bacteriol 1997 179 22 7135-55

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0007542800
Fleischmann, R. D. Adams, M. D. White, O. Clayton, R. A. Kirkness, E. F. Kerlavage, A. R. Bult, C. J. Tomb, J. F. Dougherty, B. A. Merrick, J. M.
et al.
Whole-genome random sequencing and assembly of Haemophilus influenzae Rd [see comments]
Bacterial Proteins/genetics; Base Composition; Base Sequence; *Chromosome Mapping/methods; Chromosomes, Bacterial; Cloning, Molecular; Costs and Cost Analysis; Databases, Factual; DNA, Bacterial/*genetics; Genes, Bacterial; *Genome, Bacterial; Haemophilus influenzae/*genetics/physiology; Molecular Sequence Data; Operon; Repetitive Sequences, Nucleic Acid; RNA, Bacterial/genetics; RNA, Ribosomal/genetics; *Sequence Analysis, DNA/methods; Software; Support, Non-U.S. Gov't
An approach for genome analysis based on sequencing and assembly of unselected pieces of DNA from the whole chromosome has been applied to obtain the complete nucleotide sequence (1,830,137 base pairs) of the genome from the bacterium Haemophilus influenzae Rd. This approach eliminates the need for initial mapping efforts and is therefore applicable to the vast array of microbial species for which genome maps are unavailable. The H. influenzae Rd genome sequence (Genome Sequence DataBase accession number L42023) represents the only complete genome sequence from a free-living organism.
Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA.
Comment in: Science 1995 Jul 28;269(5223):468-70
Comment in: Science 1995 Sep 29;269(5232):1805
Comment in: Science 1996 Mar 1;271(5253):1302; discussion 1303-4
Comment in: Science 1996 Mar 1;271(5253):1302-3; discussion 1303-4
0007542800
Science 1995;269(5223):496-512

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0007569993
Fraser, C. M.; Gocayne, J. D.; White, O.; Adams, M. D.; Clayton, R. A.; Fleischmann, R. D.; Bult, C. J.; Kerlavage, A. R.; Sutton, G.; Kelley, J. M.;
et al.
The minimal gene complement of Mycoplasma genitalium [see comments]
Antigenic Variation/genetics; Bacterial Proteins/genetics; Biological Transport/genetics; Databases, Factual; DNA Repair/genetics; DNA Replication/genetics; DNA, Bacterial/genetics; Energy Metabolism/genetics; Genes, Bacterial; *Genome, Bacterial; Haemophilus influenzae/genetics; Molecular Sequence Data; Mycoplasma/*genetics/immunology/metabolism; Open Reading Frames; *Sequence Analysis, DNA; Support, Non-U.S. Gov't; Support, U.S. Gov't, Non-P.H.S.; Support, U.S. Gov't, P.H.S.; Transcription, Genetic; Translation, Genetic
The complete nucleotide sequence (580,070 base pairs) of the Mycoplasma genitalium genome, the smallest known genome of any free-living organism, has been determined by whole-genome random sequencing and assembly. A total of only 470 predicted coding regions were identified that include genes required for DNA replication, transcription and translation, DNA repair, cellular transport, and energy metabolism. Comparison of this genome to that of Haemophilus influenzae suggests that differences in genome content are reflected as profound differences in physiology and metabolic capacity between these two organisms.
Institute for Genomic Research, Rockville, MD 20850, USA.
Comment in: Science 1995 Oct 20;270(5235):445-6
Comment in: Science 1996 Mar 1;271(5253):1302; discussion 1303-4
Comment in: Science 1996 Mar 1;271(5253):1302-3; discussion 1303-4
Comment in: Science 1996 May 3;272(5262):745-6
0007569993
Science 1995;270(5235):397-403

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0009403685
Fraser, C. M.; Casjens, S.; Huang, W. M.; Sutton, G. G.; Clayton, R.; Lathigra, R.; White, O.; Ketchum, K. A.; Dodson, R.; Hickey, E. K.; Gwinn, M.; Dougherty, B.; Tomb, J. F.; Fleischmann, R. D.; Richardson, D.; Peterson, J.; Kerlavage, A. R.; Quackenbush, J.; Salzberg, S.; Hanson, M.; van Vugt, R.; Palmer, N.; Adams, M. D.; Gocayne, J.; Venter, J. C.;
et al.
Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi [see comments]
Biological Transport; Borrelia burgdorferi/*genetics; Chemotaxis; Chromosomes, Bacterial; DNA Repair; DNA, Bacterial/biosynthesis/genetics; Energy Metabolism; Gene Expression Regulation, Bacterial; *Genome, Bacterial; Lyme Disease/microbiology; Membrane Proteins/genetics; Molecular Sequence Data; Plasmids; Recombination, Genetic; Replication Origin; Support, Non-U.S. Gov't; Telomere; Transcription, Genetic; Translation, Genetic
The genome of the bacterium Borrelia burgdorferi B31, the aetiologic agent of Lyme disease, contains a linear chromosome of 910,725 base pairs and at least 17 linear and circular plasmids with a combined size of more than 533,000 base pairs. The chromosome contains 853 genes encoding a basic set of proteins for DNA replication, transcription, translation, solute transport and energy metabolism, but, like Mycoplasma genitalium, it contains no genes for cellular biosynthetic reactions. Because B. burgdorferi and M. genitalium are distantly related eubacteria, we suggest that their limited metabolic capacities reflect convergent evolution by gene loss from more metabolically competent progenitors. Of 430 genes on 11 plasmids, most have no known biological function; 39% of plasmid genes are paralogues that form 47 gene families. The biological significance of the multiple plasmid-encoded genes is not clear, although they may be involved in antigenic variation or immune evasion.
The Institute for Genomic Research, Rockville, Maryland 20850, USA.
Comment in: Nature 1997 Dec 11;390(6660):553, 555
0009403685
Nature 1997;390(6660):580-6

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0009665876
Fraser, C. M.; Norris, S. J.; Weinstock, G. M.; White, O.; Sutton, G. G.; Dodson, R.; Gwinn, M.; Hickey, E. K.; Clayton, R.; Ketchum, K. A.; Sodergren, E.; Hardham, J. M.; McLeod, M. P.; Salzberg, S.; Peterson, J.; Khalak, H.; Richardson, D.; Howell, J. K.;
Chidambaram, M.; Utterback, T.; McDonald, L.; Artiach, P.; Bowman, C.; Cotton, M. D.; Venter, J. C.; et al.
Complete genome sequence of Treponema pallidum, the syphilis spirochete [see comments]
Bacterial Proteins/genetics/metabolism; Base Sequence; Borrelia burgdorferi/genetics; Carrier Proteins/genetics/metabolism; DNA Repair/genetics; DNA Replication/genetics; DNA Restriction Enzymes/genetics; Energy Metabolism/genetics; Genes, Bacterial; Genes, Regulator; *Genome, Bacterial; Heat-Shock Response/genetics; Lipoproteins/genetics; Membrane Proteins/genetics; Molecular Sequence Data; Movement; Open Reading Frames; Oxygen Consumption/genetics; Recombination, Genetic; Replication Origin; *Sequence Analysis, DNA; Support, U.S. Gov't, P.H.S.; Transcription, Genetic; Translation, Genetic; Treponema pallidum/*genetics/metabolism/pathogenicity
The complete genome sequence of Treponema pallidum was determined and shown to be 1,138,006 base pairs containing 1041 predicted coding sequences (open reading frames). Systems for DNA replication, transcription, translation, and repair are intact, but catabolic and biosynthetic activities are minimized. The number of identifiable transporters is small, and no phosphoenolpyruvate:phosphotransferase carbohydrate transporters were found. Potential virulence factors include a family of 12 potential membrane proteins and several putative hemolysins. Comparison of the T. pallidum genome sequence with that of another pathogenic spirochete, Borrelia burgdorferi, the agent of Lyme disease, identified unique and common genes and substantiates the considerable diversity observed among pathogenic spirochetes.
Institute for Genomic Research, Rockville, MD 20850, USA. tpdb@tigr.org
Comment in: Science 1998 Jul 17;281(5375):324-5
0009665876
Science 1998;281(5375):375-88

Frishman, D.; Mewes, H.-W. 1997. Protein structural classes in five complete genomes. Nature Struct.
Biol. 4:626-628

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0009679203
Kawarabayasi, Y.; Sawada, M.; Horikawa, H.; Haikawa, Y.; Hino, Y.; Yamamoto, S.; Sekine, M.; Baba, S.; Kosugi, H.; Hosoyama, A.; Nagai, Y.; Sakai, M.; Ogura, K.; Otsuka, R.; Nakazawa, H.; Takamiya, M.; Ohfuku, Y.; Funahashi, T.; Tanaka, T.; Kudoh, Y.; Yamazaki, J.; Kushida, N.; Oguchi, A.; Aoki, K.; Kikuchi, H.
Complete sequence and gene organization of the genome of a hyper-thermophilic archaebacterium, Pyrococcus horikoshii OT3 (supplement)
Chromosome Mapping; Chromosomes, Archaeal; *Genes, Archaeal; *Genome; Open Reading Frames; Pyrococcus/*genetics; RNA, Archaeal/genetics; RNA, Ribosomal/genetics; RNA, Transfer/genetics
National Institute of Technology and Evaluation, Tokyo, Japan. kyutaka@kazusa.or.jp
0009679203
DNA Res 1998;5(2):147-55

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0009389475
Klenk, H. P.; Clayton, R. A.; Tomb, J. F.; White, O.; Nelson, K. E.; Ketchum, K. A.; Dodson, R. J.; Gwinn, M.; Hickey, E. K.; Peterson, J. D.; Richardson, D. L.; Kerlavage, A. R.; Graham, D. E.; Kyrpides, N. C.; Fleischmann, R. D.; Quackenbush, J.; Lee, N. H.; Sutton, G. G.; Gill, S.; Kirkness, E. F.; Dougherty, B. A.; McKenney, K.; Adams, M. D.; Loftus, B.; Venter, J. C.; et al.
The complete genome sequence of the hyperthermophilic, sulphate-reducing archaeon Archaeoglobus fulgidus [published erratum appears in Nature 1998 Jul 2;394(6688):101]
Archaeoglobus fulgidus/*genetics/metabolism/physiology; Base Sequence; Cell Division; DNA, Bacterial/genetics; Energy Metabolism; Gene Expression Regulation, Bacterial; *Genes, Archaeal; *Genome; Molecular Sequence Data; Support, U.S. Gov't, Non-P.H.S.; Transcription, Genetic; Translation, Genetic
Archaeoglobus fulgidus is the first sulphur-metabolizing organism to have its genome sequence determined. Its genome of 2,178,400 base pairs contains 2,436 open reading frames (ORFs).
The information processing systems and the biosynthetic pathways for essential components (nucleotides, amino acids and cofactors) have extensive correlation with their counterparts in the archaeon Methanococcus jannaschii. The genomes of these two Archaea indicate dramatic differences in the way these organisms sense their environment, perform regulatory and transport functions, and gain energy. In contrast to M. jannaschii, A. fulgidus has fewer restriction-modification systems, and none of its genes appears to contain inteins. A quarter (651 ORFs) of the A. fulgidus genome encodes functionally uncharacterized yet conserved proteins, two-thirds of which are shared with M. jannaschii (428 ORFs). Another quarter of the genome encodes new proteins indicating substantial archaeal gene diversity.
The Institute for Genomic Research, Rockville, Maryland 20850, USA.
0009389475
Nature 1997;390(6658):364-70

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0009384377
Kunst, F.; Ogasawara, N.; Moszer, I.; Albertini, A. M.; Alloni, G.; Azevedo, V.; Bertero, M. G.; Bessieres, P.; Bolotin, A.; Borchert, S.; Borriss, R.; Boursier, L.; Brans, A.; Braun, M.; Brignell, S. C.; Bron, S.; Brouillet, S.; Bruschi, C. V.; Caldwell, B.; Capuano, V.; Carter, N. M.; Choi, S. K.; Codani, J. J.; Connerton, I. F.; Danchin, A.; et al.
The complete genome sequence of the gram-positive bacterium Bacillus subtilis [see comments]
Bacillus subtilis/*genetics/metabolism; Bacterial Proteins/genetics; Cloning, Organism; DNA, Bacterial; *Genome, Bacterial; Molecular Sequence Data; Support, Non-U.S. Gov't
Bacillus subtilis is the best-characterized member of the Gram-positive bacteria. Its genome of 4,214,810 base pairs comprises 4,100 protein-coding genes.
Of these protein-coding genes, 53% are represented once, while a quarter of the genome corresponds to several gene families that have been greatly expanded by gene duplication, the largest family containing 77 putative ATP-binding transport proteins. In addition, a large proportion of the genetic capacity is devoted to the utilization of a variety of carbon sources, including many plant-derived molecules. The identification of five signal peptidase genes, as well as several genes for components of the secretion apparatus, is important given the capacity of Bacillus strains to secrete large amounts of industrially important enzymes. Many of the genes are involved in the synthesis of secondary metabolites, including antibiotics, that are more typically associated with Streptomyces species. The genome contains at least ten prophages or remnants of prophages, indicating that bacteriophage infection has played an important evolutionary role in horizontal gene transfer, in particular in the propagation of bacterial pathogenesis.
Institut Pasteur, Unite de Biochimie Microbienne, Paris, France. fkunst@pasteur.fr
Comment in: Nature 1997 Nov 20;390(6657):237-8
0009384377
Nature 1997;390(6657):249-56

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010206909
Lake, J. A.; Jain, R.; Rivera, M. C.
Mix and match in the tree of life
Bacteria/*genetics; *Evolution, Molecular; Genes, rRNA; Genes, Archaeal; Genes, Bacterial; Genes, Fungal; *Genome; Methanococcus/genetics; Phylogeny; *Recombination, Genetic; RNA, Ribosomal/genetics; Saccharomyces/*genetics
Molecular Biology Institute, University of California, Los Angeles, CA 90095, USA. lake@mbi.ucla.edu
0010206909
Science 1999;283(5410):2027-8

Teichmann, S. A.; Park, J.;
Chothia, C.
Structural assignments to the Mycoplasma genitalium proteins show extensive gene duplications and domain rearrangements
Bacterial Proteins/*genetics; *Gene Duplication; *Gene Rearrangement; Genes, Bacterial; Mycoplasma/*genetics; Support, Non-U.S. Gov't
The parasitic bacterium Mycoplasma genitalium has a small, reduced genome with close to a basic set of genes. As a first step toward determining the families of protein domains that form the products of these genes, we have used the multiple sequence programs PSI-BLAST and GEANFAMMER to match the sequences of the 467 gene products of M. genitalium to the sequences of the domains that form proteins of known structure [Protein Data Bank (PDB) sequences]. PDB sequences (274) match all of 106 M. genitalium sequences and some parts of another 85; thus, 41% of its total sequences are matched in all or part. The evolutionary relationships of the PDB domains that match M. genitalium are described in the structural classification of proteins (SCOP) database. Using this information, we show that the domains in the matched M. genitalium sequences come from 114 superfamilies and that 58% of them have arisen by gene duplication. This level of duplication is more than twice that found by using pairwise sequence comparisons. The PDB domain matches also describe the domain structure of the matched sequences: just over a quarter contain one domain and the rest have combinations of two or more domains.
Medical Research Council Laboratory of Molecular Biology, Hills Road, Cambridge, CB2 2QH, United Kingdom. sat@mrc-lmb.cam.ac.uk
0009843945
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0009843945
http://www.pnas.org/cgi/content/full/95/25/14658
Proc Natl Acad Sci U S A 1998;95(25):14658-63

Teichmann, S. A.; Chothia, C.; Gerstein, M.
Advances in structural genomics
Animal; Computational Biology/*methods/*trends; *Genome; Protein Conformation; Proteins/*chemistry/*genetics; Support, Non-U.S.
Gov't
New computational techniques have allowed protein folds to be assigned to all or parts of between a quarter (Caenorhabditis elegans) and a half (Mycoplasma genitalium) of the individual protein sequences in different genomes. These assignments give a new perspective on domain structures, gene duplications, protein families and protein folds in genome sequences.
MRC Laboratory of Molecular Biology, Hills Road, Cambridge, CB2 2QH, UK. sat@mrc-lmb.cam.ac.uk
0010361097
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010361097
http://www.biomednet.com/article/sb9314
Curr Opin Struct Biol 1999;9(3):390-9

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010529349
Thornton, J. M.; Orengo, C. A.; Todd, A. E.; Pearl, F. M.
Protein folds, functions and evolution
Animal; Databases, Factual; Enzymes/chemistry/classification/metabolism; *Evolution, Molecular; Genome; Human; Phylogeny; *Protein Folding; Protein Structure, Secondary; Proteins/*chemistry/classification/*metabolism; Structure-Activity Relationship; Support, Non-U.S. Gov't
The evolution of proteins and their functions is reviewed from a structural perspective in the light of the current database. Protein domain families segregate unequally between the three major classes, the 32 different architectures and almost 700 folds observed to date. We find that the number of new topologies is still increasing, although 25 new structures are now determined for each new topology. The corresponding analysis and classification of function is only just beginning, fuelled by the genome data. The structural data revealed unexpected conservations and divergence of function both within and between families. The next five years will see the compilation of a definitive dictionary of protein families and their related functions, based on structural data which reveals relationships hidden at the sequence level.
Such information will provide the foundation to build a better understanding of the molecular basis of biological complexity and hopefully to facilitate rational molecular design. Copyright 1999 Academic Press.
Biochemistry and Molecular Biology Department, University College London, University of London, Gower Street, London, WC1E 6BT, UK. thornton@biochem.ucl.ac.uk
0010529349
J Mol Biol 1999;293(2):333-42

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010357579
Gerstein, M.; Hegyi, H.
Comparing genomes in terms of protein structure: surveys of a finite parts list
Amino Acid Sequence; Animal; Bacterial Proteins/*chemistry/*genetics; Comparative Study; Databases, Factual; Fungal Proteins/*chemistry/*genetics; *Genome, Bacterial; *Genome, Fungal; Genome, Human; Human; Molecular Sequence Data; Sequence Alignment; Support, Non-U.S. Gov't
We give an overview of the emerging field of structural genomics, describing how genomes can be compared in terms of protein structure. As the number of genes in a genome and the total number of protein folds are both quite limited, these comparisons take the form of surveys of a finite parts list, similar in respects to demographic censuses. Fold surveys have many similarities with other whole-genome characterizations, e.g., analyses of motifs or pathways. However, structure has a number of aspects that make it particularly suitable for comparing genomes, namely the way it allows for the precise definition of a basic protein module and the fact that it has a better defined relationship to sequence similarity than does protein function. An essential requirement for a structure survey is a library of folds, which groups the known structures into 'fold families.'
This library can be built up automatically using a structure comparison program, and we described how important objective statistical measures are for assessing similarities within the library and between the library and genome sequences. After building the library, one can use it to count the number of folds in genomes, expressing the results in the form of Venn diagrams and 'top-10' statistics for shared and common folds. Depending on the counting methodology employed, these statistics can reflect different aspects of the genome, such as the amount of internal duplication or gene expression. Previous analyses have shown that the common folds shared between very different microorganisms, i.e., in different kingdoms, have a remarkably similar structure, being comprised of repeated strand-helix-strand super-secondary structure units. A major difficulty with this sort of 'fold-counting' is that only a small subset of the structures in a complete genome are currently known and this subset is prone to sampling bias. One way of overcoming biases is through structure prediction, which can be applied uniformly and comprehensively to a whole genome. Various investigators have, in fact, already applied many of the existing techniques for predicting secondary structure and transmembrane (TM) helices to the recently sequenced genomes. The results have been consistent: microbial genomes have similar fractions of strands and helices even though they have significantly different amino acid composition. The fraction of membrane proteins with a given number of TM helices falls off rapidly with more TM elements, approximately according to a Zipf law. This latter finding indicates that there is no preference for the highly studied 7-TM proteins in microbial genomes. 
Continuously updated tables and further information pertinent to this review are available over the web at http://bioinfo.mbb.yale.edu/genome.
Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA. mark.gerstein@yale.edu
0010357579
FEMS Microbiol Rev 1998;22(4):277-304

M Gerstein, J Lin, H Hegyi. 2000. Protein Folds in the Worm Genome. Pac. Symp. Biocomp. 5:30-42.

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0009851916
Genome sequence of the nematode C. elegans: a platform for investigating biology. The C. elegans Sequencing Consortium [published errata appear in Science 1999 Jan 1;283(5398):35 and 1999 Mar 26;283(5410):2103 and 1999 Sep 3;285(5433):1493]
Animal; Caenorhabditis elegans/*genetics; Chromosomes/genetics; DNA, Helminth/*genetics; Evolution, Molecular; *Genes, Helminth; *Genome; Helminth Proteins/chemistry/genetics; Molecular Sequence Data; Physical Chromosome Mapping; RNA, Helminth/genetics; Repetitive Sequences, Nucleic Acid; *Sequence Analysis, DNA; Support, Non-U.S. Gov't
The 97-megabase genomic sequence of the nematode Caenorhabditis elegans reveals over 19,000 genes. More than 40 percent of the predicted protein products find significant matches in other organisms. There is a variety of repeated sequences, both local and dispersed. The distinctive distribution of some repeats and highly conserved genes provides evidence for a regional organization of the chromosomes.
The Washington University Genome Sequencing Center, Box 8501, 4444 Forest Park Parkway, St. Louis, MO 63108, USA. worm@watson.wustl.edu
0009851916
Science 1998;282(5396):2012-8

Altschul, S. F.; Madden, T. L.; Schaffer, A. A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D. J.
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
Algorithms; Amino Acid Sequence; Animal; *Databases, Factual; DNA/*chemistry; Human; Molecular Sequence Data; Proteins/*chemistry; *Sequence Alignment; *Software; Support, U.S.
Gov't, P.H.S.
The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities.
PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.
Nucleic Acids Res 1997;25(17):3389-402

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0009278503
Blattner, F. R.; Plunkett, G., 3rd; Bloch, C. A.; Perna, N. T.; Burland, V.; Riley, M.; Collado-Vides, J.; Glasner, J. D.; Rode, C. K.; Mayhew, G. F.; Gregor, J.; Davis, N. W.; Kirkpatrick, H. A.; Goeden, M. A.; Rose, D. J.; Mau, B.;
Shao, Y.
The complete genome sequence of Escherichia coli K-12 [comment] [see comments]
Bacterial Proteins/chemistry/genetics/metabolism; Bacteriophage lambda/genetics; Base Composition; Binding Sites; Chromosome Mapping; DNA Replication; DNA Transposable Elements; DNA, Bacterial/genetics; Escherichia coli/*genetics; Genes, Bacterial; *Genome, Bacterial; Molecular Sequence Data; Mutation; Operon; Recombination, Genetic; Regulatory Sequences, Nucleic Acid; Repetitive Sequences, Nucleic Acid; RNA, Bacterial/genetics; RNA, Transfer/genetics; *Sequence Analysis, DNA; Sequence Homology, Amino Acid; Support, Non-U.S. Gov't; Support, U.S. Gov't, P.H.S.
The 4,639,221-base pair sequence of Escherichia coli K-12 is presented. Of 4288 protein-coding genes annotated, 38 percent have no attributed function. Comparison with five other sequenced microbes reveals ubiquitous as well as narrowly distributed gene families; many families of similar genes within E. coli are also evident. The largest family of paralogous proteins contains 80 ABC transporters. The genome as a whole is strikingly organized with respect to the local direction of replication; guanines, oligonucleotides possibly related to replication and recombination, and most genes are so oriented. The genome also contains insertion sequence (IS) elements, phage remnants, and many other patches of unusual composition indicating genome plasticity through horizontal transfer.
Laboratory of Genetics, University of Wisconsin-Madison, 445 Henry Mall, Madison, WI 53706, USA. ecoli@genetics.wisc.edu
Comment on: Science 1997 Sep 5;277(5331):1432-4
Comment in: Science 1998 Mar 20;279(5368):1827
0009278503
Science 1997;277(5331):1453-74

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0008688087
Bult, C. J.; White, O.; Olsen, G. J.; Zhou, L.; Fleischmann, R. D.; Sutton, G. G.; Blake, J. A.; FitzGerald, L. M.; Clayton, R. A.; Gocayne, J. D.; Kerlavage, A. R.; Dougherty, B. A.; Tomb, J. F.; Adams, M. D.; Reich, C. I.; Overbeek, R.;
Kirkness, E. F.; Weinstock, K. G.; Merrick, J. M.; Glodek, A.; Scott, J. L.; Geoghagen, N. S. M.; Venter, J. C.
Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii [see comments]
Amino Acid Sequence; Bacterial Proteins/chemistry/*genetics; Base Composition; Base Sequence; Biological Transport/genetics; Carbon Dioxide/metabolism; Chromosome Mapping; Chromosomes, Bacterial/genetics; Databases, Factual; DNA Replication; DNA, Bacterial/*genetics; Energy Metabolism/genetics; Genes, Bacterial; *Genome, Bacterial; Hydrogen/metabolism; Methane/metabolism; Methanococcus/*genetics/physiology; Molecular Sequence Data; Sequence Analysis, DNA; Support, U.S. Gov't, Non-P.H.S.; Support, U.S. Gov't, P.H.S.; Transcription, Genetic; Translation, Genetic
The complete 1.66-megabase pair genome sequence of an autotrophic archaeon, Methanococcus jannaschii, and its 58- and 16-kilobase pair extrachromosomal elements have been determined by whole-genome random sequencing. A total of 1738 predicted protein-coding genes were identified; however, only a minority of these (38 percent) could be assigned a putative cellular role with high confidence. Although the majority of genes related to energy production, cell division, and metabolism in M. jannaschii are most similar to those found in Bacteria, most of the genes involved in transcription, translation, and replication in M. jannaschii are more similar to those found in Eukaryotes.
Microbiology Department, University of Illinois, Champaign-Urbana, IL 61801, USA.
Comment in: Science 1996 Aug 23;273(5278):1043-5
Comment in: Science 1996 Nov 8;274(5289):901; discussion 902-3
Comment in: Science 1996 Nov 8;274(5289):901-2; discussion 902-3
Comment in: Science 1997 Mar 7;275(5305):1489-90
Comment in: Science 1997 Jun 13;276(5319):1724-5
0008688087
Science 1996;273(5278):1058-73

Chothia, C.; Lesk, A.M. 1986. The relation between the divergence of sequence and structure in proteins. EMBO J.
5:823-826

Chothia, C. 1992. Proteins 1000 families for the molecular biologist. Nature 357:543-544

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0009634230
Cole, S. T.; Brosch, R.; Parkhill, J.; Garnier, T.; Churcher, C.; Harris, D.; Gordon, S. V.; Eiglmeier, K.; Gas, S.; Barry, C. E., 3rd; Tekaia, F.; Badcock, K.; Basham, D.; Brown, D.; Chillingworth, T.; Connor, R.; Davies, R.; Devlin, K.; Feltwell, T.; Gentles, S.; Hamlin, N.; Holroyd, S.; Hornsby, T.; Jagels, K.; Barrell, B. G.; et al.
Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence [see comments] [published erratum appears in Nature 1998 Nov 12;396(6707):190]
Chromosome Mapping; Chromosomes, Bacterial; Drug Resistance, Microbial; *Genome, Bacterial; Human; Lipids/metabolism; Molecular Sequence Data; Mycobacterium tuberculosis/*genetics/immunology/metabolism/pathogenicity; Sequence Analysis, DNA; Support, Non-U.S. Gov't; Tuberculosis/microbiology
Countless millions of people have died from tuberculosis, a chronic infectious disease caused by the tubercle bacillus. The complete genome sequence of the best-characterized strain of Mycobacterium tuberculosis, H37Rv, has been determined and analysed in order to improve our understanding of the biology of this slow-growing pathogen and to help the conception of new prophylactic and therapeutic interventions. The genome comprises 4,411,529 base pairs, contains around 4,000 genes, and has a very high guanine + cytosine content that is reflected in the biased amino-acid content of the proteins. M. tuberculosis differs radically from other bacteria in that a very large portion of its coding capacity is devoted to the production of enzymes involved in lipogenesis and lipolysis, and to two new families of glycine-rich proteins with a repetitive structure that may represent a source of antigenic variation.
Sanger Centre, Wellcome Trust Genome Campus, Hinxton, UK.
stcole@pasteur.fr
Comment in: Nature 1998 Jun 11;393(6685):515-6
0009634230
Nature 1998;393(6685):537-44

Dandekar, T. Schuster, S. Snel, B. Huynen, M. Bork, P.
Pathway alignment: application to the comparative analysis of glycolytic enzymes
Comparative Study Enzymes/*metabolism Glycolysis Support, Non-U.S. Gov't
Comparative analysis of metabolic pathways in different genomes yields important information on their evolution, on pharmacological targets and on biotechnological applications. In this study on glycolysis, three alternative ways of comparing biochemical pathways are combined: (1) analysis and comparison of biochemical data, (2) pathway analysis based on the concept of elementary modes, and (3) a comparative genome analysis of 17 completely sequenced genomes. The analysis reveals a surprising plasticity of the glycolytic pathway. Isoenzymes in different species are identified and compared; deviations from the textbook standard are detailed. Several potential pharmacological targets and by-passes (such as the Entner-Doudoroff pathway) to glycolysis are examined and compared in the different species. Archaean, bacterial and parasite specific adaptations are identified and described.
EMBL, P.O. Box 102209, D-69012 Heidelberg, Germany. dandekar@embl-heidelberg.de
0010493919
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010493919
http://www.biochemj.org/bj/343/0115/bj3430115.htm
http://www.biochemj.org/bj/343/bj3430115.htm
Biochem J 1999;343(Pt 1):115-24

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010192388
Kalman, S. Mitchell, W. Marathe, R. Lammel, C. Fan, J. Hyman, R. W. Olinger, L. Grimwood, J. Davis, R. W. Stephens, R. S.
Comparative genomes of Chlamydia pneumoniae and C.
trachomatis
Amino Acid Sequence Bacterial Proteins/*genetics/metabolism Chlamydia pneumoniae/*genetics/metabolism/pathogenicity Chlamydia trachomatis/*genetics/metabolism/pathogenicity Comparative Study Conserved Sequence Enzymes/genetics/metabolism *Genome, Bacterial Membrane Proteins/genetics/metabolism Molecular Sequence Data Operon Sequence Homology, Amino Acid Support, Non-U.S. Gov't Tryptophan/biosynthesis
Chlamydia are obligate intracellular eubacteria that are phylogenetically separated from other bacterial divisions. C. trachomatis and C. pneumoniae are both pathogens of humans but differ in their tissue tropism and spectrum of diseases. C. pneumoniae is a newly recognized species of Chlamydia that is a natural pathogen of humans, and causes pneumonia and bronchitis. In the United States, approximately 10% of pneumonia cases and 5% of bronchitis cases are attributed to C. pneumoniae infection. Chronic disease may result following respiratory-acquired infection, such as reactive airway disease, adult-onset asthma and potentially lung cancer. In addition, C. pneumoniae infection has been associated with atherosclerosis. C. trachomatis infection causes trachoma, an ocular infection that leads to blindness, and sexually transmitted diseases such as pelvic inflammatory disease, chronic pelvic pain, ectopic pregnancy and epididymitis. Although relatively little is known about C. trachomatis biology, even less is known concerning C. pneumoniae. Comparison of the C. pneumoniae genome with the C. trachomatis genome will provide an understanding of the common biological processes required for infection and survival in mammalian cells. Genomic differences are implicated in the unique properties that differentiate the two species in disease spectrum. Analysis of the 1,230,230-nt C. pneumoniae genome revealed 214 protein-coding sequences not found in C. trachomatis, most without homologues to other known sequences.
Prominent comparative findings include expansion of a novel family of 21 sequence-variant outer-membrane proteins, conservation of a type-III secretion virulence system, three serine/threonine protein kinases and a pair of paralogous phospholipase-D-like proteins, additional purine and biotin biosynthetic capability, a homologue for aromatic amino acid (tryptophan) hydroxylase and the loss of tryptophan biosynthesis genes.
Stanford DNA Sequencing and Technology Center, Stanford University, California 94305, USA.
0010192388
Nat Genet 1999;21(4):385-9

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0008905231
Kaneko, T. Sato, S. Kotani, H. Tanaka, A. Asamizu, E. Nakamura, Y. Miyajima, N. Hirosawa, M. Sugiura, M. Sasamoto, S. Kimura, T. Hosouchi, T. Matsuno, A. Muraki, A. Nakazaki, N. Naruo, K. Okumura, S. Shimpo, S. Takeuchi, C. Wada, T. Watanabe, A. Yamada, M. Yasuda, M. Tabata, S.
Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. II. Sequence determination of the entire genome and assignment of potential protein-coding regions
Bacterial Proteins/*genetics DNA Nucleotidyltransferases/metabolism *Genome, Bacterial Open Reading Frames Photosynthesis Sequence Analysis, DNA Support, Non-U.S. Gov't Synechocystis Group/enzymology/*genetics/physiology
The sequence determination of the entire genome of the Synechocystis sp. strain PCC6803 was completed. The total length of the genome finally confirmed was 3,573,470 bp, including the previously reported sequence of 1,003,450 bp from map position 64% to 92% of the genome. The entire sequence was assembled from the sequences of the physical map-based contigs of cosmid clones and of lambda clones and long PCR products which were used for gap-filling. The accuracy of the sequence was guaranteed by analysis of both strands of DNA through the entire genome.
The authenticity of the assembled sequence was supported by restriction analysis of long PCR products, which were directly amplified from the genomic DNA using the assembled sequence data. To predict the potential protein-coding regions, analysis of open reading frames (ORFs), analysis by the GeneMark program and similarity search to databases were performed. As a result, a total of 3,168 potential protein genes were assigned on the genome, in which 145 (4.6%) were identical to reported genes and 1,257 (39.6%) and 340 (10.8%) showed similarity to reported and hypothetical genes, respectively. The remaining 1,426 (45.0%) had no apparent similarity to any genes in databases. Among the potential protein genes assigned, 128 were related to the genes participating in photosynthetic reactions. The sum of the sequences coding for potential protein genes occupies 87% of the genome length. By adding rRNA and tRNA genes, therefore, the genome has a very compact arrangement of protein- and RNA-coding regions. A notable feature on the gene organization of the genome was that 99 ORFs, which showed similarity to transposase genes and could be classified into 6 groups, were found spread all over the genome, and at least 26 of them appeared to remain intact. The result implies that rearrangement of the genome occurred frequently during and after establishment of this species.
Kazusa DNA Research Institute, Chiba, Japan.
0008905231
DNA Res 1996;3(3):109-36

J Lin M Gerstein 2000 Whole-Genome Trees Based on the Occurrence of Folds and Orthologs: Implications for Comparing Genomes on Different Levels Genome Research (in press)

Makarova, K. S. Aravind, L. Galperin, M. Y. Grishin, N. V. Tatusov, R. L. Wolf, Y. I. Koonin, E.
V.
Comparative genomics of the Archaea (Euryarchaeota): evolution of conserved protein families, the stable core, and the variable shell
Amino Acid Sequence Archaeal Proteins/genetics Bacterial Proteins/genetics Comparative Study Conserved Sequence Eukaryotic Cells/metabolism Euryarchaeota/*genetics Evolution, Molecular Genes, Archaeal/genetics *Genome Phylogeny Sequence Alignment Sequence Homology, Amino Acid Support, U.S. Gov't, Non-P.H.S. Variation (Genetics)
Comparative analysis of the protein sequences encoded in the four euryarchaeal species whose genomes have been sequenced completely (Methanococcus jannaschii, Methanobacterium thermoautotrophicum, Archaeoglobus fulgidus, and Pyrococcus horikoshii) revealed 1326 orthologous sets, of which 543 are represented in all four species. The proteins that belong to these conserved euryarchaeal families comprise 31%-35% of the gene complement and may be considered the evolutionarily stable core of the archaeal genomes. The core gene set includes the great majority of genes coding for proteins involved in genome replication and expression, but only a relatively small subset of metabolic functions. For many gene families that are conserved in all euryarchaea, previously undetected orthologs in bacteria and eukaryotes were identified. A number of euryarchaeal synapomorphies (unique shared characters) were identified; these are protein families that possess sequence signatures or domain architectures that are conserved in all euryarchaea but are not found in bacteria or eukaryotes. In addition, euryarchaea-specific expansions of several protein and domain families were detected. In terms of their apparent phylogenetic affinities, the archaeal protein families split into bacterial and eukaryotic families. The majority of the proteins that have only eukaryotic orthologs or show the greatest similarity to their eukaryotic counterparts belong to the core set.
The families of euryarchaeal genes that are conserved in only two or three species constitute a relatively mobile component of the genomes whose evolution should have involved multiple events of lineage-specific gene loss and horizontal gene transfer. Frequently these proteins have detectable orthologs only in bacteria or show the greatest similarity to the bacterial homologs, which might suggest a significant role of horizontal gene transfer from bacteria in the evolution of the euryarchaeota.
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA.
0010413400
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010413400
http://www.genome.org/cgi/content/full/9/7/608
http://www.genome.org/cgi/content/abstract/9/7/608
Genome Res 1999;9(7):608-28

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010338021
Zhang, B. Rychlewski, L. Pawlowski, K. Fetrow, J. S. Skolnick, J. Godzik, A.
From fold predictions to function predictions: automation of functional site conservation analysis for functional genome predictions
*Algorithms Amino Acid Sequence Computer Simulation Databases, Factual Models, Molecular Molecular Sequence Data *Protein Folding Structure-Activity Relationship Support, U.S. Gov't, Non-P.H.S. Support, U.S. Gov't, P.H.S.
A database of functional sites for proteins with known structures, SITE, is constructed and used in conjunction with a simple pattern matching program SiteMatch to evaluate possible function conservation in a recently constructed database of fold predictions for Escherichia coli proteins (Rychlewski L et al., 1999, Protein Sci 8:614-624). In this and other prediction databases, fold predictions are based on algorithms that can recognize weak sequence similarities and putatively assign new proteins into already characterized protein families.
It is not clear whether such sequence similarities arise from distant homologies or general similarity of physicochemical features along the sequence. Leaving aside the important question of nature of relations within fold superfamilies, it is possible to assess possible function conservation by looking at the pattern of conservation of crucial functional residues. SITE consists of a multilevel function description based on structure annotations and structure analyses. In particular, active site residues, ligand binding residues, and patterns of hydrophobic residues on the protein surface are used to describe different functional features. SiteMatch, a simple pattern matching program, is designed to check the conservation of residues involved in protein activity in alignments generated by any alignment method. Here, this procedure is used to study conservation of functional features in alignments between protein sequences from the E. coli genome and their optimal structural templates. The optimal templates were identified and alignments taken from the database of genomic structural predictions was described in a previous publication (Rychlewski L et al., 1999, Protein Sci 8:614-624). An automated assessment of function conservation is used to analyze the relation between fold and function similarity for a large number of fold predictions. For instance, it is shown that identifying low significance predictions with a high level of functional residue conservations can be used to extend the prediction sensitivity for fold prediction methods. Over 100 new fold/function predictions in this class were obtained in the E. coli genome. At the same time, about 30% of our previous fold predictions are not confirmed as function predictions, further highlighting the problem of function divergence in fold superfamilies.
The Scripps Research Institute, La Jolla, California 92037, USA.
0010338021
Protein Sci 1999;8(5):1104-15

Orengo, C. A. Pearl, F. M. Bray, J. E. Todd, A. E.
Martin, A. C. Lo Conte, L. Thornton, J. M.
The CATH Database provides insights into protein structure/function relationships
Algorithms Amino Acid Sequence Computational Biology *Databases, Factual/trends Enzymes/chemistry/classification/physiology Evolution, Molecular Genome Internet Phylogeny Protein Folding Protein Structure, Secondary Proteins/*chemistry/classification/*physiology Sequence Homology, Amino Acid Structure-Activity Relationship
We report the latest release (version 1.4) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 13 359 protein domain structures into evolutionary families and structural groupings. We currently identify 827 homologous families in which the proteins have both structural similarity and sequence and/or functional similarity. These can be further clustered into 593 fold groups and 32 distinct architectures. Using our structural classification and associated data on protein functions, stored in the database (EC identifiers, SWISS-PROT keywords and information from the Enzyme database and literature) we have been able to analyse the correlation between the 3D structure and function. More than 96% of folds in the PDB are associated with a single homologous family. However, within the superfolds, three or more different functions are observed. Considering enzyme functions, more than 95% of clearly homologous families exhibit either single or closely related functions, as demonstrated by the EC identifiers of their relatives. Our analysis supports the view that determining structures, for example as part of a 'structural genomics' initiative, will make a major contribution to interpreting genome data.
Department of Biochemistry and Molecular Biology, Darwin Building, University College London, Gower Street, London WC1E 6BT, UK.
orengo@biochem.ucl.ac.uk
0009847200
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0009847200
http://www.oup.co.uk/nar/Volume_27/Issue_01/gkc089_gml.abs.html
Nucleic Acids Res 1999;27(1):275-9

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010902155
Pawlowski, K. Jaroszewski, L. Rychlewski, L. Godzik, A.
Sensitive sequence comparison as protein function predictor
Algorithms Bacterial Proteins/genetics/physiology Comparative Study Databases, Factual Escherichia coli/genetics Proteins/*genetics/*physiology Sensitivity and Specificity Sequence Alignment/*methods/statistics & numerical data Software Support, U.S. Gov't, P.H.S.
Protein function assignments based on postulated homology as recognized by high sequence similarity are used routinely in genome analysis. Improvements in sensitivity of sequence comparison algorithms got to the point, that proteins with previously undetectable sequence similarity, such as for instance 10-15% of identical residues, sometimes can be classified as similar. What is the relation between such proteins? Is it possible that they are homologous? What is the practical significance of detecting such similarities? A simplified analysis of the relation between sequence similarity and function similarity is presented here for the well-characterized proteins from the E. coli genome. Using a simple measure of functional similarity based on E.C. classification of enzymes, it is shown that it correlates well with sequence similarity measured by statistical significance of the alignment score. Proteins, similar by this standard, even in cases of low sequence identity, have a much larger chance of having similar function than the randomly chosen protein pairs.
Interesting exceptions to these rules are discussed.
Burnham Institute, La Jolla, CA 92037, USA.
0010902155
Pac Symp Biocomput 2000:42-53

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0008004177
Pearson, W. R.
Using the FASTA program to search protein and DNA sequence databases.
Algorithms Amino Acid Sequence Animal Base Sequence Comparative Study *Databases, Factual DNA/*genetics Evaluation Studies Human Molecular Sequence Data Proteins/*genetics Sensitivity and Specificity Sequence Alignment/*methods/statistics & numerical data Sequence Homology, Amino Acid *Software
Department of Biochemistry, University of Virginia, Charlottesville.
0008004177
Methods Mol Biol 1994;25:365-89
Using Smart Source Parsing

Pearson, W. R. Wood, T. Zhang, Z. Miller, W.
Comparison of DNA sequences with protein sequences
*Amino Acid Sequence Animal *Base Sequence Databases, Factual DNA/*genetics Genes/genetics Genes, Bacterial/genetics Mice Molecular Sequence Data Open Reading Frames/genetics Proteins/*genetics Sequence Alignment/*methods *Software Support, U.S. Gov't, P.H.S.
The FASTA package of sequence comparison programs has been expanded to include FASTX and FASTY, which compare a DNA sequence to a protein sequence database, translating the DNA sequence in three frames and aligning the translated DNA sequence to each sequence in the protein database, allowing gaps and frameshifts. Also new are TFASTX and TFASTY, which compare a protein sequence to a DNA sequence database, translating each sequence in the DNA database in six frames and scoring alignments with gaps and frameshifts. FASTX and TFASTX allow only frameshifts between codons, while FASTY and TFASTY allow substitutions or frameshifts within a codon. We examined the performance of FASTX and FASTY using different gap-opening, gap-extension, frameshift, and nucleotide substitution penalties. In general, FASTX and FASTY perform equivalently when query sequences contain 0-10% errors.
We also evaluated the statistical estimates reported by FASTX and FASTY. These estimates are quite accurate, except when an out-of-frame translation produces a low-complexity protein sequence. We used FASTX to scan the Mycoplasma genitalium, Haemophilus influenzae, and Methanococcus jannaschii genomes for unidentified or misidentified protein-coding genes. We found at least 9 new protein-coding genes in the three genomes and at least 35 genes with potentially incorrect boundaries.
Genomics 1997;46(1):24-36

Pellegrini, M. Marcotte, E. M. Thompson, M. J. Eisenberg, D. Yeates, T. O.
Assigning protein functions by comparative genome analysis: protein phylogenetic profiles
Bacterial Proteins/chemistry/genetics Comparative Study Escherichia coli/*genetics *Evolution, Molecular *Genome *Genome, Bacterial Models, Biological Open Reading Frames *Phylogeny Proteins/*chemistry/genetics Ribosomal Proteins/chemistry Support, Non-U.S. Gov't Support, U.S. Gov't, Non-P.H.S. Support, U.S. Gov't, P.H.S.
Determining protein functions from genomic sequences is a central goal of bioinformatics. We present a method based on the assumption that proteins that function together in a pathway or structural complex are likely to evolve in a correlated fashion. During evolution, all such functionally linked proteins tend to be either preserved or eliminated in a new species. We describe this property of correlated evolution by characterizing each protein by its phylogenetic profile, a string that encodes the presence or absence of a protein in every known genome. We show that proteins having matching or similar profiles strongly tend to be functionally linked.
This method of phylogenetic profiling allows us to predict the function of uncharacterized proteins.
Molecular Biology Institute and Departments of Energy Laboratory of Structural Biology and Molecular Medicine, University of California, Los Angeles, CA 90095-1570, USA.
0010200254
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010200254
http://www.pnas.org/cgi/content/full/96/8/4285
Proc Natl Acad Sci U S A 1999;96(8):4285-8

Read, T. D. Brunham, R. C. Shen, C. Gill, S. R. Heidelberg, J. F. White, O. Hickey, E. K. Peterson, J. Utterback, T. Berry, K. Bass, S. Linher, K. Weidman, J. Khouri, H. Craven, B. Bowman, C. Dodson, R. Gwinn, M. Nelson, W. DeBoy, R. Kolonay, J. McClarty, G. Salzberg, S. L. Eisen, J. Fraser, C. M.
Genome sequences of Chlamydia trachomatis MoPn and Chlamydia pneumoniae AR39
Animal Bacterial Proteins/genetics Bacteriophages/genetics Base Sequence Chlamydia pneumoniae/enzymology/*genetics/pathogenicity/virology Chlamydia trachomatis/enzymology/*genetics/metabolism/pathogenicity Chlamydia Infections/microbiology Comparative Study Conserved Sequence/genetics Evolution, Molecular Genes, Bacterial/genetics Genes, Duplicate/genetics *Genome, Bacterial Human Inversion (Genetics) Mice/microbiology Molecular Sequence Data Nucleotides/metabolism Physical Chromosome Mapping Recombination, Genetic/genetics Replication Origin/genetics Support, U.S. Gov't, P.H.S.
The genome sequences of Chlamydia trachomatis mouse pneumonitis (MoPn) strain Nigg (1 069 412 nt) and Chlamydia pneumoniae strain AR39 (1 229 853 nt) were determined using a random shotgun strategy. The MoPn genome exhibited a general conservation of gene order and content with the previously sequenced C.trachomatis serovar D. Differences between C.trachomatis strains were focused on an approximately 50 kb 'plasticity zone' near the termination origins.
In this region MoPn contained three copies of a novel gene encoding a >3000 amino acid toxin homologous to a predicted toxin from Escherichia coli O157:H7 but had apparently lost the tryptophan biosynthesis genes found in serovar D in this region. The C. pneumoniae AR39 chromosome was >99.9% identical to the previously sequenced C.pneumoniae CWL029 genome, however, comparative analysis identified an invertible DNA segment upstream of the uridine kinase gene which was in different orientations in the two genomes. AR39 also contained a novel 4524 nt circular single-stranded (ss)DNA bacteriophage, the first time a virus has been reported infecting C. pneumoniae. Although the chlamydial genomes were highly conserved, there were intriguing differences in key nucleotide salvage pathways: C.pneumoniae has a uridine kinase gene for dUTP production, MoPn has a uracil phosphoribosyl transferase, while C.trachomatis serovar D contains neither gene. Chromosomal comparison revealed that there had been multiple large inversion events since the species divergence of C.trachomatis and C.pneumoniae, apparently oriented around the axis of the origin of replication and the termination region. The striking synteny of the Chlamydia genomes and prevalence of tandemly duplicated genes are evidence of minimal chromosome rearrangement and foreign gene uptake, presumably owing to the ecological isolation of the obligate intracellular parasites. In the absence of genetic analysis, comparative genomics will continue to provide insight into the virulence mechanisms of these important human pathogens.
The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA.
0010684935
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010684935
http://www.oup.co.uk/nar/Volume_28/Issue_06/gkd254_gml.abs.html
Nucleic Acids Res 2000;28(6):1397-406

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0009199410
Russell, R. B. Saqi, M. A.
Sayle, R. A. Bates, P. A. Sternberg, M. J.
Recognition of analogous and homologous protein folds: analysis of sequence and structure conservation
Computer Simulation Databases, Factual DNA Transposable Elements *Models, Molecular Mutation *Protein Folding Proteins/*chemistry/genetics Sequence Alignment Sequence Analysis/*methods Sequence Deletion *Sequence Homology, Amino Acid
An analysis was performed on 335 pairs of structurally aligned proteins derived from the structural classification of proteins (SCOP http://scop.mrc-lmb.cam.ac.uk/scop/) database. These similarities were divided into analogues, defined as proteins with similar three-dimensional structures (same SCOP fold classification) but generally with different functions and little evidence of a common ancestor (different SCOP superfamily classification). Homologues were defined as pairs of similar structures likely to be the result of evolutionary divergence (same superfamily) and were divided into remote, medium and close sub-divisions based on the percentage sequence identity. Particular attention was paid to the differences between analogues and remote homologues, since both types of similarities are generally undetectable by sequence comparison and their detection is the aim of fold recognition methods. Distributions of sequence identities and substitution matrices suggest a higher degree of sequence similarity in remote homologues than in analogues. Matrices for remote homologues show similarity to existing mutation matrices, providing some validity for their use in previously described fold recognition methods. In contrast, matrices derived from analogous proteins show little conservation of amino acid properties beyond broad conservation of hydrophobic or polar character. Secondary structure and accessibility were more conserved on average in remote homologues than in analogues, though there was no apparent difference in the root-mean-square deviation between these two types of similarities.
Alignments of remote homologues and analogues show a similar number of gaps, openings (one or more sequential gaps) and inserted/deleted secondary structure elements, and both generally contain more gaps/openings/deleted secondary structure elements than medium and close homologues. These results suggest that gap parameters for fold recognition should be more lenient than those used in sequence comparison. Parameters were derived from the analogue and remote homologue datasets for potential use in fold recognition methods. Implications for protein fold recognition and evolution are discussed.
Biomolecular Modelling Laboratory, Imperial Cancer Research Fund, Lincoln's Inn Fields, London, UK.
0009199410
J Mol Biol 1997;269(3):423-39

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0009846869
Sali, A.
100,000 protein structures for the biologist [see comments]
Animal Cloning, Molecular Crystallography, X-Ray Databases, Factual *Genetics, Biochemical/economics Human Internet Nuclear Magnetic Resonance Peptide Library *Protein Conformation Proteins/genetics
Structural genomics promises to deliver experimentally determined three-dimensional structures for many thousands of protein domains. These domains will be carefully selected, so that the methods of fold assignment and comparative protein structure modeling will result in useful models for most other protein sequences. The impact on biology will be dramatic.
Laboratories of Molecular Biophysics, The Rockefeller University, New York, New York 10021, USA. sali@rockefeller.edu
Comment in: Nat Struct Biol 1998 Dec;5(12):1019-20
0009846869
Nat Struct Biol 1998;5(12):1029-32

Sanchez, R. Sali, A.
Large-scale protein structure modeling of the Saccharomyces cerevisiae genome
Fungal Proteins/*chemistry *Genome, Fungal *Models, Molecular *Protein Conformation Saccharomyces cerevisiae/*genetics Support, U.S. Gov't, Non-P.H.S. Support, U.S.
Gov't, P.H.S.
The function of a protein generally is determined by its three-dimensional (3D) structure. Thus, it would be useful to know the 3D structure of the thousands of protein sequences that are emerging from the many genome projects. To this end, fold assignment, comparative protein structure modeling, and model evaluation were automated completely. As an illustration, the method was applied to the proteins in the Saccharomyces cerevisiae (baker's yeast) genome. It resulted in all-atom 3D models for substantial segments of 1,071 (17%) of the yeast proteins, only 40 of which have had their 3D structure determined experimentally. Of the 1,071 modeled yeast proteins, 236 were related clearly to a protein of known structure for the first time; 41 of these previously have not been characterized at all.
Laboratories of Molecular Biophysics, The Rockefeller University, 1230 York Avenue, New York, NY 10021, USA.
0009811845
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0009811845
http://www.pnas.org/cgi/content/full/95/23/13597
Proc Natl Acad Sci U S A 1998;95(23):13597-602

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0009823893
Andersson, S. G. Zomorodipour, A. Andersson, J. O. Sicheritz-Ponten, T. Alsmark, U. C. Podowski, R. M. Naslund, A. K. Eriksson, A. S. Winkler, H. H. Kurland, C. G.
The genome sequence of Rickettsia prowazekii and the origin of mitochondria [see comments]
DNA Replication DNA, Bacterial DNA, Mitochondrial *Evolution, Molecular *Genome, Bacterial Membrane Proteins/genetics Mitochondria/*genetics Recombination, Genetic Regulatory Sequences, Nucleic Acid Repetitive Sequences, Nucleic Acid Replication Origin Rickettsia prowazekii/*genetics/pathogenicity Support, Non-U.S.
Gov't; Transcription, Genetic; Translation, Genetic; Virulence/genetics
We describe here the complete genome sequence (1,111,523 base pairs) of the obligate intracellular parasite Rickettsia prowazekii, the causative agent of epidemic typhus. This genome contains 834 protein-coding genes. The functional profiles of these genes show similarities to those of mitochondrial genes: no genes required for anaerobic glycolysis are found in either R. prowazekii or mitochondrial genomes, but a complete set of genes encoding components of the tricarboxylic acid cycle and the respiratory-chain complex is found in R. prowazekii. In effect, ATP production in Rickettsia is the same as that in mitochondria. Many genes involved in the biosynthesis and regulation of biosynthesis of amino acids and nucleosides in free-living bacteria are absent from R. prowazekii and mitochondria. Such genes seem to have been replaced by homologues in the nuclear (host) genome. The R. prowazekii genome contains the highest proportion of non-coding DNA (24%) detected so far in a microbial genome. Such non-coding sequences may be degraded remnants of 'neutralized' genes that await elimination from the genome. Phylogenetic analyses indicate that R. prowazekii is more closely related to mitochondria than is any other microbe studied so far.
Department of Molecular Biology, University of Uppsala, Sweden.
Comment in: Nature 1998 Nov 12;396(6707):109-10
0009823893
Nature 1998;396(6707):133-40

Andersson, J. O.; Andersson, S. G.
How the worm was won. The C. elegans genome sequencing project
Animal; Caenorhabditis elegans/*genetics; Chromosome Mapping/*history/methods; Chromosomes/genetics/ultrastructure; Chromosomes, Yeast Artificial; Cosmids; England; Expressed Sequence Tags; *Genes, Helminth; Genetics, Biochemical/*history/organization & administration; *Genome; History of Medicine, 20th Cent.; Human Genome Project; Missouri; Polymerase Chain Reaction; Sequence Analysis, DNA/*history/methods; Support, Non-U.S.
Gov't; Support, U.S. Gov't, P.H.S.
The genome sequence of the free-living nematode Caenorhabditis elegans is nearly complete, with resolution of the final difficult regions expected over the next few months. This will represent the first genome of a multicellular organism to be sequenced to completion. The genome is approximately 97 Mb in total, and encodes more than 19,099 proteins, considerably more than expected before sequencing began. The sequencing project--a collaboration between the Genome Sequencing Center in St Louis and the Sanger Centre in Hinxton--has lasted eight years, with the majority of the sequence generated in the past four years. Analysis of the genome sequence is just beginning and represents an effort that will undoubtedly last more than another decade. However, some interesting findings are already apparent, indicating that the scope of the project, the approach taken, and the usefulness of having the genetic blueprint for this small organism have been well worth the effort.
Washington University Genome Sequencing Center, St Louis, MO 63108, USA. rwilson@watson.wustl.edu
0010098407
Trends Genet 1999;15(2):51-8

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010704319
Wilson, C. A.; Kreychman, J.; Gerstein, M.
Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores
Animal; *Computational Biology; Conserved Sequence/genetics; Databases, Factual; Drosophila melanogaster; Enzymes/chemistry/classification/genetics/metabolism; *Genome; Internet; Molecular Weight; Probability; Protein Folding; Protein Structure, Secondary; Protein Structure, Tertiary; Proteins/*chemistry/classification/genetics/*metabolism; Reproducibility of Results; Sequence Alignment; Software; Structure-Activity Relationship; Support, Non-U.S.
Gov't
Measuring in a quantitative, statistical sense the degree to which structural and functional information can be "transferred" between pairs of related protein sequences at various levels of similarity is an essential prerequisite for robust genome annotation. To this end, we performed pairwise sequence, structure and function comparisons on approximately 30,000 pairs of protein domains with known structure and function. Our domain pairs, which are constructed according to the SCOP fold classification, range in similarity from just sharing a fold, to being nearly identical. Our results show that traditional scores for sequence and structure similarity have the same basic exponential relationship as observed previously, with structural divergence, measured in RMS, being exponentially related to sequence divergence, measured in percent identity. However, as the scale of our survey is much larger than any previous investigations, our results have greater statistical weight and precision. We have been able to express the relationship of sequence and structure similarity using more "modern scores," such as Smith-Waterman alignment scores and probabilistic P-values for both sequence and structure comparison. These modern scores address some of the problems with traditional scores, such as determining a conserved core and correcting for length dependency; they enable us to phrase the sequence-structure relationship in more precise and accurate terms. We found that the basic exponential sequence-structure relationship is very general: the same essential relationship is found in the different secondary-structure classes and is evident in all the scoring schemes. To relate function to sequence and structure we assigned various levels of functional similarity to the domain pairs, based on a simple functional classification scheme.
This scheme was constructed by combining and augmenting annotations in the enzyme and fly functional classifications and comparing subsets of these to the Escherichia coli and yeast classifications. We found sigmoidal relationships between similarity in function and sequence, with clear thresholds for different levels of functional conservation. For pairs of domains that share the same fold, precise function appears to be conserved down to approximately 40% sequence identity, whereas broad functional class is conserved to approximately 25%. Interestingly, percent identity is more effective at quantifying functional conservation than the more modern scores (e.g. P-values). Results of all the pairwise comparisons and our combined functional classification scheme for protein structures can be accessed from a web database at http://bioinfo.mbb.yale.edu/align Copyright 2000 Academic Press.
Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA.
0010704319
J Mol Biol 2000;297(1):233-49

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0002439888
Woese, C. R.
Bacterial evolution
*Bacteria/classification/genetics; Base Sequence; *Evolution; Phylogeny; RNA, Bacterial; RNA, Ribosomal; Support, U.S. Gov't, Non-P.H.S.
0002439888
Microbiol Rev 1987;51(2):221-71

Wolf, Y. I.; Brenner, S. E.; Bash, P. A.; Koonin, E. V.
Distribution of protein folds in the three superkingdoms of life
Algorithms; Archaea/*chemistry; Bacteria/*chemistry; Computational Biology; Eukaryotic Cells/*chemistry; *Protein Folding; *Statistical Distributions
A sensitive protein-fold recognition procedure was developed on the basis of iterative database search using the PSI-BLAST program. A collection of 1193 position-dependent weight matrices that can be used as fold identifiers was produced.
In the completely sequenced genomes, folds could be automatically identified for 20%-30% of the proteins, with 3%-6% more detectable by additional analysis of conserved motifs. The distribution of the most common folds is very similar in bacteria and archaea but distinct in eukaryotes. Within the bacteria, this distribution differs between parasitic and free-living species. In all analyzed genomes, the P-loop NTPases are the most abundant fold. In bacteria and archaea, the next most common folds are ferredoxin-like domains, TIM-barrels, and methyltransferases, whereas in eukaryotes, the second to fourth places belong to protein kinases, beta-propellers and TIM-barrels. The observed diversity of protein folds in different proteomes is approximately twice as high as it would be expected from a simple stochastic model describing a proteome as a finite sample from an infinite pool of proteins with an exponential distribution of the fold fractions. Distribution of the number of domains with different folds in one protein fits the geometric model, which is compatible with the evolution of multidomain proteins by random combination of domains. [Fold predictions for proteins from 14 proteomes are available on the World Wide Web at. The FIDs are available by anonymous ftp at the same location.]
Genome Res 1999;9(1):17-26

Wolf, Y. I.; Aravind, L.; Koonin, E. V.
Rickettsiae and Chlamydiae: evidence of horizontal gene transfer and gene exchange
Adenine Nucleotide Translocase/genetics; Chlamydia/enzymology/*genetics; Gene Transfer; Genes, Bacterial; Phylogeny; Rickettsia/*chemistry/enzymology
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
wolf@ncbi.nlm.nih.gov
0010322483
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010322483
http://www.biomednet.com/library/fulltext/TIG.tigs0599_01689525_v0015i05_00001704
http://www.biomednet.com/library/abstract/TIG.tigs0599_01689525_v0015i05_00001704
Trends Genet 1999;15(5):173-5

http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0010329133
Hegyi, H.; Gerstein, M.
The relationship between protein structure and function: a comprehensive survey with application to the yeast genome
Comparative Study; Enzymes/chemistry; Evolution, Molecular; Fungal Proteins/*chemistry/genetics; Genome, Fungal; Models, Molecular; *Protein Conformation; Protein Folding; Proteins/classification; Saccharomyces cerevisiae/chemistry/genetics; Sequence Homology, Amino Acid; *Structure-Activity Relationship; Support, Non-U.S. Gov't; Support, U.S. Gov't, Non-P.H.S.
For most proteins in the genome databases, function is predicted via sequence comparison. In spite of the popularity of this approach, the extent to which it can be reliably applied is unknown. We address this issue by systematically investigating the relationship between protein function and structure. We focus initially on enzymes functionally classified by the Enzyme Commission (EC) and relate these to the structurally classified domains in the SCOP database. We find that the major SCOP fold classes have different propensities to carry out certain broad categories of functions. For instance, alpha/beta folds are disproportionately associated with enzymes, especially transferases and hydrolases, and all-alpha and small folds with non-enzymes, while alpha+beta folds have an equal tendency either way. These observations for the database overall are largely true for specific genomes. We focus, in particular, on yeast, analyzing it with many classifications in addition to SCOP and EC (i.e.
COGs, CATH, MIPS), and find clear tendencies for fold-function association, across a broad spectrum of functions. Analysis with the COGs scheme also suggests that the functions of the most ancient proteins are more evenly distributed among different structural classes than those of more modern ones. For the database overall, we identify the most versatile functions, i.e. those that are associated with the most folds, and the most versatile folds, associated with the most functions. The two most versatile enzymatic functions (hydro-lyases and O-glycosyl glucosidases) are associated with seven folds each. The five most versatile folds (TIM-barrel, Rossmann, ferredoxin, alpha-beta hydrolase, and P-loop NTP hydrolase) are all mixed alpha-beta structures. They stand out as generic scaffolds, accommodating from six to as many as 16 functions (for the exceptional TIM-barrel). At the conclusion of our analysis we are able to construct a graph giving the chance that a functional annotation can be reliably transferred at different degrees of sequence and structural similarity. Supplemental information is available from http://bioinfo.mbb.yale.edu/genome/foldfunc++ +. Copyright 1999 Academic Press.
Department of Molecular Biophysics & Biochemistry Yale University, 266 Whitney Avenue, New Haven, CT 06520, USA.
0010329133
J Mol Biol 1999;288(1):147-64

Himmelreich, R.; Hilbert, H.; Plagens, H.; Pirkl, E.; Li, B. C.; Herrmann, R.
Complete sequence analysis of the genome of the bacterium Mycoplasma pneumoniae
Base Sequence; DNA Repair; DNA Replication; DNA, Bacterial/biosynthesis/*chemistry; Gene Expression Regulation, Bacterial; *Genome, Bacterial; Molecular Sequence Data; Molecular Weight; Mycoplasma pneumoniae/*genetics; Open Reading Frames; Support, Non-U.S. Gov't; Transcription, Genetic; Translation, Genetic
The entire genome of the bacterium Mycoplasma pneumoniae M129 has been sequenced. It has a size of 816,394 base pairs with an average G+C content of 40.0 mol%. We predict 677 open reading frames (ORFs) and 39 genes coding for various RNA species. Of the predicted ORFs, 75.9% showed significant similarity to genes/proteins of other organisms while only 9.9% did not reveal any significant similarity to gene sequences in databases. This permitted us tentatively to assign a functional classification to a large number of ORFs and to deduce the biochemical and physiological properties of this bacterium. The reduction of the genome size of M. pneumoniae during its reductive evolution from ancestral bacteria can be explained by the loss of complete anabolic (e.g. no amino acid synthesis) and metabolic pathways. Therefore, M. pneumoniae depends in nature on an obligate parasitic lifestyle which requires the provision of exogenous essential metabolites. All the major classes of cellular processes and metabolic pathways are briefly described.
For a number of activities/functions present in M. pneumoniae according to experimental evidence, the corresponding genes could not be identified by similarity search. For instance we failed to identify genes/proteins involved in motility, chemotaxis and management of oxidative stress.
Zentrum fur Molekulare Biologie Heidelberg, Mikrobiologie, Universitat Heidelberg, Germany.
0008948633
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=0008948633
http://www.oup.co.uk/nar/Volume_24/Issue_22/6b0244_gml.abs.html
Nucleic Acids Res 1996;24(22):4420-49