Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, CT 06520, USA; 1 Department of Molecular Biophysics and Biochemistry, Yale University, 266 Whitney Avenue, New Haven, CT 06520, USA; 2 Center for Advanced Biotechnology and Medicine and Department of Molecular Biology and Biochemistry, Rutgers University, Piscataway, NJ 08854, USA; and 3 Ontario Cancer Institute and Department of Medical Biophysics, University of Toronto, Ontario M5G 2M9, Canada
Received January 5, 2001; Revised and Accepted April 23, 2001.
INTRODUCTION
Many biological databases are developed and maintained strictly for warehousing purposes, without consideration of the analyses that may be performed on the data. Conversely, computational studies are often performed outside the context of information management, without a clear connection to biological reality. Our work explores a fusion of these two processes, where database design is influenced by analytical requirements.
Such an undertaking requires a centralized repository to integrate and manage the data generated, coupled with strategies for subsequent computational analysis. By maintaining a shared infrastructure accessible to all the participating members of a project, distributed access to large subsets of data is possible. This not only promotes collaborative effort among investigators by providing a common information exchange platform, but also avoids costly and time-consuming duplication of experimental work. Further, data is maintained in a consistent format across many laboratories and investigators, promoting further analysis.
To this end we have developed the SPINE database and analysis system, an integrated approach to interactive database system design and computational analysis in a distributed framework, using the recently formed Northeast Structural Genomics Consortium (www.nesg.org) as a model for multi-laboratory collaborative research. The system is designed to generate standardized data files from user-definable subsets of the proteomics information entered into the database, which are then used for classification tasks. Key issues in effective data mining are introduced, with emphasis on decision trees, a supervised machine learning approach. We conclude with a discussion of prediction results for several macromolecular properties based on features derived from the database contents.
DATABASE SYSTEM REQUIREMENTS
An obvious candidate for the fundamental database unit is the protein. However, in certain instances homologous proteins from other organisms prove more experimentally tractable than the actual target; this scenario would be a source of confusion when maintaining a resource based on proteins. An alternative is to focus on the expression construct made for a given protein. Multiple constructs can be made for a single protein, because a construct could be designed to express only a single domain from a complex protein or contain a slightly altered protein sequence that aids in protein production and structure determination. This one-to-many relationship between target proteins and their associated expression constructs would imply that several database entries might be related to the same target. A third option is to use the specific preparation associated with each experiment, where a database record could represent a set of conditions by which a protein sample is prepared. An immediate concern with this representation is that protein preparations will vary constantly, requiring an unforeseeable number of relational tables to accommodate their parameters.
Because multiple constructs can be generated for each target, a representation based on individual proteins is too limited for our purposes. Conversely, experimental conditions for individual protein preparations are highly variable, so this information is best compiled separately. Of these candidates, the expression construct captures the most appropriate level of detail for this project. It was therefore selected as the basic unit tracked by the database, essentially recording the best experimental results for the expression, purification and characterization of each target protein.
To address these requirements, software components were developed for entry and updating of expression construct records, database searching, bulk data retrieval and tracking the global progress of the entire project. Intuitive HTML form-based interfaces were implemented to facilitate distributed Internet access from participating laboratories. The implementation of the database system is described in detail in Figure 2. The modular organization of software components permits relatively straightforward implementation of additional functionality. This aspect is independent of the underlying database architecture, allowing a great deal of programming flexibility while maintaining strict compliance with the standardized data types established for various experimental parameters.
To accommodate the needs of various Consortium projects where different experimental methodologies are used, principal investigators from several laboratories were involved in the process of selecting the most appropriate information to be tracked by the system. Fields from existing data sets were used to develop a consensus of experimental parameters and this was adapted to the current database framework.
A listing of the fields used for the prototype database is shown in Table 1. The information maintained by the system initially comprised 63 fields covering protein sequences, cloning parameters, expression level, purification yield and data derived from biophysical characterization and structural biology experiments (oligomerization, CD, HSQC, NMR and X-ray crystallography). In addition, a number of fields are devoted to keeping track of the laboratory and investigator responsible for working with the target protein, dates when experiments were performed, comments relating to experimental conditions for each group of related fields and variable access levels for individual database records. The database is not intended to manage all aspects of experimental research; rather, it is designed to standardize and track key parameters related to structural proteomics. However, the system does include user accounts, transaction history information and some laboratory management tables.
For many experimental methods a data file is generated comprising an entire set of results distinct from the information tracked by the main database. The inclusion of these parameters into the existing infrastructure would be beyond the scope of the system; HSQC spectra, X-ray diffraction data and NMR assignments can span large files that would be impractical to incorporate directly into database tables. Instead, these are stored on a separate file server and the associated URL addresses are recorded in construct records and linked to each record display. Thus, a key feature of the system is maintaining a central collection of pointers to additional experimental data sets. This mechanism is, of course, extended to allow pointers into other information repositories associated with the project, for instance into a crystallization database or a list of targets. We also link the system with other protein sequence and structure resources, such as SWISS-PROT (5), PartsList (16), GeneCensus (17,18), ProtoMap (21), SCOP (22) and CATH (23).
User interaction and dynamic content modification
The design of the system's front end allows expression construct records to be entered, edited and retrieved by individual users without frequent intervention of a database curator. An important goal in this work is to design a system that works in a practical laboratory setting, i.e. the software is operationally robust and straightforward, so that using it on a regular basis will not disrupt work flow. The system provides a consistent and intuitive user interface to complex database functions, as well as error recovery features when conflicting or incomplete information is submitted. Search functions were developed for the intelligent retrieval and display of information from the database, as well as the ability to generate bulk dumps of large subsets of data records and protein sequences in interchangeable file formats, including CSV and XML.
As experimental work progresses on a given target, additional data is collected which may have been unavailable at the time its expression construct record was created. Therefore, an essential requirement of the database is the ability to recall records to alter or augment their associated information. Consequently, the contents of individual database records are changing over time in a user-mediated fashion, in contrast to more archive-oriented resources. This imposes additional sets of operational considerations, requiring provisions to ensure internal ID consistency and overwrite protection when users enter or modify database records.
SYSTEM FUNCTIONALITY
Searching the database and visualizing progress
The retrieval of records from the database is accomplished through the use of a search engine interface (Fig. 5A), where a variety of terms may be selected and combined with Boolean connectives. Based on the values of the elements submitted via the interface form, the software builds an SQL query to execute against the database and returns any records matching the search terms (Fig. 5B). The subset of database records returned by the search may be optionally downloaded as a CSV formatted text file, suitable for importing into another database or spreadsheet program. Individual expression construct records are displayed in a static web page (Fig. 5C), with database fields organized in a logical hierarchy. A number of local and distributed Internet resources are automatically linked to record display pages, such as Protein Data Bank searching, organism-specific databases and specialized structural annotation reports.
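As an illustration of this step, the minimal Python sketch below assembles a parameterized SQL statement from form terms joined by a Boolean connective; the table and field names are hypothetical stand-ins, not the system's actual schema or database engine.

```python
def build_query(terms, connective="AND"):
    """Build a parameterized SQL statement from (field, operator, value)
    search terms joined by a Boolean connective (AND/OR)."""
    allowed_fields = {"organism", "expression_level", "solubility"}  # hypothetical schema
    allowed_ops = {"=", "<", ">", "LIKE"}
    clauses, params = [], []
    for field, op, value in terms:
        if field not in allowed_fields or op not in allowed_ops:
            raise ValueError(f"disallowed search term: {field} {op}")
        clauses.append(f"{field} {op} ?")  # placeholders keep user values out of the SQL text
        params.append(value)
    sql = "SELECT * FROM constructs WHERE " + f" {connective} ".join(clauses)
    return sql, params

sql, params = build_query([("organism", "=", "M. thermoautotrophicum"),
                           ("solubility", "=", "soluble")])
print(sql)  # SELECT * FROM constructs WHERE organism = ? AND solubility = ?
```

Validating field names against a fixed whitelist and binding values through placeholders keeps a form-driven query builder safe against injection while still allowing arbitrary Boolean combinations.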
DATA MINING APPLICATIONS FOR HIGH-THROUGHPUT PROTEOMICS
Machine learning concepts
The term machine learning applies to a wide range of computational methodologies. However, the models most suitable for our applications belong to the class of algorithms that employ supervised learning. Under supervised learning the classification process consists of two phases: training and testing. The set of all available examples or instances (formally termed input vectors) is divided into two non-intersecting sets. The first set is used to train the model. During this phase correct classification of the examples is known a priori. Supervised learning strategies rely on this information to adjust the performance of the model until the classification error rate is sufficiently reduced. Learning is no longer performed after training is completed; instead, unseen instances in the test set are classified according to the partitioning established during the training phase. The performance of a learning algorithm is determined by its ability to correctly classify new instances not present in the initial training set.
The features of each sample can be represented as a vector that corresponds to a point in an n-dimensional space. Classification is then performed by partitioning this feature space into regions, where most of the points in a region correspond to a particular category. The goal in training classifiers is to find an optimal partitioning of the input space separating the highest number of disparate examples. An ideal classifier will demonstrate strong predictive power, while explaining the relationships between the variable to be predicted and the variables comprising the feature space.
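A minimal sketch of this two-phase scheme follows, assuming present-day scikit-learn as a stand-in for the tooling actually used and synthetic data in place of real feature vectors.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((562, 42))        # synthetic stand-in: 562 instances x 42 features
y = (X[:, 0] > 0.5).astype(int)  # synthetic class labels, known a priori

# Training phase: the model is adjusted on examples whose classes are known
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Testing phase: no further learning; unseen instances are classified and
# accuracy on them measures generalization
print("test accuracy:", clf.score(X_test, y_test))
```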
Machine learning applications to proteomics data
One constraint on a proteomics feature set is the time frame in which classifications are performed. In many cases the experimental results are serially related, constraining the composition of useful training sets to expression constructs for which the property to be predicted has already been measured. For example, one cannot expect to optimally classify crystallization targets if the available training set contains experimental results only up to the expression stage, because the available feature set will not contain a response variable for crystallization. Conversely, it is entirely possible to classify proteins based on some property corresponding to an earlier experimental stage, e.g. solubility data has already been gathered for proteins having HSQC spectra, enabling one to train a classifier to partition these proteins based on solubility information.
While there are many possible issues that data mining can address in relation to the proteomics data collected by the Consortium, we have focused on protein solubility prediction due to the importance of this property and the availability of a large set of Methanobacterium thermoautotrophicum expression construct records having solubility measurements. The size of this data set provides the best opportunity for generalization during training, increasing an algorithm's prediction success when presented with new examples. An accurate prediction method for this property can also be an extremely useful tool, as insolubility accounts for almost 60% of experimentally recalcitrant proteins (24). Here 'solubility' means soluble in the cell extract, a property that is correlated with, but not necessarily identical to, the solubility of a purified protein.
In a supervised learning approach for solubility prediction the training set consists of a subset of input vectors extracted from the database and is used by the classifier model to partition the sample space based on solubility, the dependent variable to be predicted. After training the feature space will be partitioned into two regions: one containing proteins labeled as soluble and another with proteins labeled as insoluble. The second part of this approach is to determine a trained classifier's ability to generalize to unseen examples, by presenting the model with a test set containing new feature vectors and re-evaluating its performance.
Methanobacterium thermoautotrophicum data set
A data set comprising 562 proteins from the M.thermoautotrophicum genome was compiled from the database and used for machine learning. Although SPINE currently holds 740 construct entries for this organism, 178 of these do not have solubility information and thus are not suitable for classification. As summarized in Table 2, a total of 42 features were extracted from the corresponding protein sequences, such as amino acid composition, hydrophobicity, occurrence of low complexity regions, secondary structure, etc. Combined with the database fields highlighted in Table 1, these features comprise the input vector used for the classification study presented here.
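For illustration, the sequence-derived portion of such an input vector might be computed along the following lines; this sketch covers only composition-type features and a few combined residue classes, not the full Table 2 set.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def sequence_features(seq):
    """Fractional amino acid composition plus combined-class compositions
    of the kind listed in Table 2 (a subset, for illustration only)."""
    seq = seq.upper()
    counts = Counter(seq)
    n = len(seq)
    feats = {aa: counts[aa] / n for aa in AMINO_ACIDS}
    feats["DE"] = feats["D"] + feats["E"]                  # acidic residues
    feats["KR"] = feats["K"] + feats["R"]                  # basic residues
    feats["FYW"] = feats["F"] + feats["Y"] + feats["W"]    # aromatic residues
    feats["DENQ"] = feats["DE"] + feats["N"] + feats["Q"]  # acidic + amides
    return feats

print(sequence_features("MDEKLLYITAE")["DE"])  # fraction of D+E in a toy sequence
```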
It should be noted that prediction results for a proteomics data set may exhibit some degree of specificity to the expression vectors and experimental conditions of cell growth, induction, etc. used for protein production. A characteristic of this specific M.thermoautotrophicum data is the uniform set of conditions that were used to prepare protein samples (26). Additionally, the experimental targets selected by the Consortium consist largely of non-membrane proteins, so the available data set is biased in this regard.
DECISION TREE ANALYSIS
Model description
Decision tree learning (25,26) is a widely used and effective method that can partition data that is not linearly separable (Fig. 6). An individual object of unknown type may be classified by traversing the tree. At each internal node a test is performed on the object's value for the feature expressed at that node (often called the splitting variable, e.g. alanine composition). Based on this value, the appropriate branch is followed to the next node. This procedure continues until a leaf node is reached and the object's classification is determined. In classifying a given object a variable number of evaluations may be performed or omitted, depending on the path taken when the tree is traversed. In this manner a heuristic search is performed to find a compact, consistent solution that generalizes to unseen examples.
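The traversal itself can be stated compactly. In the sketch below, internal nodes are dicts holding a splitting variable and threshold and leaves hold class labels; the two-level tree shown is hypothetical, merely echoing the kinds of splits discussed later.

```python
def classify(tree, x):
    """Classify an object x (a dict of feature values) by traversing a
    decision tree represented as nested dicts."""
    node = tree
    while "label" not in node:  # descend until a leaf is reached
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

# Hypothetical two-level tree, for illustration only
tree = {"feature": "DE", "threshold": 0.18,
        "left":  {"label": "insoluble"},
        "right": {"feature": "hydrophobic_stretch", "threshold": 0.5,
                  "left":  {"label": "soluble"},
                  "right": {"label": "insoluble"}}}

print(classify(tree, {"DE": 0.21, "hydrophobic_stretch": 0.0}))  # -> soluble
```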
A number of advantages are evident in the decision tree model. Classification can be based on an arbitrary mixture of symbolic and numeric variables and (for axis-parallel splitting) one is not required to scale the variables relative to each other. The model is generally robust when presented with missing values. In addition, straightforward and concise rules can be inferred from the tree by following the path from root to leaf nodes.
Feature selection
We used decision trees to partition the 562 M.thermoautotrophicum proteins into soluble and insoluble classes, based on a subset of the features listed in Table 2. The features that are relevant to a given problem domain are often unknown a priori and removing those which are redundant or irrelevant can produce a simpler model which generalizes better to unseen examples. Automated feature selection attempts to find a minimal subset of the available features, in order to either improve classification performance or to simplify the model's structure while preserving prediction accuracy (27). Typically, a search algorithm is used to partition the available feature set. Classifiers are then trained on the feature combinations presented by the search algorithm to identify those features which have the greatest impact on learning. In our case we used a genetic algorithm (28) to search the space of possible feature combinations; the relevance of individual feature subsets was estimated with several machine learning methods, including decision trees and support vector machines (29). We arrived at a feature subset consisting of the amino acids E, I, T and Y, combined compositions of basic (KR), acidic (DE) and aromatic (FYW) residues, the acidic residues with their amides (DENQ), the presence of signal sequences and hydrophobic regions, secondary structure features and low complexity elements. These are highlighted in Table 2.
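A toy rendering of this wrapper approach is sketched below: bitmask chromosomes encode feature subsets and the fitness of a subset is the cross-validated accuracy of a decision tree restricted to those features. It illustrates the general technique only; the actual search procedure and parameters used by the Consortium are not specified here.

```python
import random
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def ga_feature_search(X, y, n_gen=20, pop_size=30, seed=0):
    """Genetic search over feature-subset bitmasks; fitness is the
    cross-validated accuracy of a tree trained on that subset."""
    rng = random.Random(seed)
    n = X.shape[1]

    def fitness(mask):
        cols = [i for i in range(n) if mask[i]]
        if not cols:
            return 0.0
        tree = DecisionTreeClassifier(random_state=0)
        return cross_val_score(tree, X[:, cols], y, cv=3).mean()

    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(n_gen):
        parents = sorted(pop, key=fitness, reverse=True)[:pop_size // 2]  # selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)        # one-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(n)] ^= 1     # point mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

X = np.random.default_rng(0).random((200, 10))
y = (X[:, 2] + X[:, 7] > 1).astype(int)  # two informative features
print(ga_feature_search(X, y))           # the mask should favor features 2 and 7
```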
Decision tree results
The trees that were trained on this data set had a misclassification rate of 12%. Decision trees built on the training set are always overly optimistic and contain a large number of nodes. Only the upper region of the tree is significant in terms of yielding a generalized concept and this is the segment from which useful rules can be derived. After training and pruning of the decision trees we extracted several classification rules for distinguishing between soluble and insoluble proteins, as described in Figure 7. Two trees are shown in this example. Figure 7A illustrates the upper five levels of a decision tree built on the entire set of 562 proteins and subjected to cross-validation. The tree in Figure 7B was trained on a 375 protein subset of the data and tested with the remaining 187.
The decision tree depicted in Figure 7B further isolates the two most discriminating features: acidic residue composition and the presence of a hydrophobic stretch. Aside from their metal ion binding abilities, aspartic and glutamic acid are negatively charged due to their carboxyl side chains. These highly polar residues are often found on the surface of globular proteins, where they can interact favorably with solvent molecules. They have, in fact, the highest charge density per atom of all the amino acids, a property obviously associated with solubility. The hydrophobic region identified is not long enough to be considered a transmembrane helix, but clearly identifies an adhesive area of the protein.
Decision tree learning produces a variety of tree topologies depending on the specific data and features used for training. We divided the 562 protein data set into random training and testing sets of 66 and 33% of the input vectors, respectively, and built decision trees using all of the available features. A number of interesting patterns emerge based on the utilization of classification features in various trees. Examining the decision tree paths reveals intricate sorting based on amino acid composition in addition to the most widely used features. For example, a rule was discovered which selects soluble proteins having >18% DE composition, <8% arginine and >3% lysine residues. Another tree exhibited similar prediction success by combining arginine and lysine into a common splitting variable immediately following the 18% DE rule, identifying soluble proteins having <14% KR composition. However, aspartic and glutamic acids were then isolated in lower levels of the tree, achieving a finer partitioning by sorting on the individual acidic residues.
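The first rule quoted above is easily restated as a plain predicate over composition fractions, which is exactly what makes such rules portable:

```python
def rule_soluble(comp):
    """Decision path quoted in the text: >18% D+E, <8% R, >3% K -> soluble."""
    return (comp["D"] + comp["E"] > 0.18
            and comp["R"] < 0.08
            and comp["K"] > 0.03)

print(rule_soluble({"D": 0.10, "E": 0.10, "R": 0.05, "K": 0.06}))  # True
```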
Cross-validation
Overfitting can occur when a model performs well on the training set but fails to generalize to unseen examples. In these instances the algorithm has partitioned the data too finely and has mapped a decision surface to the training data that too closely follows intricacies in the feature space without extracting the underlying trends, essentially memorizing the training set. In practice we can say that if an alternative learning solution exists with a higher error rate but generalizes better over all available input vectors, overfitting has occurred. One way of studying (and hence subsequently preventing) overfitting is cross-validation, which gives an estimate of the accuracy of a classifier based on re-sampling.
Stratified 10-fold cross-validation was performed on the decision trees, where each successive application of the learning procedure used a different 90% of the data set for training and the remaining 10% for testing. Each of these training sets produced different trees from those constructed based on the entire data set. Using the testing sets for validation with their corresponding tree models, we took the sum of the number of incorrect classifications obtained from each of the 10 test subsets and divided that sum by the total number of instances in the data set, producing the estimated error for the entire tree. This cross-validation approach resulted in an overall prediction success of 61–65% over the various data subsets. This does not correspond directly to the decision tree performance based on the entire data set, as cross-validation results are produced from many different partitions of the training and testing sets.
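The pooled error computation described above can be written as the following sketch (again assuming scikit-learn in place of the original tooling):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def cv_error(X, y, folds=10):
    """Sum misclassifications over stratified folds, then divide by the
    total number of instances, as described in the text."""
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=0)
    wrong = 0
    for train_idx, test_idx in skf.split(X, y):
        clf = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
        wrong += int((clf.predict(X[test_idx]) != y[test_idx]).sum())
    return wrong / len(y)
```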
While typically used for error estimation, cross-validation is not optimal for medium sized or mesoscale data sets, such as our proteomics set. This is because the procedure excludes a large fraction of the data during training, resulting in insufficiently sized testing sets. Consequently, other non-cross-validated estimates of classification error have been developed. In the next section we apply one such method, called pessimistic error estimation (25).
Rule assessment
Regardless of the specific method of error estimation used, some paths, i.e. sequences of rules, within the decision tree may perform significantly better than others. These rules provide a straightforward way for others to apply the classification results in a practical context. Moreover, a few simple rules extracted from the tree may be considerably more robust to changes in the underlying data than the original tree topology. Consequently, we describe in this section a way to measure the quality of a particular rule, in contrast to the overall estimate of a tree's performance reported above.
For this rule assessment we do not perform cross-validation at all, due to the scarcity of the data underlying any particular rule. Instead, we use Quinlan's pessimistic estimation method, calculating a rule's accuracy over the training examples to which it applies and then calculating the standard deviation in this estimated accuracy assuming a binomial distribution. More specifically, given the set C of training cases at node Q, its majority class and the number of cases outside that class, error-based pruning interprets C as a binomially distributed sample with well-defined confidence limits and estimates the error rate at Q as the upper limit on its posterior probability distribution. Equivalently, for a given confidence level the lower bound estimate of the success rate is then taken as the measure of rule performance.
The default accuracy of choosing a soluble protein in our data set is defined by S/T, where S is the number of soluble proteins and T is the total number of proteins. The accuracy of the rule that predicts solubility is s/t, where s is the number of soluble proteins reaching the leaf node at the end of a decision path and t is the total number of proteins reaching that node. It is straightforward to evaluate the probability that a randomly chosen rule will do as well as or better than a decision rule with accuracy s/t. This probability is given by:

P = \sum_{i=s}^{\min(t,S)} \frac{\binom{S}{i}\binom{T-S}{t-i}}{\binom{T}{t}}
Note that the sum is over the hypergeometric distribution. Small values of this measure correspond to good rules because this means there is a small probability that a rule has arisen by chance.
For example, the branching of the tree at the root, based on the condition that the overall composition of aspartate and glutamate in protein sequences is >18%, defines a rule which classifies many proteins as soluble. This rule has an observed accuracy of 108/136 (0.79) over the training set and a probability of 6 × 10^−9 of arising by chance. We must take into account the fact that the observed accuracy is overly optimistic and correct it by subtracting 1.96 times the binomial standard deviation (for the lower bound of a 95% confidence interval). For t > 30 the binomial distribution can be approximated by the normal distribution.
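Using the numbers quoted in the text (the 108/136 rule, with 330 soluble proteins among the 562 in the data set), this chance probability can be checked directly from the hypergeometric tail; scipy is assumed here for the distribution.

```python
from scipy.stats import hypergeom

T, S, t, s = 562, 330, 136, 108    # total, soluble, rule coverage, rule successes
p = hypergeom.sf(s - 1, T, S, t)   # survival function: P(X >= s)
print(f"{p:.0e}")                  # on the order of 10**-9, in line with the text
```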
The probability that a random variable X, with mean 0, lies within a certain confidence range of width 2z is P(−z < X < z) = c. For a normal distribution the values of the confidence c and the corresponding values of z are given in standard tables. In our case we want to standardize s/t. To do this we first assert that the observed success rate s/t is generated from a Bernoulli process with success rate b. If t trials are taken, the expected value of the random variable s/t is the mean of the Bernoulli process, b, and its standard deviation is:

\sigma = \sqrt{\frac{b(1-b)}{t}}
The variable s/t can be standardized by subtracting the mean b and dividing by the standard deviation. The standardized random variable X is defined as:

X = \frac{s/t - b}{\sqrt{b(1-b)/t}}
Assuming that t is large enough, the distribution of X approaches the normal distribution. As mentioned above, the probability that the random variable X with mean 0 lies within a certain confidence range of width 2z is P(−z < X < z) = c, or explicitly:

P\left(-z < \frac{s/t - b}{\sqrt{b(1-b)/t}} < z\right) = c
Choosing a confidence probability c corresponds to a particular value of z [note that standard Z values are given for the one-tailed P(X ≥ z); for example, P(X ≥ z) = 5% corresponds to P(−z < X < z) = 90%]. Solving for the value of b will give us the range of success rates and we will choose the lower bound to find the pessimistic error rate (success rate + error rate = 1; taking the largest error, which corresponds to the smallest success rate, yields the pessimistic error rate).
Inspecting the argument of the above equation, we can solve for b at the boundaries +z and −z, i.e.

b = \frac{\frac{s}{t} + \frac{z^2}{2t} \pm z\sqrt{\frac{(s/t)(1-s/t)}{t} + \frac{z^2}{4t^2}}}{1 + \frac{z^2}{t}}
Then we can express the confidence range as:

[\, b_{-},\; b_{+} \,]

where b_{−} and b_{+} are the solutions obtained with −z and +z, respectively,
and take the lower limit. Taking the pessimistic lower bound estimate for a 95% confidence interval gives an overall 0.71 success ratio, in contrast to the default rule at the root of the tree, which has a success rate of 330/562 (0.59). The probability of this rule occurring by chance is <0.1%.
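The lower bound computation itself reduces to a few lines. With the 108/136 rule and z = 1.96, the sketch below yields roughly 0.72, in good agreement with the 0.71 reported above (small differences can arise from rounding or continuity corrections).

```python
from math import sqrt

def pessimistic_success(s, t, z=1.96):
    """Lower limit of the confidence range derived above for an observed
    success rate s/t (the pessimistic estimate)."""
    f = s / t
    radicand = f * (1 - f) / t + z * z / (4 * t * t)
    return (f + z * z / (2 * t) - z * sqrt(radicand)) / (1 + z * z / t)

print(round(pessimistic_success(108, 136), 2))  # ~0.72 vs 108/136 = 0.79 observed
```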
A statistically valid approach to estimating the true error (e_t) of a hypothesis within a 95% confidence interval is given in terms of the sample error (e_s) and the sample size n (n > 30):

e_t = e_s \pm 1.96\sqrt{\frac{e_s(1-e_s)}{n}}
In addition to cross-validation and error estimation, model combination techniques were applied using decision trees derived from random subsets of the available data. These included bagging (bootstrap aggregating) and boosting (31); in boosting, each new model is influenced by the performance of those built previously and is trained to classify instances handled incorrectly by earlier ones. No significant improvement in prediction rates was found with any of these approaches. Similarly, stacking several different classifiers, such as a decision tree and a support vector machine, under a higher level meta-learner (e.g. another decision tree classifier) also did not change the prediction accuracy.
Identification of potential crystallization targets
We also performed machine learning analyses on other aspects of the proteomics data set, such as the potential for crystallization. An example decision tree built to classify 64 proteins based on their tendency to crystallize is shown in Figure 6. From this result it appears that the top level rule in the tree, aspartate composition of greater or less than 4.5%, is a discriminating feature. Significantly less data is available for this classification task than for solubility prediction, hence, these preliminary results are not statistically robust. When more data becomes available we should be able to derive rules relating other protein attributes using the decision tree approach.
DISCUSSION
By its nature, large scale genomics and proteomics research cannot be performed by a conventional single investigator laboratory. It will be carried out in large central facilities or via consortia of many laboratories. Our system is designed to facilitate the latter research model. This approach enables not only integration of data from various sources, but also formulation of statistical predictions of various macromolecular properties, which can potentially enhance the efficiency of laboratory research.
In particular, decision tree models feature a number of practical advantages, such as the straightforward interpretation of results, ability to mix numeric and symbolic features and invariance to numeric scaling. The ability to devise prediction rules from the paths through the tree is perhaps the most powerful feature of this approach. Eventually we plan to do a comparative study of several machine learning algorithms, to assess the capabilities of various methods for predicting macromolecular properties of new proteins based on the training sets produced by the database.
Database extensions: sparse data records and multiple expression constructs
The prototype database system is currently implemented as a multi-table relational model. The limited scalability of this design may become problematic as the system expands to capture more diverse experimental data, resulting in a larger number of unused fields. To circumvent this sparse matrix problem future versions of the system are moving towards the entity attribute value (EAV) representation (32). This design would allow various sets of database fields to be accessed and updated, depending on the type of experiments typically performed by different investigators. Efforts are ongoing to standardize and incorporate more experimental data into this format, so that computational methods can be applied to a wider range of features.
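A minimal sqlite3 sketch of the EAV idea follows (the attribute names are hypothetical): every fact about a construct becomes one (entity, attribute, value) row, so new experimental fields require no schema change and sparse records carry no empty columns.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE eav (
    entity    INTEGER,   -- construct ID
    attribute TEXT,      -- e.g. 'expression_level', 'hsqc_url'
    value     TEXT)""")
con.executemany("INSERT INTO eav VALUES (?, ?, ?)",
                [(1, "solubility", "soluble"),
                 (1, "expression_level", "high"),
                 (2, "hsqc_url", "http://fileserver/spectra/2.ucsf")])
print(con.execute("SELECT attribute, value FROM eav WHERE entity = 1").fetchall())
```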
A related issue concerns the way multiple expression constructs having a shared protein target should be considered for analysis. In order to predict various properties of proteins it may be necessary in some cases to collapse the data from related expression constructs to the target protein level. Although this problem was not encountered with the data sets used for the studies presented here, it remains to be seen which approaches are most suitable for handling instances with this type of complexity.
In the future the results of data mining analysis may be incorporated directly into the database web site, instead of being computed off-line. This more explicit integration could allow investigators to perform computational predictions on target proteins as they are entered into the system.
Future directions: global surveys
SPINE currently focuses on the front end of large scale proteomics efforts, collecting the experimental data generated before structures have been determined. As the NESGC project matures we anticipate that the database will incorporate more and more information about completed protein structures. The analytical theme will then shift from optimization of high-throughput structure determination to presenting a global survey of protein folds in various genomes, similar in spirit to a number of previous studies (33–36).
REFERENCES
2 Tateno,Y., Miyazaki,S., Ota,M., Sugawara,H. and Gojobori,T. (2000) DNA Data Bank of Japan (DDBJ) in collaboration with mass sequencing teams. Nucleic Acids Res., 28, 24–26.
3 Baker,W., van den Broek,A., Camon,E., Hingamp,P., Sterk,P., Stoesser,G. and Tuli,M.A. (2000) The EMBL nucleotide sequence database. Nucleic Acids Res., 28, 19–23.
4 Barker,W.C., Garavelli,J.S., Huang,H., McGarvey,P.B., Orcutt,B., Srinivasarao,G.Y., Xiao,C., Yeh,L.S., Ledley,R.S., Janda,J.F., Pfeiffer,F., Mewes,H.W., Tsugita,A. and Wu,C. (2000) The Protein Information Resource (PIR). Nucleic Acids Res., 28, 41–44.
5 Bairoch,A. and Apweiler,R. (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., 28, 45–48.
6 Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.
7 Cherry,J.M., Adler,C., Ball,C., Chervitz,S.A., Dwight,S.S., Hester,E.T., Jia,Y., Juvik,G., Roe,T., Schroeder,M., Weng,S. and Botstein,D. (1998) SGD: Saccharomyces Genome Database. Nucleic Acids Res., 26, 73–80.
8 Mewes,H.W., Frishman,D., Gruber,C., Geier,B., Haase,D., Kaps,A., Lemcke,K., Mannhaupt,G., Pfeiffer,F., Schuller,C., Stocker,S. and Weil,B. (2000) MIPS: a database for genomes and protein sequences. Nucleic Acids Res., 28, 37–40.
9 Gelbart,W.M., Crosby,M., Matthews,B., Rindone,W.P., Chillemi,J., Russo Twombly,S., Emmert,D., Ashburner,M., Drysdale,R.A., Whitfield,E., Millburn,G.H., de Grey,A., Kaufman,T., Matthews,K., Gilbert,D., Strelets,V. and Tolstoshev,C. (1997) FlyBase: a Drosophila database. The FlyBase consortium. Nucleic Acids Res., 25, 63–66.
10 Frishman,D., Heumann,K., Lesk,A. and Mewes,H.W. (1998) Comprehensive, comprehensible, distributed and intelligent databases: current status. Bioinformatics, 14, 551–561.
11 Tatusov,R.L., Koonin,E.V. and Lipman,D.J. (1997) A genomic perspective on protein families. Science, 278, 631–637.
12 Aach,J., Rindone,W. and Church,G.M. (2000) Systematic management and analysis of yeast gene expression data. Genome Res., 10, 431–445.
13 Xenarios,I., Rice,D.W., Salwinski,L., Baron,M.K., Marcotte,E.M. and Eisenberg,D. (2000) DIP: the database of interacting proteins. Nucleic Acids Res., 28, 289–291.
14 Bader,G.D. and Hogue,C.W. (2000) BIND—a data specification for storing and describing biomolecular interactions, molecular complexes and pathways. Bioinformatics, 16, 465–477.
15 Gerstein,M. and Krebs,W. (1998) A database of macromolecular motions. Nucleic Acids Res., 26, 4280–4290.
16 Qian,J., Stenger,B., Wilson,C.A., Lin,J., Jansen,R., Teichmann,S.A., Park,J., Krebs,W.G., Yu,H., Alexandrov,V., Echols,N. and Gerstein,M. (2001) PartsList: a web-based system for dynamically ranking protein folds based on disparate attributes, including whole-genome expression and interaction information. Nucleic Acids Res., 29, 1750–1764.
17 Gerstein,M. (1998) Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census. Proteins, 33, 518–534.
18 Lin,J. and Gerstein,M. (2000) Whole-genome trees based on the occurrence of folds and orthologs: implications for comparing genomes on different levels. Genome Res., 10, 808–818.
19 Brenner,S.E., Barken,D. and Levitt,M. (1999) The Presage database for structural genomics. Nucleic Acids Res., 27, 251–253.
20 Goodman,N., Rozen,S. and Stein,L. (1995) Labbase: a database to manage laboratory data in a large-scale genome-mapping project. IEEE Comput. Med. Biol., 14, 702–709.
21 Yona,G., Linial,N. and Linial,M. (2000) ProtoMap: automatic classification of protein sequences and hierarchy of protein families. Nucleic Acids Res., 28, 49–55.
22 Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.
23 Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) CATH—a hierarchic classification of protein domain structures. Structure, 5, 1093–1108.
24 Christendat,D., Yee,A., Dharamsi,A., Kluger,Y., Savchenko,A., Cort,J.R., Booth,V., Mackereth,C.D., Saridakis,V., Ekiel,I., Kozlov,G., Maxwell,K.L., Wu,N., McIntosh,L.P., Gehring,K., Kennedy,M.A., Davidson,A.R., Pai,E.F., Gerstein,M., Edwards,A.M. and Arrowsmith,C.H. (2000) Structural proteomics of an archaeon. Nat. Struct. Biol., 7, 903–909.
25 Quinlan,J.R. (1987) Simplifying decision trees. Int. J. Man–Machine Stud., 27, 221–234.
26 Quinlan,J.R. (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.
27 Dash,M. and Liu,H. (1997) Feature selection for classification. Intelligent Data Anal., 1, 131–156.
28 Goldberg,D.E. (1989) Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA.
29 Cortes,C. and Vapnik,V. (1995) Support-vector networks. Machine Learn., 20, 273–297.
30 Engelman,D.M., Steitz,T.A. and Goldman,A. (1986) Identifying nonpolar transbilayer helices in amino acid sequences of membrane proteins. Annu. Rev. Biophys. Biophys. Chem., 15, 321–353.
31 Quinlan,J.R. (1996) Bagging, boosting and C4.5. In Proceedings of the Thirteenth National Conference on Artificial Intelligence.
32 Nadkarni,P.M., Marenco,L., Chen,R., Skoufos,E., Shepherd,G. and Miller,P. (1999) Organization of heterogeneous scientific data using the EAV/CR representation. J. Am. Med. Inform. Assoc., 6, 478–493.
33 Gerstein,M., Lin,J. and Hegyi,H. (2000) Protein folds in the worm genome. Pac. Symp. Biocomput., 30–41.
34 Gerstein,M. and Hegyi,H. (1998) Comparing genomes in terms of protein structure: surveys of a finite parts list. FEMS Microbiol. Rev., 22, 277–304.
35 Gerstein,M. (1998) How representative are the known structures of the proteins in a complete genome? A comprehensive structural census. Fold Des., 3, 497–512.
36 Gerstein,M. (1997) A structural census of genomes: comparing bacterial, eukaryotic and archaeal genomes in terms of protein structure. J. Mol. Biol., 274, 562–576.
37 Wootton,J.C. and Federhen,S. (1996) Analysis of compositionally biased regions in sequence databases. Methods Enzymol., 266, 554–571.
38 Garnier,J., Osguthorpe,D.J. and Robson,B. (1978) Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J. Mol. Biol., 120, 97–120.