Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, CT 06520, USA; 1 Department of Molecular Biophysics and Biochemistry, Yale University, 266 Whitney Avenue, New Haven, CT 06520, USA; 2 Center for Advanced Biotechnology and Medicine and Department of Molecular Biology and Biochemistry, Rutgers University, Piscataway, NJ 08854, USA; and 3 Ontario Cancer Institute and Department of Medical Biophysics, University of Toronto, Ontario M5G 2M9, Canada
Received January 5, 2001; Revised and Accepted April 23, 2001.
INTRODUCTION
Many biological databases are developed and maintained strictly for warehousing purposes, without consideration of the analyses that may be performed on the data. Conversely, computational studies are often performed outside the context of information management, without a clear connection to biological reality. Our work explores a fusion of these two processes, where database design is influenced by analytical requirements.
Such an undertaking requires a centralized repository to integrate and manage the data generated, coupled with strategies for subsequent computational analysis. By maintaining a shared infrastructure accessible to all the participating members of a project, distributed access to large subsets of data is possible. This not only promotes collaborative effort among investigators by providing a common information exchange platform, but also avoids costly and time-consuming duplication of experimental work. Further, data is maintained in a consistent format across many laboratories and investigators, promoting further analysis.
To this end we have developed the SPINE database and analysis system, an integrated approach to interactive database system design and computational analysis in a distributed framework, using the recently formed Northeast Structural Genomics Consortium (www.nesg.org) as a model for multi-laboratory collaborative research. The system is designed to generate standardized data files from user-definable subsets of the proteomics information entered into the database, which are then used for classification tasks. Key issues in effective data mining are introduced, with emphasis on decision trees, a supervised machine learning approach. We conclude with a discussion of prediction results for several macromolecular properties based on features derived from the database contents.
DATABASE SYSTEM REQUIREMENTS
An obvious candidate for the fundamental database unit is the protein. However, in certain instances homologous proteins from other organisms prove more experimentally tractable than the actual target; this scenario would be a source of confusion when maintaining a resource based on proteins. An alternative is to focus on the expression construct made for a given protein. Multiple constructs can be made for a single protein, because a construct could be designed to express only a single domain from a complex protein or contain a slightly altered protein sequence that aids in protein production and structure determination. This one-to-many relationship between target proteins and their associated expression constructs would imply that several database entries might be related to the same target. A third option is to use the specific preparation associated with each experiment, where a database record could represent a set of conditions by which a protein sample is prepared. An immediate concern with this representation is that protein preparations will vary constantly, requiring an unforeseeable number of relational tables to accommodate their parameters.
Because multiple constructs can be generated for each target, a representation based on individual proteins is too limited for our purposes. Conversely, experimental conditions for individual protein preparations are highly variable, so this information is best compiled separately. Of these candidates, the expression construct captures the most appropriate level of detail for this project. It was therefore selected as the basic unit tracked by the database, essentially recording the best experimental results for the expression, purification and characterization of each target protein.
To address these requirements, software components were developed for entry and updating of expression construct records, database searching, bulk data retrieval and tracking the global progress of the entire project. Intuitive HTML form-based interfaces were implemented to facilitate distributed Internet access from participating laboratories. The implementation of the database system is described in detail in Figure 2. The modular organization of software components permits relatively straightforward implementation of additional functionality. This aspect is independent of the underlying database architecture, allowing a great deal of programming flexibility while maintaining strict compliance with the standardized data types established for various experimental parameters.
To accommodate the needs of various Consortium projects where different experimental methodologies are used, principal investigators from several laboratories were involved in the process of selecting the most appropriate information to be tracked by the system. Fields from existing data sets were used to develop a consensus of experimental parameters and this was adapted to the current database framework.
A listing of the fields used for the prototype database is shown in Table 1. The information maintained by the system initially comprised 63 fields covering protein sequences, cloning parameters, expression level, purification yield and data derived from biophysical characterization and structural biology experiments (oligomerization, CD, HSQC, NMR and X-ray crystallography). In addition, a number of fields are devoted to keeping track of the laboratory and investigator responsible for working with the target protein, dates when experiments were performed, comments relating to experimental conditions for each group of related fields and variable access levels for individual database records. The database is not intended to manage all aspects of experimental research; rather, it is designed to standardize and track key parameters related to structural proteomics. However, the system does include user accounts, transaction history information and some laboratory management tables.
For many experimental methods a data file is generated comprising an entire set of results distinct from the information tracked by the main database. The inclusion of these parameters into the existing infrastructure would be beyond the scope of the system; HSQC spectra, X-ray diffraction data and NMR assignments can span large files that would be impractical to incorporate directly into database tables. Instead, these are stored on a separate file server and the associated URL addresses are recorded in construct records and linked to each record display. Thus, a key feature of the system is maintaining a central collection of pointers to additional experimental data sets. This mechanism is, of course, extended to allow pointers into other information repositories associated with the project, for instance into a crystallization database or a list of targets. We also link the system with other protein sequence and structure resources, such as SWISS-PROT (5), PartsList (16), GeneCensus (17,18), ProtoMap (21), SCOP (22) and CATH (23).
User interaction and dynamic content modification
The design of the system's front end allows expression construct records to be entered, edited and retrieved by individual users without frequent intervention of a database curator. An important goal in this work is to design a system that works in a practical laboratory setting, i.e. the software is operationally robust and straightforward, so that using it on a regular basis will not disrupt work flow. The system provides a consistent and intuitive user interface to complex database functions, as well as error recovery features when conflicting or incomplete information is submitted. Search functions were developed for the intelligent retrieval and display of information from the database, as well as the ability to generate bulk dumps of large subsets of data records and protein sequences in interchangeable file formats, including CSV and XML.
As experimental work progresses on a given target, additional data is collected which may have been unavailable at the time its expression construct record was created. Therefore, an essential requirement of the database is the ability to recall records to alter or augment their associated information. Consequently, the contents of individual database records are changing over time in a user-mediated fashion, in contrast to more archive-oriented resources. This imposes additional sets of operational considerations, requiring provisions to ensure internal ID consistency and overwrite protection when users enter or modify database records.
SYSTEM FUNCTIONALITY
Searching the database and visualizing progress
The retrieval of records from the database is accomplished through the use of a search engine interface (Fig. 5A), where a variety of terms may be selected and combined with Boolean connectives. Based on the values of the elements submitted via the interface form, the software builds an SQL query to execute against the database and returns any records matching the search terms (Fig. 5B). The subset of database records returned by the search may be optionally downloaded as a CSV formatted text file, suitable for importing into another database or spreadsheet program. Individual expression construct records are displayed in a static web page (Fig. 5C), with database fields organized in a logical hierarchy. A number of local and distributed Internet resources are automatically linked to record display pages, such as Protein Data Bank searching, organism-specific databases and specialized structural annotation reports.
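As an illustration of this step, the minimal Python sketch below assembles a parameterized SQL statement from form terms joined by a Boolean connective; the table and field names are hypothetical stand-ins, not the system's actual schema or database engine.

```python
def build_query(terms, connective="AND"):
    """Build a parameterized SQL statement from (field, operator, value)
    search terms joined by a Boolean connective (AND/OR)."""
    allowed_fields = {"organism", "expression_level", "solubility"}  # hypothetical schema
    allowed_ops = {"=", "<", ">", "LIKE"}
    clauses, params = [], []
    for field, op, value in terms:
        if field not in allowed_fields or op not in allowed_ops:
            raise ValueError(f"disallowed search term: {field} {op}")
        clauses.append(f"{field} {op} ?")  # placeholders keep user values out of the SQL text
        params.append(value)
    sql = "SELECT * FROM constructs WHERE " + f" {connective} ".join(clauses)
    return sql, params

sql, params = build_query([("organism", "=", "M. thermoautotrophicum"),
                           ("solubility", "=", "soluble")])
print(sql)  # SELECT * FROM constructs WHERE organism = ? AND solubility = ?
```

Validating field names against a fixed whitelist and binding values through placeholders keeps a form-driven query builder safe against injection while still allowing arbitrary Boolean combinations.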
DATA MINING APPLICATIONS FOR HIGH-THROUGHPUT PROTEOMICS
Machine learning concepts
The term machine learning applies to a wide range of computational methodologies. However, the models most suitable for our applications belong to the class of algorithms that employ supervised learning. Under supervised learning the classification process consists of two phases: training and testing. The set of all available examples or instances (formally termed input vectors) is divided into two non-intersecting sets. The first set is used to train the model. During this phase correct classification of the examples is known a priori. Supervised learning strategies rely on this information to adjust the performance of the model until the classification error rate is sufficiently reduced. Learning is no longer performed after training is completed; instead, unseen instances in the test set are classified according to the partitioning established during the training phase. The performance of a learning algorithm is determined by its ability to correctly classify new instances not present in the initial training set.
The features of each sample can be represented as a vector that corresponds to a point in an n-dimensional space. Classification is then performed by partitioning this feature space into regions, where most of the points in a region correspond to a particular category. The goal in training classifiers is to find an optimal partitioning of the input space separating the highest number of disparate examples. An ideal classifier will demonstrate strong predictive power, while explaining the relationships between the variable to be predicted and the variables comprising the feature space.
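A minimal sketch of this two-phase scheme follows, assuming present-day scikit-learn as a stand-in for the tooling actually used and synthetic data in place of real feature vectors.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((562, 42))        # synthetic stand-in: 562 instances x 42 features
y = (X[:, 0] > 0.5).astype(int)  # synthetic class labels, known a priori

# Training phase: the model is adjusted on examples whose classes are known
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Testing phase: no further learning; unseen instances are classified and
# accuracy on them measures generalization
print("test accuracy:", clf.score(X_test, y_test))
```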
Machine learning applications to proteomics data
One constraint on a proteomics feature set is the time frame in which classifications are performed. In many cases the experimental results are serially related, constraining the composition of useful training sets to expression constructs for which the property to be predicted has already been measured. For example, one cannot expect to optimally classify crystallization targets if the available training set contains experimental results only up to the expression stage, because the available feature set will not contain a response variable for crystallization. Conversely, it is entirely possible to classify proteins based on some property corresponding to an earlier experimental stage, e.g. solubility data has already been gathered for proteins having HSQC spectra, enabling one to train a classifier to partition these proteins based on solubility information.
While there are many possible issues that data mining can address in relation to the proteomics data collected by the Consortium, we have focused on protein solubility prediction due to the importance of this property and the availability of a large set of Methanobacterium thermoautotrophicum expression construct records having solubility measurements. The size of this data set provides the best opportunity for generalization during training, increasing an algorithm's prediction success when presented with new examples. An accurate prediction method for this property can also be an extremely useful tool, as insolubility accounts for almost 60% of experimentally recalcitrant proteins (24). Here 'solubility' means soluble in the cell extract, a property that is correlated with, but not necessarily identical to, the solubility of a purified protein.
In a supervised learning approach for solubility prediction the training set consists of a subset of input vectors extracted from the database and is used by the classifier model to partition the sample space based on solubility, the dependent variable to be predicted. After training the feature space will be partitioned into two regions: one containing proteins labeled as soluble and another with proteins labeled as insoluble. The second part of this approach is to determine a trained classifier's ability to generalize to unseen examples, by presenting the model with a test set containing new feature vectors and re-evaluating its performance.
Methanobacterium thermoautotrophicum data set
A data set comprising 562 proteins from the M.thermoautotrophicum genome was compiled from the database and used for machine learning. Although SPINE currently holds 740 construct entries for this organism, 178 of these do not have solubility information and thus are not suitable for classification. As summarized in Table 2, a total of 42 features were extracted from the corresponding protein sequences, such as amino acid composition, hydrophobicity, occurrence of low complexity regions, secondary structure, etc. Combined with the database fields highlighted in Table 1, these features comprise the input vector used for the classification study presented here.
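For illustration, the sequence-derived portion of such an input vector might be computed along the following lines; this sketch covers only composition-type features and a few combined residue classes, not the full Table 2 set.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def sequence_features(seq):
    """Fractional amino acid composition plus combined-class compositions
    of the kind listed in Table 2 (a subset, for illustration only)."""
    seq = seq.upper()
    counts = Counter(seq)
    n = len(seq)
    feats = {aa: counts[aa] / n for aa in AMINO_ACIDS}
    feats["DE"] = feats["D"] + feats["E"]                  # acidic residues
    feats["KR"] = feats["K"] + feats["R"]                  # basic residues
    feats["FYW"] = feats["F"] + feats["Y"] + feats["W"]    # aromatic residues
    feats["DENQ"] = feats["DE"] + feats["N"] + feats["Q"]  # acidic + amides
    return feats

print(sequence_features("MDEKLLYITAE")["DE"])  # fraction of D+E in a toy sequence
```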
It should be noted that prediction results for a proteomics data set may exhibit some degree of specificity to the expression vectors and experimental conditions of cell growth, induction, etc. used for protein production. A characteristic of this specific M.thermoautotrophicum data is the uniform set of conditions that were used to prepare protein samples (26). Additionally, the experimental targets selected by the Consortium consist largely of non-membrane proteins, so the available data set is biased in this regard.
DECISION TREE ANALYSIS
Model description
Decision tree learning (25,26) is a widely used and effective method that can partition data that is not linearly separable (Fig. 6). An individual object of unknown type may be classified by traversing the tree. At each internal node a test is performed on the object's value for the feature expressed at that node (often called the splitting variable, e.g. alanine composition). Based on this value, the appropriate branch is followed to the next node. This procedure continues until a leaf node is reached and the object's classification is determined. In classifying a given object a variable number of evaluations may be performed or omitted, depending on the path taken when the tree is traversed. In this manner a heuristic search is performed to find a compact, consistent solution that generalizes to unseen examples.
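The traversal itself can be stated compactly. In the sketch below, internal nodes are dicts holding a splitting variable and threshold and leaves hold class labels; the two-level tree shown is hypothetical, merely echoing the kinds of splits discussed later.

```python
def classify(tree, x):
    """Classify an object x (a dict of feature values) by traversing a
    decision tree represented as nested dicts."""
    node = tree
    while "label" not in node:  # descend until a leaf is reached
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

# Hypothetical two-level tree, for illustration only
tree = {"feature": "DE", "threshold": 0.18,
        "left":  {"label": "insoluble"},
        "right": {"feature": "hydrophobic_stretch", "threshold": 0.5,
                  "left":  {"label": "soluble"},
                  "right": {"label": "insoluble"}}}

print(classify(tree, {"DE": 0.21, "hydrophobic_stretch": 0.0}))  # -> soluble
```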
A number of advantages are evident in the decision tree model. Classification can be based on an arbitrary mixture of symbolic and numeric variables and (for axis-parallel splitting) one is not required to scale the variables relative to each other. The model is generally robust when presented with missing values. In addition, straightforward and concise rules can be inferred from the tree by following the path from root to leaf nodes.
Feature selection
We used decision trees to partition the 562 M.thermoautotrophicum proteins into soluble and insoluble classes, based on a subset of the features listed in Table 2. The features that are relevant to a given problem domain are often unknown a priori and removing those which are redundant or irrelevant can produce a simpler model which generalizes better to unseen examples. Automated feature selection attempts to find a minimal subset of the available features, in order to either improve classification performance or to simplify the model's structure while preserving prediction accuracy (27). Typically, a search algorithm is used to partition the available feature set. Classifiers are then trained on the feature combinations presented by the search algorithm to identify those features which have the greatest impact on learning. In our case we used a genetic algorithm (28) to search the space of possible feature combinations; the relevance of individual feature subsets was estimated with several machine learning methods, including decision trees and support vector machines (29). We arrived at a feature subset consisting of the amino acids E, I, T and Y, combined compositions of basic (KR), acidic (DE) and aromatic (FYW) residues, the acidic residues with their amides (DENQ), the presence of signal sequences and hydrophobic regions, secondary structure features and low complexity elements. These are highlighted in Table 2.
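A toy rendering of this wrapper approach is sketched below: bitmask chromosomes encode feature subsets and the fitness of a subset is the cross-validated accuracy of a decision tree restricted to those features. It illustrates the general technique only; the actual search procedure and parameters used by the Consortium are not specified here.

```python
import random
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def ga_feature_search(X, y, n_gen=20, pop_size=30, seed=0):
    """Genetic search over feature-subset bitmasks; fitness is the
    cross-validated accuracy of a tree trained on that subset."""
    rng = random.Random(seed)
    n = X.shape[1]

    def fitness(mask):
        cols = [i for i in range(n) if mask[i]]
        if not cols:
            return 0.0
        tree = DecisionTreeClassifier(random_state=0)
        return cross_val_score(tree, X[:, cols], y, cv=3).mean()

    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(n_gen):
        parents = sorted(pop, key=fitness, reverse=True)[:pop_size // 2]  # selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)        # one-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(n)] ^= 1     # point mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

X = np.random.default_rng(0).random((200, 10))
y = (X[:, 2] + X[:, 7] > 1).astype(int)  # two informative features
print(ga_feature_search(X, y))           # the mask should favor features 2 and 7
```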
Decision tree results
The trees that were trained on this data set had a misclassification rate of 12%. Decision trees built on the training set are always overly optimistic and contain a large number of nodes. Only the upper region of the tree is significant in terms of yielding a generalized concept and this is the segment from which useful rules can be derived. After training and pruning of the decision trees we extracted several classification rules for distinguishing between soluble and insoluble proteins, as described in Figure 7. Two trees are shown in this example. Figure 7A illustrates the upper five levels of a decision tree built on the entire set of 562 proteins and subjected to cross-validation. The tree in Figure 7B was trained on a 375 protein subset of the data and tested with the remaining 187.
The decision tree depicted in Figure 7B further isolates the two most discriminating features: acidic residue composition and the presence of a hydrophobic stretch. Aside from their metal ion binding abilities, aspartic and glutamic acid are negatively charged due to their carboxyl side chains. These highly polar residues are often found on the surface of globular proteins, where they can interact favorably with solvent molecules. They have, in fact, the highest charge density per atom of all the amino acids, a property obviously associated with solubility. The hydrophobic region identified is not long enough to be considered a transmembrane helix, but clearly identifies an adhesive area of the protein.
Decision tree learning produces a variety of tree topologies depending on the specific data and features used for training. We divided the 562 protein data set into random training and testing sets of 66 and 33% of the input vectors, respectively, and built decision trees using all of the available features. A number of interesting patterns emerge based on the utilization of classification features in various trees. Examining the decision tree paths reveals intricate sorting based on amino acid composition in addition to the most widely used features. For example, a rule was discovered which selects soluble proteins having >18% DE composition, <8% arginine and >3% lysine residues. Another tree exhibited similar prediction success by combining arginine and lysine into a common splitting variable immediately following the 18% DE rule, identifying soluble proteins having <14% KR composition. However, aspartic and glutamic acids were then isolated in lower levels of the tree, achieving a finer partitioning by sorting on the individual acidic residues.
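The first rule quoted above is easily restated as a plain predicate over composition fractions, which is exactly what makes such rules portable:

```python
def rule_soluble(comp):
    """Decision path quoted in the text: >18% D+E, <8% R, >3% K -> soluble."""
    return (comp["D"] + comp["E"] > 0.18
            and comp["R"] < 0.08
            and comp["K"] > 0.03)

print(rule_soluble({"D": 0.10, "E": 0.10, "R": 0.05, "K": 0.06}))  # True
```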
Cross-validation
Overfitting can occur when a model performs well on the training set but fails to generalize to unseen examples. In these instances the algorithm has partitioned the data too finely and has mapped a decision surface to the training data that too closely follows intricacies in the feature space without extracting the underlying trends, essentially memorizing the training set. In practice we can say that if an alternative learning solution exists with a higher error rate but generalizes better over all available input vectors, overfitting has occurred. One way of studying (and hence subsequently preventing) overfitting is cross-validation, which gives an estimate of the accuracy of a classifier based on re-sampling.
Stratified 10-fold cross-validation was performed on the decision trees, where each successive application of the learning procedure used a different 90% of the data set for training and the remaining 10% for testing. Each of these training sets produced different trees from those constructed based on the entire data set. Using the testing sets for validation with their corresponding tree models, we took the sum of the number of incorrect classifications obtained from each of the 10 test subsets and divided that sum by the total number of instances in the data set, producing the estimated error for the entire tree. This cross-validation approach resulted in an overall prediction success of 61–65% over the various data subsets. This does not correspond directly to the decision tree performance based on the entire data set, as cross-validation results are produced from many different partitions of the training and testing sets.
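The pooled error computation described above can be written as the following sketch (again assuming scikit-learn in place of the original tooling):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def cv_error(X, y, folds=10):
    """Sum misclassifications over stratified folds, then divide by the
    total number of instances, as described in the text."""
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=0)
    wrong = 0
    for train_idx, test_idx in skf.split(X, y):
        clf = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
        wrong += int((clf.predict(X[test_idx]) != y[test_idx]).sum())
    return wrong / len(y)
```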
While typically used for error estimation, cross-validation is not optimal for medium sized or mesoscale data sets, such as our proteomics set. This is because the procedure excludes a large fraction of the data during training, resulting in insufficiently sized testing sets. Consequently, other non-cross-validated estimates of classification error have been developed. In the next section we apply one such method, called pessimistic error estimation (25).
Rule assessment
Regardless of the specific method of error estimation used, some paths, i.e. sequences of rules, within the decision tree may perform significantly better than others. These rules provide a straightforward way for others to apply the classification results in a practical context. Moreover, a few simple rules extracted from the tree may be considerably more robust to changes in the underlying data than the original tree topology. Consequently, we describe in this section a way to measure the quality of a particular rule, in contrast to the overall estimate of a tree's performance reported above.
For this rule assessment we do not perform cross-validation at all, due to the scarcity of the data underlying any particular rule. Instead, we use Quinlan's pessimistic estimation method, calculating a rule's accuracy over the training examples to which it applies and then calculating the standard deviation in this estimated accuracy assuming a binomial distribution. More specifically, given the set C of training cases at node Q, its majority class and the number of cases outside that class, error-based pruning interprets C as a binomially distributed sample with well-defined confidence limits and estimates the error rate at Q as the upper limit on its posterior probability distribution. Equivalently, for a given confidence level the lower bound estimate of the success rate is then taken as the measure of rule performance.
The default accuracy of choosing a soluble protein in our data set is defined by S/T, where S is the number of soluble proteins and T is the total number of proteins. The accuracy of the rule that predicts solubility is s/t, where s is the number of soluble proteins reaching the leaf node at the end of a decision path and t is the total number of proteins reaching that node. It is straightforward to evaluate the probability that a randomly chosen rule will do as well as or better than a decision rule with accuracy s/t. This probability is given by:

P = \sum_{i=s}^{\min(t,S)} \frac{\binom{S}{i}\binom{T-S}{t-i}}{\binom{T}{t}}
Note that the sum is over the hypergeometric distribution. Small values of this measure correspond to good rules because this means there is a small probability that a rule has arisen by chance.
For example, the branching of the tree at the root, based on the condition that the overall composition of aspartate and glutamate in protein sequences is >18%, defines a rule which classifies many proteins as soluble. This rule has an observed accuracy of 108/136 (0.79) over the training set and a probability of 6 × 10^−9 of arising by chance. We must take into account the fact that the observed accuracy is overly optimistic and correct it by subtracting 1.96 times the binomial standard deviation (for the lower bound of a 95% confidence interval). For t > 30 the binomial distribution can be approximated by the normal distribution.
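Using the numbers quoted in the text (the 108/136 rule, with 330 soluble proteins among the 562 in the data set), this chance probability can be checked directly from the hypergeometric tail; scipy is assumed here for the distribution.

```python
from scipy.stats import hypergeom

T, S, t, s = 562, 330, 136, 108    # total, soluble, rule coverage, rule successes
p = hypergeom.sf(s - 1, T, S, t)   # survival function: P(X >= s)
print(f"{p:.0e}")                  # on the order of 10**-9, in line with the text
```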
The probability that a random variable X, with mean 0, lies within a certain confidence range of width 2z is P(−z < X < z) = c. For a normal distribution the values of the confidence c and the corresponding values of z are given in standard tables. In our case we want to standardize s/t. To do this we first assert that the observed success rate s/t is generated from a Bernoulli process with success rate b. If t trials are taken, the expected value of the random variable s/t is the mean of the Bernoulli process, b, and its standard deviation is:

\sigma = \sqrt{\frac{b(1-b)}{t}}
The variable s/t can be standardized by subtracting the mean b and dividing by the standard deviation. The standardized random variable X is defined as:

X = \frac{s/t - b}{\sqrt{b(1-b)/t}}
Assuming that t is large enough, the distribution of X approaches the normal distribution. As mentioned above, the probability that the random variable X with mean 0 lies within a certain confidence range of width 2z is P(−z < X < z) = c, or explicitly:

P\left(-z < \frac{s/t - b}{\sqrt{b(1-b)/t}} < z\right) = c
Choosing a confidence probability c corresponds to a particular value of z [note that standard Z values are given for the one-tailed P(X ≥ z); for example, P(X ≥ z) = 5% corresponds to P(−z < X < z) = 90%]. Solving for the value of b will give us the range of success rates and we will choose the lower bound to find the pessimistic error rate (success rate + error rate = 1; taking the largest error, which corresponds to the smallest success rate, yields the pessimistic error rate).
Inspecting the argument of the above equation, we can solve for b at the boundaries +z and −z, i.e.

b = \frac{\frac{s}{t} + \frac{z^2}{2t} \pm z\sqrt{\frac{(s/t)(1-s/t)}{t} + \frac{z^2}{4t^2}}}{1 + \frac{z^2}{t}}
Then we can express the confidence range as:

[\, b_{-},\; b_{+} \,]

where b_{−} and b_{+} are the solutions obtained with −z and +z, respectively,
and take the lower limit. Taking the pessimistic lower bound estimate for a 95% confidence interval gives an overall 0.71 success ratio, in contrast to the default rule at the root of the tree, which has a success rate of 330/562 (0.59). The probability of this rule occurring by chance is <0.1%.
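The lower bound computation itself reduces to a few lines. With the 108/136 rule and z = 1.96, the sketch below yields roughly 0.72, in good agreement with the 0.71 reported above (small differences can arise from rounding or continuity corrections).

```python
from math import sqrt

def pessimistic_success(s, t, z=1.96):
    """Lower limit of the confidence range derived above for an observed
    success rate s/t (the pessimistic estimate)."""
    f = s / t
    radicand = f * (1 - f) / t + z * z / (4 * t * t)
    return (f + z * z / (2 * t) - z * sqrt(radicand)) / (1 + z * z / t)

print(round(pessimistic_success(108, 136), 2))  # ~0.72 vs 108/136 = 0.79 observed
```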
A statistically valid approach to estimating the true error (e_t) of a hypothesis within a 95% confidence interval is given in terms of the sample error (e_s) and the sample size n (n > 30):

e_t = e_s \pm 1.96\sqrt{\frac{e_s(1-e_s)}{n}}
In addition to cross-validation and error estimation, model combination techniques were applied using decision trees derived from random subsets of the available data. These included bagging (bootstrap aggregating) and boosting (31); in boosting, each new model is influenced by the performance of those built previously and is trained to classify instances handled incorrectly by earlier ones. No significant improvement in prediction rates was found with any of these approaches. Similarly, stacking several different classifiers, such as a decision tree and a support vector machine, under a higher level meta-learner (e.g. another decision tree classifier) also did not change the prediction accuracy.
Identification of potential crystallization targets
We also performed machine learning analyses on other aspects of the proteomics data set, such as the potential for crystallization. An example decision tree built to classify 64 proteins based on their tendency to crystallize is shown in Figure 6. From this result it appears that the top level rule in the tree, aspartate composition of greater or less than 4.5%, is a discriminating feature. Significantly less data is available for this classification task than for solubility prediction, hence, these preliminary results are not statistically robust. When more data becomes available we should be able to derive rules relating other protein attributes using the decision tree approach.
DISCUSSION
By its nature, large scale genomics and proteomics research cannot be performed by a conventional single investigator laboratory. It will be carried out in large central facilities or via consortia of many laboratories. Our system is designed to facilitate the latter research model. This approach enables not only integration of data from various sources, but also formulation of statistical predictions of various macromolecular properties, which can potentially enhance the efficiency of laboratory research.
In particular, decision tree models feature a number of practical advantages, such as the straightforward interpretation of results, ability to mix numeric and symbolic features and invariance to numeric scaling. The ability to devise prediction rules from the paths through the tree is perhaps the most powerful feature of this approach. Eventually we plan to do a comparative study of several machine learning algorithms, to assess the capabilities of various methods for predicting macromolecular properties of new proteins based on the training sets produced by the database.
Database extensions: sparse data records and multiple expression constructs
The prototype database system is currently implemented as a multi-table relational model. The limited scalability of this design may become problematic as the system expands to capture more diverse experimental data, resulting in a larger number of unused fields. To circumvent this sparse matrix problem future versions of the system are moving towards the entity attribute value (EAV) representation (32). This design would allow various sets of database fields to be accessed and updated, depending on the type of experiments typically performed by different investigators. Efforts are ongoing to standardize and incorporate more experimental data into this format, so that computational methods can be applied to a wider range of features.
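A minimal sqlite3 sketch of the EAV idea follows (the attribute names are hypothetical): every fact about a construct becomes one (entity, attribute, value) row, so new experimental fields require no schema change and sparse records carry no empty columns.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE eav (
    entity    INTEGER,   -- construct ID
    attribute TEXT,      -- e.g. 'expression_level', 'hsqc_url'
    value     TEXT)""")
con.executemany("INSERT INTO eav VALUES (?, ?, ?)",
                [(1, "solubility", "soluble"),
                 (1, "expression_level", "high"),
                 (2, "hsqc_url", "http://fileserver/spectra/2.ucsf")])
print(con.execute("SELECT attribute, value FROM eav WHERE entity = 1").fetchall())
```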
A related issue concerns the way multiple expression constructs having a shared protein target should be considered for analysis. In order to predict various properties of proteins it may be necessary in some cases to collapse the data from related expression constructs to the target protein level. Although this problem was not encountered with the data sets used for the studies presented here, it remains to be seen which approaches are most suitable for handling instances with this type of complexity.
In the future the results of data mining analysis may be incorporated directly into the database web site, instead of being computed off-line. This more explicit integration could allow investigators to perform computational predictions on target proteins as they are entered into the system.
Future directions: global surveys
SPINE currently focuses on the front end of large scale proteomics efforts, collecting the experimental data generated before structures have been determined. As the NESGC project matures we anticipate that the database will incorporate more and more information about completed protein structures. The analytical theme will then shift from optimization of high-throughput structure determination to presenting a global survey of protein folds in various genomes, similar in spirit to a number of previous studies (33–36).
REFERENCES
2 Tateno,Y., Miyazaki,S., Ota,M., Sugawara,H. and Gojobori,T. (2000) DNA Data Bank of Japan (DDBJ) in collaboration with mass sequencing teams. Nucleic Acids Res., 28, 24–26.
3 Baker,W., van den Broek,A., Camon,E., Hingamp,P., Sterk,P., Stoesser,G. and Tuli,M.A. (2000) The EMBL nucleotide sequence database. Nucleic Acids Res., 28, 19–23.
4 Barker,W.C., Garavelli,J.S., Huang,H., McGarvey,P.B., Orcutt,B., Srinivasarao,G.Y., Xiao,C., Yeh,L.S., Ledley,R.S., Janda,J.F., Pfeiffer,F., Mewes,H.W., Tsugita,A. and Wu,C. (2000) The Protein Information Resource (PIR). Nucleic Acids Res., 28, 41–44.
5 Bairoch,A. and Apweiler,R. (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., 28, 45–48.
6 Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.
7 Cherry,J.M., Adler,C., Ball,C., Chervitz,S.A., Dwight,S.S., Hester,E.T., Jia,Y., Juvik,G., Roe,T., Schroeder,M., Weng,S. and Botstein,D. (1998) SGD: Saccharomyces Genome Database. Nucleic Acids Res., 26, 73–80.
8 Mewes,H.W., Frishman,D., Gruber,C., Geier,B., Haase,D., Kaps,A., Lemcke,K., Mannhaupt,G., Pfeiffer,F., Schuller,C., Stocker,S. and Weil,B. (2000) MIPS: a database for genomes and protein sequences. Nucleic Acids Res., 28, 37–40.
9 Gelbart,W.M., Crosby,M., Matthews,B., Rindone,W.P., Chillemi,J., Russo Twombly,S., Emmert,D., Ashburner,M., Drysdale,R.A., Whitfield,E., Millburn,G.H., de Grey,A., Kaufman,T., Matthews,K., Gilbert,D., Strelets,V. and Tolstoshev,C. (1997) FlyBase: a Drosophila database. The FlyBase consortium. Nucleic Acids Res., 25, 63–66.
10 Frishman,D., Heumann,K., Lesk,A. and Mewes,H.W. (1998) Comprehensive, comprehensible, distributed and intelligent databases: current status. Bioinformatics, 14, 551–561.
11 Tatusov,R.L., Koonin,E.V. and Lipman,D.J. (1997) A genomic perspective on protein families. Science, 278, 631–637.
12 Aach,J., Rindone,W. and Church,G.M. (2000) Systematic management and analysis of yeast gene expression data. Genome Res., 10, 431–445.
13 Xenarios,I., Rice,D.W., Salwinski,L., Baron,M.K., Marcotte,E.M. and Eisenberg,D. (2000) DIP: the database of interacting proteins. Nucleic Acids Res., 28, 289–291.
14 Bader,G.D. and Hogue,C.W. (2000) BIND—a data specification for storing and describing biomolecular interactions, molecular complexes and pathways. Bioinformatics, 16, 465–477.
15 Gerstein,M. and Krebs,W. (1998) A database of macromolecular motions. Nucleic Acids Res., 26, 4280–4290.
16 Qian,J., Stenger,B., Wilson,C.A., Lin,J., Jansen,R., Teichmann,S.A., Park,J., Krebs,W.G., Yu,H., Alexandrov,V., Echols,N. and Gerstein,M. (2001) PartsList: a web-based system for dynamically ranking protein folds based on disparate attributes, including whole-genome expression and interaction information. Nucleic Acids Res., 29, 1750–1764.
17 Gerstein,M. (1998) Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census. Proteins, 33, 518–534.
18 Lin,J. and Gerstein,M. (2000) Whole-genome trees based on the occurrence of folds and orthologs: implications for comparing genomes on different levels. Genome Res., 10, 808–818.
19 Brenner,S.E., Barken,D. and Levitt,M. (1999) The Presage database for structural genomics. Nucleic Acids Res., 27, 251–253.
20 Goodman,N., Rozen,S. and Stein,L. (1995) Labbase: a database to manage laboratory data in a large-scale genome-mapping project. IEEE Comput. Med. Biol., 14, 702–709.
21 Yona,G., Linial,N. and Linial,M. (2000) ProtoMap: automatic classification of protein sequences and hierarchy of protein families. Nucleic Acids Res., 28, 49–55.
22 Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.
23 Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) CATH—a hierarchic classification of protein domain structures. Structure, 5, 1093–1108.
24 Christendat,D., Yee,A., Dharamsi,A., Kluger,Y., Savchenko,A., Cort,J.R., Booth,V., Mackereth,C.D., Saridakis,V., Ekiel,I., Kozlov,G., Maxwell,K.L., Wu,N., McIntosh,L.P., Gehring,K., Kennedy,M.A., Davidson,A.R., Pai,E.F., Gerstein,M., Edwards,A.M. and Arrowsmith,C.H. (2000) Structural proteomics of an archaeon. Nat. Struct. Biol., 7, 903–909.
25 Quinlan,J.R. (1987) Simplifying decision trees. Int. J. Man–Machine Stud., 27, 221–234.
26 Quinlan,J.R. (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.
27 Dash,M. and Liu,H. (1997) Feature selection for classification. Intelligent Data Anal., 1, 131–156.
28 Goldberg,D.E. (1989) Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA.
29 Cortes,C. and Vapnik,V. (1995) Support-vector networks. Machine Learn., 20, 273–297.
30 Engelman,D.M., Steitz,T.A. and Goldman,A. (1986) Identifying nonpolar transbilayer helices in amino acid sequences of membrane proteins. Annu. Rev. Biophys. Biophys. Chem., 15, 321–353.
31 Quinlan,J.R. (1996) Bagging, boosting and C4.5. In Proceedings of the Thirteenth National Conference on Artificial Intelligence.
32 Nadkarni,P.M., Marenco,L., Chen,R., Skoufos,E., Shepherd,G. and Miller,P. (1999) Organization of heterogeneous scientific data using the EAV/CR representation. J. Am. Med. Inform. Assoc., 6, 478–493.
33 Gerstein,M., Lin,J. and Hegyi,H. (2000) Protein folds in the worm genome. Pac. Symp. Biocomput., 30–41.
34 Gerstein,M. and Hegyi,H. (1998) Comparing genomes in terms of protein structure: surveys of a finite parts list. FEMS Microbiol. Rev., 22, 277–304.
35 Gerstein,M. (1998) How representative are the known structures of the proteins in a complete genome? A comprehensive structural census. Fold Des., 3, 497–512.
36 Gerstein,M. (1997) A structural census of genomes: comparing bacterial, eukaryotic and archaeal genomes in terms of protein structure. J. Mol. Biol., 274, 562–576.
37 Wootton,J.C. and Federhen,S. (1996) Analysis of compositionally biased regions in sequence databases. Methods Enzymol., 266, 554–571.
38 Garnier,J., Osguthorpe,D.J. and Robson,B. (1978) Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J. Mol. Biol., 120, 97–120.