|
Vol. 13, Issue 4, 703-716, April 2003
1 Department of Genetics, Yale University, New Haven, Connecticut 06520, USA; 2 Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA; 3 Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot 76100, Israel; 4 Department of Statistics, Yale University, New Haven, Connecticut 06520, USA; 5 Department of Computer Science, Yale University, New Haven, Connecticut 06520, USA
![]() |
ABSTRACT |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Global analyses of RNA expression levels are useful for classifying genes and overall phenotypes. Often these classification problems are linked, and one wants to find "marker genes" that are differentially expressed in particular sets of "conditions." We have developed a method that simultaneously clusters genes and conditions, finding distinctive "checkerboard" patterns in matrices of gene expression data, if they exist. In a cancer context, these checkerboards correspond to genes that are markedly up- or downregulated in patients with particular types of tumors. Our method, spectral biclustering, is based on the observation that checkerboard structures in matrices of expression data can be found in eigenvectors corresponding to characteristic expression patterns across genes or conditions. In addition, these eigenvectors can be readily identified by commonly used linear algebra approaches, in particular the singular value decomposition (SVD), coupled with closely integrated normalization steps. We present a number of variants of the approach, depending on whether the normalization over genes and conditions is done independently or in a coupled fashion. We then apply spectral biclustering to a selection of publicly available cancer expression data sets, and examine the degree to which the approach is able to identify checkerboard structures. Furthermore, we compare the performance of our biclustering methods against a number of reasonable benchmarks (e.g., direct application of SVD or normalized cuts to raw data).
![]() |
INTRODUCTION |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Microarray Analysis to Classify Genes and Phenotypes
Microarray experiments for simultaneously
measuring RNA expression levels of thousands of genes are becoming
widely used in genomic research. They have enormous promise in such
areas as revealing function of genes in various cell populations, tumor classification, drug target identification, understanding cellular pathways, and prediction of outcome to therapy (Brown and Botstein 1999; Lockhart and Winzeler 2000
). A major application of microarray technology is gene expression profiling to predict outcome in multiple
tumor types (Golub et al. 1999
). In a bioinformatics context, we can
apply various data-mining methods to cancer datasets in order to
identify class distinction genes and to classify tumors. A partial list
of methods includes: (1) data preprocessing (background elimination,
identification of differentially expressed genes, and normalization);
(2) unsupervised clustering and visualization methods (hierarchical,
SOM, k-means, and SVD); (3) supervised machine learning methods for
classification based on prior knowledge (discriminant analysis,
support-vector machines, decision trees, neural networks, and k-nearest
neighbors); and (4) more ambitious genetic network models (requiring
large amounts of data) that are designed to discover biological
pathways using such approaches as pairwise interactions, continuous or
Boolean networks (based on a system of coupled differential equations),
and probabilistic graph modeling based on Bayesian networks (Tamayo et
al. 1999
; Brown et al. 2000
; Friedman et al. 2000
).
Our focus here is on unsupervised clustering methods. Unsupervised
techniques are useful when labels are unavailable. Examples include
attempts to identify (yet unknown) subclasses of tumors, or work on
identifying clusters of genes that are coregulated or share the same
function (Brown et al. 2000; Mateos et al. 2002
). Unsupervised methods
have been successful in separating certain types of tumors associated
with different types of leukemia and lymphoma (Golub et al. 1999
;
Alizadeh et al. 2000
; Klein et al. 2001
). However, unsupervised (and
even supervised) methods have had less success in partitioning the
samples according to tumor type or outcome in diseases with multiple
subclassifications (Pomeroy et al. 2002
; van't Veer et al. 2002
). In
addition, the methods we propose here are related to a method of
Dhillon (2001)
for coclustering of words and documents.
Checkerboard Structures of Genes and Conditions in Microarray Datasets
As a starting point in analyzing microarray cancer datasets, it is worthwhile to appreciate the assumed structure of these data (e.g., whether they can be organized in a checkerboard pattern), and to design a clustering algorithm that is suitable for this structure. In particular, in analyzing microarray cancer data sets we may wish to identify both clusters of genes that participate in common regulatory networks and clusters of experimental conditions associated with the effects of these genes, for example, clusters of cancer subtypes. In both cases we may want to use similarities between expression level patterns to determine clusters. Clearly, advance knowledge of clusters of genes can help in clustering experimental conditions, and vice versa. In the absence of knowledge of gene and condition classes, it would be useful to develop partitioning algorithms that find latent classes by exploiting relations between genes and conditions. Exploiting the underlying two-sided data structure could help the simultaneous clustering, leading to meaningful gene and experimental condition clusters.
The raw data in many cancer gene-expression datasets can be arranged in a matrix form as schematized in Figure 1. In this matrix, which we denote by A, the genes index rows i and the different conditions (e.g., different patients) index the columns j. Depending on the type of chip technology used, a value in this matrix Aij could either represent absolute expression levels (such as from Affymetrix GeneChips) or relative expression ratios (such as from cDNA microarrays). The methodology we will construct will apply equally well in both contexts. However, for clarity in what follows, we will assume that the values Aij in the matrix represent absolute levels and that all entries are non-negative; in our numerical analyses we removed genes that did not satisfy this criterion.
|
A specific assumption in tumor classification is that samples drawn
from a population containing several tumor types have similar
expression profiles if they belong to the same type. Observing several
experiments, each of which has multiple tumor types, suggests a
somewhat stronger assumption; for tumors of the same type there exist
subsets of overexpressed (or underexpressed) genes that are not
similarly overexpressed (or underexpressed) in another tumor type.
Under this assumption, the matrix A could be organized in a
checkerboard-like structure with blocks of high-expression levels and
low-expression levels, as shown in Figure 1. A block of high-expression
levels corresponds to a subset of genes (subset of rows) that are
highly expressed in all samples of a given tumor type (subset of
columns). One of the numerous examples supporting this picture is the
CNS embryonal tumors dataset (Pomeroy et al. 2002). However, this
simple checkerboard-like structure can be confounded by a number of
effects. In particular, different overall expression levels of genes
across all experimental conditions or of samples across all genes in
multiple tumor datasets can obscure the block structure. Consequently,
rescaling and normalizing both the gene and sample dimensions could
improve the clustering and reveal existing latent variables in both the
gene and tumor dimensions.
Uncovering Checkerboard Structures Through Solving an Eigenproblem
In this work, we attempt to simultaneously cluster genes and experimental conditions with similar expression profiles (i.e. to "bicluster" them), examining the extent to which we are able to automatically identify checkerboard structures in cancer datasets. Further, we integrate biclustering with careful normalization of the data matrix in a spectral framework model. This framework allows us to use standard linear algebra manipulations, and the resulting partitions are generated using the whole dataset in a global fashion. The normalization step, which eliminates effects such as differences in experimental conditions and basal expression levels of genes, is designed to accentuate biclusters if they exist.
Figure 1 illustrates the overall idea of our approach. It shows how
applying a checkerboard-structured matrix A to a step-like classification vector for genes (x) results in a step-like
classification vector on conditions (y). Reapplying the
transpose of the matrix AT to this condition
classification vectors results in a step-like gene classification
vector with the same step pattern as input vector x. This suggests that
one might be able to ascertain the checkerboard-like structure of A
through solving an eigenproblem involving AAT. More
precisely, it shows how the checkerboard pattern in a data matrix
A is reflected in the piecewise constant structures of some
pair of eigenvectors x and y that solve the coupled
eigenvalue problems AT
Ax = 2x and AAT
y =
2y (where x and
y have a common eigenvalue). This, in turn, is equivalent to
finding the singular value decomposition of A. Thus, the
simple operation of identifying whether there exists a pair of
piecewise constant eigenvectors allows us to determine whether the data
have a checkerboard pattern. Simple reshuffling of rows and columns
(according to the sorted order of these eigenvectors) then can make the
pattern evident. However, different average amounts of expression
associated with particular genes or conditions can obscure the
checkerboard pattern. This can be corrected by initially normalizing
the data matrix A. We propose a number of different schemes,
all built around the idea of putting the genes on the same scale so
that they have the same average level of expression across conditions,
and likewise for the conditions. A graphic overview of our method (in
application to real data) is shown in Figure 8, where one can see how
the data in matrix A are progressively transformed by normalization and
shuffling to bring out a checkerboard-like signal.
We note that our method implicitly exploits the effect of clustering of experimental conditions on clustering of the genes and vice versa, and it allows us to simultaneously identify and organize subsets of genes whose expression levels are correlated and subsets of conditions whose expression level profiles are correlated.
![]() |
METHODS |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Technical Background
Data normalization
Preprocessing of microarray data often has a critical impact on the analysis. Several preprocessing schemes have been proposed. For instance, Eisen et al. (1998)Singular Value Decomposition (SVD)
Principal component analysis (PCA; Pearson 1901Normalized Cuts Method
Spectral methods have been used in graph theory to design clustering algorithms. These algorithms were used in various fields (Shi and Malik 1997Previous Work on Biclustering
The idea of simultaneous clustering of rows and columns of a matrix goes back to (Hartigan 1972A Spectral Approach to Biclustering
Our aim is to have coclustering of genes and experimental
conditions in which genes are clustered together if they exhibit similar expression patterns across conditions and, likewise,
experimental conditions are clustered together if they include genes
that are expressed similarly. Interestingly, our model can be reduced
to the analysis of the same eigensystem derived in Dhillon's
formulation for the problem of coclustering of words and documents
(Dhillon 2001). To apply Dhillon's method to microarray data, one can
construct a bipartite graph, where one set of nodes in this graph
represents the genes, and the other represents experimental conditions.
An arc between a gene and condition represents the level of
overexpression (or underexpression) of this gene under this condition.
The bipartite approach is limited in that it can only divide the genes
and conditions into the same number of clusters. This is often
impractical. As described below, our formulation allows the number of
gene clusters to be different from the number of condition clusters.
In addition, Dhillon's optimal partitioning eigenvector has a hybrid structure containing both gene and condition entries, whereas in our approach we search for separate piecewise constant structure of the gene and corresponding sample eigenvectors. Examining Dhillon's and our partitioning approaches using data generated by the generating model discussed below shows the advantage of the latter.
Spectral Biclustering
We developed a method that simultaneously clusters genes and conditions. The method is based on the following two assumptions:1. | Two genes that are coregulated are expected to have correlated expression levels, which might be difficult to observe due to noise. We can obtain better estimates of the correlations between gene expression profiles by averaging over different conditions of the same type. |
2. | Likewise, the expression profiles for every two conditions of the same type are expected to be correlated, and this correlation can be better observed when averaged over sets of genes of similar expression profiles. |
Independent Rescaling of Genes and Conditions
We assume that the data matrix A represents an approximation of the product of these three factors, Eij,Simultaneous Normalization of Genes and Conditions
Because our spectral biclustering approach includes the normalization of rows and columns as an integral part of the algorithm, it is natural to attempt to simultaneously normalize both genes and conditions. As described below, this can be achieved by repeating the procedure described above for independent scaling of rows and columns iteratively until convergence. This process, which we call bistochastization, results in a rectangular matrix B that has a doubly stochastic-like structurePostprocessing the Eigenvectors to Find Partitions
Each of the above normalization approaches (independent scaling, bistochastization, or log interactions) gives rise, after the SVD, to a set of gene and condition eigenvectors (that in the context of microarray analysis are sometimes termed eigengenes and eigenarrays; Hastie et al. 1999Probabilistic Interpretation
In the biclustering approach, the normalization procedure, obtained by constraining the row sums to be equal to one constant and the column sums to be equal to another constant, is an integral part of the modeling that allows us to discern bidirectional structures. This normalization can be cast in probabilistic terms by imagining first choosing a random RNA transcript from all RNA in all samples (conditions), and then choosing one more RNA transcript randomly from the same sample. Here, when we speak of choosing "randomly" we mean that each possible RNA is equally likely to be chosen. Having chosen these two RNAs, we take note of which sample they come from and which genes they express. The matrix entry (R
![]() |
![]() |
![]() |
RESULTS |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Overall Format of the Results
We have performed a study in which we applied the above spectral
biclustering methods to five groups of cancer microarray data
setslymphoma (microarray and Affymetrix), leukemia, breast cancer,
and central nervous system embryonal tumors. As explained above, we
utilized SVD to find pairs of piecewise constant eigenvectors of genes
and conditions, that reflect the degree to which the data can be
rearranged in a checkerboard structure. Our methods employ specific
normalization schemes that highlight the similarity of both gene and
condition eigenvectors to piecewise constant vectors, and this
similarity, in turn, directly reflects the degree of biclustering. To
assess our procedure, it is useful to see how well it compares to
several benchmarks, with respect to achieving the goal of piecewise
constant eigenvectors.
Our main results are presented in Figures 3-7. These show consistently
formatted graphs of the projection of each dataset onto the best two
eigenvectors. Each figure is laid out in six panels, with the first two
panels associated with our biclustering methods and the next four
panels showing the benchmarks. In particular:
Panel a Bistochastization shows biclustering using the bistochastic normalization. | |
Panel b Biclustering shows standard biclustering with independent rescaling of rows and columns. | |
Panel c SVD shows SVD applied to the raw data matrix A. | |
Panel d Binormalization shows SVD applied to a
transformed matrix obtained by first rescaling its columns by
their means and then standardizing the rows of the rescaled matrix as
proposed in Getz et al. (2000)![]() |
|
Panel e Normalized cuts shows a normalized
cuts benchmark.Here we apply the normalized cuts algorithm
using an affinity matrix obtained from a distance matrix, which, in
turn, was derived by calculating the norms of the differences between
the standardized columns of A as proposed in Xing and Karp
(2001)![]() ![]() |
|
Panel f Log-interaction shows SVD applied to a matrix where the raw expression data is substituted by the matrix K described above. |
Overall, by comparing the six panels in each of the five
different figures, we see that in the bistochastization method (panel a) the distributions of the different samples have no or
minimal overlap between clusters as well as more tendency to result in more compact clusters. The biclustering method (panel b)
results in slightly less separable clusters, but it tends to separate the clusters along a single eigenvector. Straight SVD of the different raw data (panel c) underperforms in comparison to our spectral methods, as can be seen from the intermingled distributions of tumors
of different types or less distinct clusters. Performing instead SVD on
the log-interaction matrix of the raw expression data tends to produce
results that are similar to those obtained with bistochastization
(panel f). SVD of the column-rescaled row-standardized matrix
(Getz et al. 2000) and the normalized cut method result in better
partitioning than SVD of the raw data (panels d and e). However, in general, our spectral methods consistently
perform well.
In the following sections we discuss each of the five datasets in detail.
Lymphoma Microarray Dataset
We first applied the methods to publicly available lymphoma
microarray data: chronic lymphocytic leukemia (CLL), diffuse large B-cell lymphoma (DLCL), and follicular lymphoma (FL). The clustering results are shown in Figures 2 and
3. In both cases when we used the doubly
stochastic-like matrix B or the biclustering method (C1ATR
1A)
of the lymphoma dataset, we obtained the desired partitioning of
patients in the second largest eigenvectors. The sorted eigenvectors give not only a partition of patients, but also an internal ranking of
patients within a given disease. In addition, the outer product of the
gene and tumor (sorted) eigenvectors allows us to observe which genes
induce a partition of patients and vice versa. This can be seen in
Figure 2. Dividing the eigenvector that corresponds to the second
largest eigenvalue (in both methods) using the k-means algorithm (which
is equivalent to fitting a piecewise constant vector to each of the
eigenvectors) led to a clean partition between the DLCL patients and
the patients with other diseases. This is highlighted in the header of
Figure 2 and the x-axis of Figure 3a,b. The published analysis
did not cluster two of the DLCL cases correctly (Alizadeh et al. 2000
).
Further partitioning of the CLL and the FL patients is obtained by
using both the second- and third-largest eigenvectors. To divide the
data we applied a recursive, two-way clustering using the normalized
cuts algorithm to a two-column matrix composed of the 2nd and 3rd
eigenvectors of both matrices. (Performing a final clustering step to
the data projected to a small number of eigenvectors is a common
practice in spectral clustering.) Using the biclustering matrix with
independent row and column normalizations, the patients were correctly
divided, with the exception of two of the CLL patients, who were
clustered together with the FL patients. The best partition was
obtained using our doubly stochastic matrix that divided the patients
perfectly according to the three types of diseases.
|
|
Lymphoma Affymetrix Dataset
The above lymphoma data were generated by microarray technology that
provides relative measurements of expression data. We repeated the
lymphoma analysis using data from a study relating B-CLL to memory B
cells (Klein et al. 2001). These data were generated using Affymetrix
U95A gene chips, which presumably allow measurements proportional to
absolute mRNA levels. We selected samples taken from CLL, FL, and DLCL
patients, but in addition we also included samples from DLCL cell
lines. As can be seen in Figure 4a,b, the bistochastization method cleanly separates the four different sample
types, and the biclustering separates these samples except for one DLCL
sample that slightly overlaps with the FL distribution. We note that
the DLCL patient expression patterns are closer to those of the FL
patients than to the expression profiles of the DLCL cell lines (and
pg|g(DLCL|FL) > pg|g(DLCL|DLCL-cell
lines).
|
Leukemia Dataset
We applied our methods to public microarray data of acute leukemia
(B- and T-cell acute lymphocytic leukemia [ALL] and acute myelogenous
leukemia [AML]). The patient distributions of the different diseases
of the leukemia dataset become separated in the two-dimensional graphs
generated by projecting the patient expression profiles onto the 2nd
and 3rd gene class partition vectors of the biclustering method (Fig.
5b). The bistochastic method also
partitions the patients well, with only one ambiguous case that is
close to the boundary between ALL and AML (Fig. 5a). Application of
k-means to a matrix composed of the 2nd and 3rd biclustering
eigenvectors results in three misclassifications, which is a slight
improvement over the four misclassifications reported by Golub et al.
(1999). Further partitioning of the ALL cases is obtained by applying a
normalized cuts clustering method to the biclustering eigenvectors, and
produces a clear separation between T- and B-cell ALL. This is a slight
improvement over published results (two misclassifications; Golub et
al. 1999
; Getz et al. 2000
). Another advantage over their methods is
that biclustering does not require specification of the number of
desired clusters or lengthy searches for subsets of genes.
|
Dataset From Breast Cell Lines Transfected With the CSF1R Oncogene
In another microarray experiment study (Kluger et al. 2001), an
oncogene encoding a transmembrane tyrosine kinase receptor was mutated
at two different phosphorylation sites. Benign breast cells were
transfected with the wild-type oncogene, creating a phenotype that
invades and metastasizes. The benign cell line was then transfected
with the two mutated oncogenes, creating one phenotype that invades and
another one that metastasizes. RNA expression levels were measured
eight times for each phenotype. Transfection with a single oncogene is
expected to generate similar expression profiles, presumably because
only a few genes are biologically influenced. Therefore, it was
desirable to see whether profiles of the different phenotypes can be partitioned.
Figure 8 allows us to examine the extent to which the data can be
arranged in a checkerboard pattern. This is done by taking the outer
product of the cell type-sorted eigenvector that has the most
stepwise-like structure (and is associated with the first largest
singular value) with the corresponding gene-sorted eigenvector. Due to
noise in the data and similarity between the different samples, common
clustering techniques such as hierarchical, k-means, and medoids did
not succeed in cleanly partitioning the data, but the relevant
eigen-array obtained following bistochastization or log-interaction
normalization partitioned the samples perfectly. Expression levels of
the four cell lines were measured in two separate sets of four
measurements. We chose to measure the ratio of three of the cell lines:
benign (a), invasive (c), and metastatic (d)
with respect to the cell line that invades and metastasizes (b) in the first batch, and the corresponding ratios were
similarly derived for the second batch. In
Figure
8, the ratios from the first and second batches are denoted by
(a, c, d) and (A, C,
D), respectively. As can be seen, the simultaneous
normalization methods partition the data such that all the phenotypes
are separated into clustersthat is, "a"s were
clustered with "A"s in one group, "c"s with
"C"s in another group, and "d"s with
"D"s in yet another group, as expected. Further
exploration is required in order to relate those gene clusters to
biological pathways that are relevant to these conditions.
|
|
|
Central Nervous System Embryonal Tumor Dataset
Finally, we analyzed the recently published CNS embryonal tumor
dataset (Pomeroy et al. 2002): Pomeroy et al. partitioned these five
tumor types using standard principal component analysis, but did so
after employing a preselection of genes exhibiting variation across the
data set (see Fig. 1b in Pomeroy et al. 2002
). Using all
genes, we find that the bistochastization method, and to a lesser
degree the biclustering method, partitioned the medulloblastoma, malignant glioma, and normal cerebella tumors. As can be seen in Figure
7, the remaining rhabdoid tumors are more widely scattered in the
subspace obtained by projecting the tumors onto the 2nd-4th gene
partitioning eigenvectors of the biclustering and bistochastization methods. Nonetheless, the rhabdoid tumor distribution does not overlap
with the other tumor distributions if we use the bistochastization method. The primitive neuro-ectodermal tumors (PNETs) did not cluster
and were difficult to classify even using supervised methods.
![]() |
DISCUSSION |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Unsupervised clustering of genes and experimental conditions in
microarray data can potentially reveal genes that participate in
cellular mechanisms that are involved in various diseases. In this
paper we present a spectral biclustering method that utilizes the
information gained by clustering the conditions to facilitate the
clustering of genes, and vice versa. The method incorporates a closely
integrated normalization. It also naturally discards the irrelevant
constant background, such that no additional arguments are
needed to ignore the contribution associated with the largest eigenvalue, as advocated in Alter et al. (2000). In particular, our
method is designed to cluster populations of different tumors assuming
that each tumor type has a subset of marker genes that exhibit
overexpression and that typically are not overexpressed in other
tumors. The main underlying assumption is that we can simultaneously
obtain better tumor clusters and gene clusters by correlating genes
averaged over different samples of the same tumors. Likewise, the
correlation of two tumors is more apparent when averaged over sets of
genes of similar expression profiles. In situations where the number of
tumor types (the number of clusters of experimental conditions) happens
to be equal to the number of typical gene profiles (the number of gene
clusters), the biclustering algorithm is related to the modified
normalized cuts objective function introduced by Dhillon (2001)
. In
addition, in a situation where the data have approximately a
checkerboard structure with more than two clusters on each side, there
may be several eigenvectors indicating a partitioning. In this case we
may be able to determine the number of clusters by identifying all of
these eigenvectors, for example, using a pairwise measure such as
mutual entropy between all pairs of eigenvectors.
The methods presented in this paper, particularly those incorporating
simultaneous normalization of rows and columns, show consistent
advantage over SVD spectral analysis of the raw data, the logarithm of
the raw data, other forms of rescaling transformations of the raw data,
and the normalized cuts partitioning of the raw or rescaled data.
Nevertheless, our partitioning results are not perfect. Better results
may be obtained by employing a generative model that better suits the
data. It has been shown that removal of irrelevant genes that introduce
noise can further improve clustering (as in Xing and Karp 2001).
Furthermore, if partitioning in the gene dimension is sharper than
partitioning in the condition dimension or vice versa, we can organize
the conditions or genes of the blurrier dimension contiguously. Such
arrangements perhaps give one a sense of the progression of disease
states or relevance of a gene to a particular disease.
![]() |
ACKNOWLEDGMENTS |
---|
Y.K. is supported by the Cancer Bioinformatics Fellowship from the Anna Fuller Fund, and M.G. acknowledges support from Human Genome array: Technology for Functional Analysis (an NIH grant number P50 HG02357-01).
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
![]() |
FOOTNOTES |
---|
6 Corresponding author.
E-MAIL genomeresearch@bioinfo.mbb.yale.edu; FAX (360) 838-7861.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.648603.
![]() |
REFERENCES |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Received July 22, 2002; accepted in revised form January 28, 2003.
|