Genome Research -- Kluger et al. 13 (4): 703

METHODS
Spectral Biclustering of Microarray Data: Coclustering Genes and Conditions

Yuval Kluger,¹,² Ronen Basri,³ Joseph T. Chang,⁴ and Mark Gerstein²,⁵,⁶

¹ Department of Genetics, Yale University, New Haven, Connecticut 06520, USA; ² Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA; ³ Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot 76100, Israel; ⁴ Department of Statistics, Yale University, New Haven, Connecticut 06520, USA; ⁵ Department of Computer Science, Yale University, New Haven, Connecticut 06520, USA

	ABSTRACT

Global analyses of RNA expression levels are useful for classifying genes and overall phenotypes. Often these classification problems are linked, and one wants to find "marker genes" that are differentially expressed in particular sets of "conditions." We have developed a method that simultaneously clusters genes and conditions, finding distinctive "checkerboard" patterns in matrices of gene expression data, if they exist. In a cancer context, these checkerboards correspond to genes that are markedly up- or downregulated in patients with particular types of tumors. Our method, spectral biclustering, is based on the observation that checkerboard structures in matrices of expression data can be found in eigenvectors corresponding to characteristic expression patterns across genes or conditions. In addition, these eigenvectors can be readily identified by commonly used linear algebra approaches, in particular the singular value decomposition (SVD), coupled with closely integrated normalization steps. We present a number of variants of the approach, depending on whether the normalization over genes and conditions is done independently or in a coupled fashion. We then apply spectral biclustering to a selection of publicly available cancer expression data sets, and examine the degree to which the approach is able to identify checkerboard structures. Furthermore, we compare the performance of our biclustering methods against a number of reasonable benchmarks (e.g., direct application of SVD or normalized cuts to raw data).

	INTRODUCTION

Microarray Analysis to Classify Genes and Phenotypes

Microarray experiments for simultaneously measuring RNA expression levels of thousands of genes are becoming widely used in genomic research. They have enormous promise in such areas as revealing function of genes in various cell populations, tumor classification, drug target identification, understanding cellular pathways, and prediction of outcome to therapy (Brown and Botstein 1999; Lockhart and Winzeler 2000). A major application of microarray technology is gene expression profiling to predict outcome in multiple tumor types (Golub et al. 1999). In a bioinformatics context, we can apply various data-mining methods to cancer datasets in order to identify class distinction genes and to classify tumors. A partial list of methods includes: (1) data preprocessing (background elimination, identification of differentially expressed genes, and normalization); (2) unsupervised clustering and visualization methods (hierarchical, SOM, k-means, and SVD); (3) supervised machine learning methods for classification based on prior knowledge (discriminant analysis, support-vector machines, decision trees, neural networks, and k-nearest neighbors); and (4) more ambitious genetic network models (requiring large amounts of data) that are designed to discover biological pathways using such approaches as pairwise interactions, continuous or Boolean networks (based on a system of coupled differential equations), and probabilistic graph modeling based on Bayesian networks (Tamayo et al. 1999; Brown et al. 2000; Friedman et al. 2000).

Our focus here is on unsupervised clustering methods. Unsupervised techniques are useful when labels are unavailable. Examples include attempts to identify (yet unknown) subclasses of tumors, or work on identifying clusters of genes that are coregulated or share the same function (Brown et al. 2000; Mateos et al. 2002). Unsupervised methods have been successful in separating certain types of tumors associated with different types of leukemia and lymphoma (Golub et al. 1999; Alizadeh et al. 2000; Klein et al. 2001). However, unsupervised (and even supervised) methods have had less success in partitioning the samples according to tumor type or outcome in diseases with multiple subclassifications (Pomeroy et al. 2002; van't Veer et al. 2002). In addition, the methods we propose here are related to a method of Dhillon (2001) for coclustering of words and documents.

Checkerboard Structures of Genes and Conditions in Microarray Datasets

As a starting point in analyzing microarray cancer datasets, it is worthwhile to appreciate the assumed structure of these data (e.g., whether they can be organized in a checkerboard pattern), and to design a clustering algorithm that is suitable for this structure. In particular, in analyzing microarray cancer datasets we may wish to identify both clusters of genes that participate in common regulatory networks and clusters of experimental conditions associated with the effects of these genes, for example, clusters of cancer subtypes. In both cases we may want to use similarities between expression level patterns to determine clusters. Clearly, advance knowledge of clusters of genes can help in clustering experimental conditions, and vice versa. In the absence of knowledge of gene and condition classes, it would be useful to develop partitioning algorithms that find latent classes by exploiting relations between genes and conditions. Exploiting the underlying two-sided data structure could help the simultaneous clustering, leading to meaningful gene and experimental condition clusters.

The raw data in many cancer gene-expression datasets can be arranged in a matrix form as schematized in Figure 1. In this matrix, which we denote by A, the genes index rows i and the different conditions (e.g., different patients) index the columns j. Depending on the type of chip technology used, a value in this matrix A_ij could either represent absolute expression levels (such as from Affymetrix GeneChips) or relative expression ratios (such as from cDNA microarrays). The methodology we will construct will apply equally well in both contexts. However, for clarity in what follows, we will assume that the values A_ij in the matrix represent absolute levels and that all entries are non-negative; in our numerical analyses we removed genes that did not satisfy this criterion.

Figure 1. Overview of important parts of the biclustering process. (A) shows the problem: shuffling a gene expression matrix to reveal a checkerboard pattern associating genes with conditions. (B) shows how this problem can be approached through solving an "eigenproblem." If a gene expression matrix A has a checkerboard structure, applying it to a step-like condition classification vector x will result in a step-like gene classification vector y. Moreover, if one then applies $A^T$ to y, one will regenerate a step-like condition classification vector with the same partitioning structure as x. This suggests one can determine whether A has a checkerboard structure through solving an eigenvalue problem. In other words, if A has a (hidden) checkerboard structure, there exist some piecewise constant partition vectors $x = v_*$ and $y = u_*$ such that $A^T A v_* = \lambda^2 v_*$ and $A A^T u_* = \lambda^2 u_*$ (bottom quadrant of part B). Note that most eigenvectors v of the eigenvalue problem $A^T A v = \lambda^2 v$ (symbolized by a zigzag structure) are not embedded in the subspace of classification (step-like) vectors x possessing the same partitioning structure, as indicated by a gray arrow protruding from this subspace (parallelogram). On the other hand, piecewise constant (step-like) partition eigenvectors $v_*$ are embedded in this subspace and are indicated by a green arrow. To reveal whether the data have a checkerboard structure, one can inspect whether some of the pairs of monotonically sorted gene and tumor eigenvectors $v_i$ and $u_i$ have an approximate stepwise (piecewise) constant structure. The outer product $u_* v_*^T$ of the sorted partitioning eigenvectors gives a checkerboard structure. (C) shows how rescaling of matrix A can lead to improved copartitioning of genes and conditions.

A specific assumption in tumor classification is that samples drawn from a population containing several tumor types have similar expression profiles if they belong to the same type. Observing several experiments, each of which has multiple tumor types, suggests a somewhat stronger assumption: for tumors of the same type there exist subsets of overexpressed (or underexpressed) genes that are not similarly overexpressed (or underexpressed) in another tumor type. Under this assumption, the matrix A could be organized in a checkerboard-like structure with blocks of high-expression levels and low-expression levels, as shown in Figure 1. A block of high-expression levels corresponds to a subset of genes (subset of rows) that are highly expressed in all samples of a given tumor type (subset of columns). One of the numerous examples supporting this picture is the CNS embryonal tumors dataset (Pomeroy et al. 2002). However, this simple checkerboard-like structure can be confounded by a number of effects. In particular, different overall expression levels of genes across all experimental conditions or of samples across all genes in multiple tumor datasets can obscure the block structure. Consequently, rescaling and normalizing both the gene and sample dimensions could improve the clustering and reveal existing latent variables in both the gene and tumor dimensions.

Uncovering Checkerboard Structures Through Solving an Eigenproblem

In this work, we attempt to simultaneously cluster genes and experimental conditions with similar expression profiles (i.e., to "bicluster" them), examining the extent to which we are able to automatically identify checkerboard structures in cancer datasets. Further, we integrate biclustering with careful normalization of the data matrix in a spectral framework model. This framework allows us to use standard linear algebra manipulations, and the resulting partitions are generated using the whole dataset in a global fashion. The normalization step, which eliminates effects such as differences in experimental conditions and basal expression levels of genes, is designed to accentuate biclusters if they exist.

Figure 1 illustrates the overall idea of our approach. It shows how applying a checkerboard-structured matrix A to a step-like classification vector for conditions (x) results in a step-like classification vector on genes (y). Reapplying the transpose of the matrix, $A^T$, to this gene classification vector results in a step-like condition classification vector with the same step pattern as input vector x. This suggests that one might be able to ascertain the checkerboard-like structure of A through solving an eigenproblem involving $A^T A$. More precisely, it shows how the checkerboard pattern in a data matrix A is reflected in the piecewise constant structures of some pair of eigenvectors x and y that solve the coupled eigenvalue problems $A^T A x = \lambda^2 x$ and $A A^T y = \lambda^2 y$ (where x and y have a common eigenvalue). This, in turn, is equivalent to finding the singular value decomposition of A. Thus, the simple operation of identifying whether there exists a pair of piecewise constant eigenvectors allows us to determine whether the data have a checkerboard pattern. Simple reshuffling of rows and columns (according to the sorted order of these eigenvectors) then can make the pattern evident. However, different average amounts of expression associated with particular genes or conditions can obscure the checkerboard pattern. This can be corrected by initially normalizing the data matrix A. We propose a number of different schemes, all built around the idea of putting the genes on the same scale so that they have the same average level of expression across conditions, and likewise for the conditions. A graphic overview of our method (in application to real data) is shown in Figure 8, where one can see how the data in matrix A are progressively transformed by normalization and shuffling to bring out a checkerboard-like signal.
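The round trip described in the Figure 1 caption can be checked numerically. The following is a minimal sketch, not part of the original analysis: the block levels, the class sizes, and the helper `is_piecewise_constant` are illustrative assumptions. Following the caption's convention, x is a condition classification vector and y is a gene classification vector.

```python
import numpy as np

# Hypothetical noise-free checkerboard: 6 genes (rows) x 5 conditions (columns).
gene_labels = np.array([0, 0, 0, 1, 1, 1])
cond_labels = np.array([0, 0, 1, 1, 1])
block_levels = np.array([[5.0, 1.0],    # gene class 0 under condition classes 0, 1
                         [2.0, 4.0]])   # gene class 1 under condition classes 0, 1
A = block_levels[gene_labels][:, cond_labels]

# Step-like condition classification vector x (constant within each condition class).
x = np.where(cond_labels == 0, 1.0, -1.0)

y = A @ x          # step-like vector over genes
x_back = A.T @ y   # step-like vector over conditions again

def is_piecewise_constant(vec, labels):
    """True if vec takes a single value within every class of labels."""
    return all(np.allclose(vec[labels == k], vec[labels == k][0])
               for k in np.unique(labels))

print(y)       # one value per gene class
print(x_back)  # one value per condition class
```

With noisy data or unequal gene and condition scales the steps are only approximate, which is the motivation for the normalization schemes in the Methods.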

We note that our method implicitly exploits the effect of clustering of experimental conditions on clustering of the genes and vice versa, and it allows us to simultaneously identify and organize subsets of genes whose expression levels are correlated and subsets of conditions whose expression level profiles are correlated.

	METHODS

Technical Background

Data normalization

Preprocessing of microarray data often has a critical impact on the analysis. Several preprocessing schemes have been proposed. For instance, Eisen et al. (1998) prescribe the following series of operations: Take the log of the expression data, perform 5-10 cycles of subtracting either the mean or the median of the rows (genes) and columns (conditions), and then do 5-10 cycles of row-column normalization. In a similar fashion, Getz et al. (2000) first rescale the columns by their means and then standardize the rows of the rescaled matrix. The motivation is to remove systematic biases in expression ratios or absolute values that are the result of differences in RNA quantities, labeling efficiency and image acquisition parameters, as well as adjusting gene levels relative to their average behavior. Different normalization prescriptions could lead to different partitions of the data. Choice of a normalization scheme that is designed to emphasize underlying data structures or is rigorously guided by statistical principles is desirable for establishing standards and for improving reproducibility of results from microarray experiments.

Singular Value Decomposition (SVD)

Principal component analysis (PCA; Pearson 1901) is widely used to project multidimensional data to a lower dimension. PCA determines whether we can comprehensively present multidimensional data in d dimensions by inspecting whether d linear combinations of the variables capture most of the data variability. The principal components can be derived by using singular value decomposition, or "SVD" (Golub and Van Loan 1983), a standard linear algebra technique that expresses a real n × m matrix A as a product $A = U \Lambda V^T$, where $\Lambda$ is a diagonal matrix with decreasing non-negative entries, and U and V are n × min(n,m) and m × min(n,m) orthonormal column matrices. The columns of the matrices U and V are eigenvectors of the matrices $AA^T$ and $A^T A$, respectively, and the nonvanishing entries $\lambda_1 \geq \lambda_2 \geq \cdots > 0$ in the matrix $\Lambda$ are square roots of the non-zero eigenvalues of $AA^T$ (and also of $A^T A$). Below we will denote the ith columns of the matrices U and V by $u_i$ and $v_i$, respectively. The vectors $u_i$ and $v_i$ are called the singular vectors of A, and the values $\lambda_i$ are called the singular values. The SVD has been applied to microarray experiment analysis in order to find underlying temporal and tumor patterns (Alter et al. 2000; Holter et al. 2000; Raychaudhuri et al. 2000; Lian et al. 2001).
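The relations between the SVD and the two eigenproblems can be verified with NumPy. This is a small sketch on a random matrix (the matrix size and seed are arbitrary assumptions); `numpy.linalg.svd` returns the singular values in decreasing order and, with `full_matrices=False`, the n × min(n,m) and m × min(n,m) factors described above.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))            # synthetic stand-in: n = 6 genes, m = 4 conditions

# Thin SVD: U is 6x4, s holds the singular values, Vt is V^T (4x4).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

recon = U @ np.diag(s) @ Vt       # A = U Lambda V^T

# Eigenvalues of A^T A, sorted in decreasing order, equal the squared singular values.
eig_AtA = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]

# The first right singular vector v_1 is an eigenvector of A^T A,
# and the first left singular vector u_1 is an eigenvector of A A^T.
v1, u1 = Vt[0], U[:, 0]
resid_v = A.T @ A @ v1 - s[0] ** 2 * v1
resid_u = A @ A.T @ u1 - s[0] ** 2 * u1
```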

Normalized Cuts Method

Spectral methods have been used in graph theory to design clustering algorithms. These algorithms were used in various fields (Shi and Malik 1997), including for microarray data partitioning (Xing and Karp 2001). A commonly used variant is called the normalized cuts algorithm. In this approach the items (nodes) to be clustered are represented as the vertex set V. The degree of similarity (affinity) between each two nodes is represented by a weight matrix $w_{ij}$. For example, the affinity between two genes may be defined based on the correlation between their expression profiles over all experiments. The vertex set V together with the edges $e_{ij} \in E$ and their corresponding weights $w_{ij}$ define a complete graph G(V,E) that we want to segment. Clustering is achieved by solving an eigensystem that involves the affinity matrix. These methods were applied in the field of image processing, and have demonstrated good performance in problems such as image segmentation. Nevertheless, spectral methods in the context of clustering are not well understood (Weiss 1999). We note that the singular values of the original dataset represented in the matrix A are related to the eigenvalues or generalized eigenvalues of the affinity matrices $AA^T$ and $A^T A$. These matrices represent similarities between genes and similarities between conditions, respectively.
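For orientation, a two-way normalized cut can be sketched as follows. This uses the common formulation of Shi and Malik, in which the partition is read off the eigenvector with the second smallest eigenvalue of the generalized problem $(D - W)y = \lambda D y$, where D is the diagonal matrix of node degrees; the text above does not spell out the eigensystem, so this particular formulation, the toy affinity matrix, and the sign threshold are all assumptions of the sketch.

```python
import numpy as np

# Toy affinity matrix: two groups of three nodes, strong affinity within a
# group and weak affinity between groups (illustrative values only).
W = np.full((6, 6), 0.05)
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)

d = W.sum(axis=1)                         # node degrees
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))

# Symmetric form D^{-1/2} (D - W) D^{-1/2}; its eigenvectors z map to the
# generalized eigenvectors through y = D^{-1/2} z.
L_sym = D_inv_sqrt @ (np.diag(d) - W) @ D_inv_sqrt
eigvals, eigvecs = np.linalg.eigh(L_sym)  # eigenvalues in ascending order

y = D_inv_sqrt @ eigvecs[:, 1]            # second smallest eigenvalue
labels = (y > 0).astype(int)              # split the nodes by sign
```

The overall sign of an eigenvector is arbitrary, so only the grouping in `labels` is meaningful, not which group is called 0 or 1.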

Previous Work on Biclustering

The idea of simultaneous clustering of rows and columns of a matrix goes back to Hartigan (1972). Methods for simultaneous clustering of genes and conditions were more recently proposed (Cheng and Church 2000; Getz et al. 2000; Lazzeroni and Owen 2002). The goal was to find homogeneous submatrices or stable clusters that are relevant for biological processes. These methods apply greedy iterative search to find interesting patterns in the matrices, an approach that is also common in one-sided clustering (Hastie et al. 2000; Stolovitzky et al. 2000). In contrast, our approach is more "global," finding biclusters using all columns and rows.

Another statistically motivated biclustering approach has been tested for collaborative filtering of nonbiological data (Ungar and Foster 1998; Hofmann and Puzicha 1999). In this approach, probabilistic models were proposed in which matrix rows (genes in our case) and columns (experimental conditions) are each divided into clusters, and there are link probabilities between these clusters. These link probabilities can describe the association between a gene cluster and an experimental condition cluster, and can be found by using iterative Gibbs sampling and approximated Expectation Maximization algorithms (Ungar and Foster 1998; Hofmann and Puzicha 1999).

A Spectral Approach to Biclustering

Our aim is to have coclustering of genes and experimental conditions in which genes are clustered together if they exhibit similar expression patterns across conditions and, likewise, experimental conditions are clustered together if they include genes that are expressed similarly. Interestingly, our model can be reduced to the analysis of the same eigensystem derived in Dhillon's formulation for the problem of coclustering of words and documents (Dhillon 2001). To apply Dhillon's method to microarray data, one can construct a bipartite graph, where one set of nodes in this graph represents the genes, and the other represents experimental conditions. An arc between a gene and condition represents the level of overexpression (or underexpression) of this gene under this condition. The bipartite approach is limited in that it can only divide the genes and conditions into the same number of clusters. This is often impractical. As described below, our formulation allows the number of gene clusters to be different from the number of condition clusters.

In addition, Dhillon's optimal partitioning eigenvector has a hybrid structure containing both gene and condition entries, whereas in our approach we search for separate piecewise constant structure of the gene and corresponding sample eigenvectors. Examining Dhillon's and our partitioning approaches using data generated by the generating model discussed below shows the advantage of the latter.

Spectral Biclustering

We developed a method that simultaneously clusters genes and conditions. The method is based on the following two assumptions:

1.	Two genes that are coregulated are expected to have correlated expression levels, which might be difficult to observe due to noise. We can obtain better estimates of the correlations between gene expression profiles by averaging over different conditions of the same type.
2.	Likewise, the expression profiles for every two conditions of the same type are expected to be correlated, and this correlation can be better observed when averaged over sets of genes of similar expression profiles.

These assumptions are supported by simple analyses of a variety of typical microarray sets. For example, Pomeroy et al. (2002) presented a dataset on five types of brain tumors, and then used a supervised learning procedure to select genes that were highly correlated with class distinction. They based this work on the absolute expression levels of genes in 42 samples taken from these five types of tumors. Using these data, we measured the correlation between the expression levels of genes that are highly expressed in only one type of tumor, and found only moderate levels of correlation. However, if we instead average the expression levels of each gene over all samples of the same tumor type (obtaining vectors with five entries representing the averages of the five types of tumors), the partition of the genes based on correlation between the five-dimensional vectors is more apparent.

This dataset well fits the specifications of our approach, which is geared to finding a "checkerboard-like structure," indicating that for each type of tumor there may be few characteristic subsets of genes that are either upregulated or downregulated. To understand our method (Fig. 1), consider a situation in which an underlying class structure of genes and of experimental conditions exists. We model the data as a composition of blocks, each of which represents a gene-type-condition-type pairing, but the block structure is not immediately evident. Mathematically, the expression level of a specific gene i under a certain experimental condition j can be expressed as a product of three independent factors. The first factor, which we called the hidden base expression level, is denoted by $E_{ij}$. We assume that the entries of E within each block are constant. The second factor, denoted $\rho_i$, represents the tendency of gene i to be expressed under all experimental conditions. The last factor, denoted $\chi_j$, represents the overall tendency of genes to be expressed under condition j. We assume the microarray expression data to be a noisy version of the product of these three factors.
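The three-factor model can be simulated directly, which is convenient for testing the normalizations that follow. This is a sketch under stated assumptions: the class sizes, block levels, the uniform ranges for $\rho_i$ and $\chi_j$, and in particular the multiplicative log-normal noise are illustrative choices, since the text does not specify a noise distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical class structure: 3 gene classes (n = 12), 2 condition classes (m = 10).
gene_labels = np.repeat([0, 1, 2], [4, 3, 5])
cond_labels = np.repeat([0, 1], [6, 4])

# Hidden base expression level E: constant within each block.
E_blocks = np.array([[4.0, 1.0],
                     [1.0, 3.0],
                     [2.0, 2.5]])
E = E_blocks[gene_labels][:, cond_labels]

rho = rng.uniform(0.5, 2.0, size=E.shape[0])   # per-gene tendency rho_i
chi = rng.uniform(0.5, 2.0, size=E.shape[1])   # per-condition tendency chi_j

# Assumption: multiplicative log-normal noise, which keeps all entries positive.
noise = rng.lognormal(mean=0.0, sigma=0.1, size=E.shape)

A = E * rho[:, None] * chi[None, :] * noise    # A_ij ~ E_ij * rho_i * chi_j
```

Shuffling the rows and columns of `A` with a random permutation then gives a matrix in which the block structure is hidden both by ordering and by the gene and condition scale factors.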

Independent Rescaling of Genes and Conditions

We assume that the data matrix A represents an approximation of the product of these three factors, $E_{ij}$, $\rho_i$, and $\chi_j$. Our objective in the simultaneous clustering of genes and conditions is, given A, to find the underlying block structure of E. Consider two genes, i and k, which belong to a subset of similar genes. On average, according to this model, their expression levels under each condition should be related by a factor of $\rho_i/\rho_k$. Therefore, if we normalize the two rows, i and k, in A, then on average they should be identical. The similarity between the expression levels of the two genes should be more noticeable if we take the mean of expression levels with respect to all conditions of the same type. This will lead to an eigenvalue problem, as is shown next. Let R denote a diagonal matrix whose elements $r_i$ (where $i = 1, \ldots, n$) represent the row sums of A [$R = \mathrm{diag}(A \cdot 1_m)$, where $1_m$ denotes the m-vector $(1, \ldots, 1)$]. Let $u = (u^1, u^2, \ldots, u^m)$ denote a "classification vector" of experimental conditions, so that u is constant over all conditions of the same type. For instance, if there are two types of conditions, then $u^j = \alpha$ for each condition j of the first type and $u^j = \beta$ for each condition j of the second type. In other words, if we reorder the conditions such that all conditions of the first type appear first, then $u = (\alpha, \ldots, \alpha, \beta, \ldots, \beta)$. Then, $v = R^{-1}Au$ is an estimate of a "gene classification vector," that is, a vector whose entries are constant for all genes of the same type (e.g., if there are two types of genes, then $v_i = \gamma$ for each gene i of the first type and $v_i = \delta$ for each gene i of the second type). By multiplying by $R^{-1}$ from the left, we normalize the rows of A, and by applying this normalized matrix to u, we obtain a weighted sum of estimates of the mean expression level of every gene i under every type of experimental condition. When a hidden block structure exists, for every pair of genes of the same type these linear combinations are estimates of the same value.
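The effect of the row normalization can be checked on noise-free data generated from the three-factor model. In this sketch (class sizes, block levels, and the ranges of $\rho_i$ and $\chi_j$ are illustrative assumptions), the vector $R^{-1}Au$ comes out exactly constant within each gene class even though the genes have different scale factors $\rho_i$.

```python
import numpy as np

rng = np.random.default_rng(1)

gene_labels = np.repeat([0, 1], [3, 4])        # 7 genes in 2 classes
cond_labels = np.repeat([0, 1], [2, 3])        # 5 conditions in 2 classes
E = np.array([[3.0, 1.0],
              [1.0, 2.0]])[gene_labels][:, cond_labels]
rho = rng.uniform(0.5, 2.0, size=7)
chi = rng.uniform(0.5, 2.0, size=5)
A = E * rho[:, None] * chi[None, :]            # noise-free model data

r = A.sum(axis=1)                              # row sums, the diagonal of R

# Condition classification vector u: alpha = 1 on class 0, beta = -1 on class 1.
u = np.where(cond_labels == 0, 1.0, -1.0)

v = (A @ u) / r                                # v = R^{-1} A u
print(v)   # one value for genes of class 0, another for genes of class 1
```

The factor $\rho_i$ cancels between `A @ u` and the row sum `r`, which is why the result depends on a gene only through its class. With noise, the entries of `v` scatter around the two class values.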

The same reasoning applies to the columns. If we now apply $C^{-1}A^T$ to v, where C is the diagonal matrix whose components are the column sums of A [$C = \mathrm{diag}(1_n^T \cdot A)$], we obtain for each experimental condition j a weighted sum of estimates of the mean expression level of genes of the same type. Consequently, the result of applying the matrix $C^{-1}A^T R^{-1}A$ to a condition classification vector, u, should also be a condition classification vector. We will denote this matrix by $M_1$. $M_1$ has a number of characteristics: it is positive semidefinite, it has only real non-negative eigenvalues, and its dominant eigenvector is $(1/\sqrt{m})\,1_m$ with eigenvalue 1. Moreover, assuming E has linearly independent blocks, its rank is at least $\min(n_r, n_c)$, where $n_r$ denotes the number of gene classes and $n_c$ denotes the number of experimental condition classes. (In general the rank would be higher due to noise.) Note that for data with $n_c$ classes of experimental conditions, the set of all classification vectors spans a linear subspace of dimension $n_c$. (This is because a classification vector may have a different constant value for each of the $n_c$ types of experimental conditions.) Therefore, there exists at least one vector that satisfies $M_1 u = \lambda u$. (In fact, there are exactly $\min(n_r, n_c)$ such vectors.) One of these eigenvectors is the trivial vector $(1/\sqrt{m})\,1_m$. Similarly, there exists at least one gene classification vector that satisfies $M_2 v = \lambda v$, with $M_2 = R^{-1}AC^{-1}A^T$. (Note that $M_1$ and $M_2$ have the same sets of eigenvalues such that if $M_1 u = \lambda u$ then $M_2 v = \lambda v$ with $v = R^{-1}Au$.) These classification vectors can be estimated by solving the two eigensystems above. A roughly piecewise constant structure in the eigenvectors indicates the clusters of both genes and conditions in the data.

These two eigenvalue problems can be solved through a standard SVD of the rescaled matrix $\hat{A} \equiv R^{-1/2} A C^{-1/2}$, realizing that the equation $\hat{A}^T \hat{A} w \equiv C^{-1/2} A^T R^{-1} A C^{-1/2} w = \lambda w$ that is used to find the singular values of $\hat{A}$ is equivalent to the above eigenvalue problem $C^{-1} A^T R^{-1} A u = \lambda u$ with $u \equiv C^{-1/2} w$ (and similarly $\hat{A} \hat{A}^T z \equiv R^{-1/2} A C^{-1} A^T R^{-1/2} z = \lambda z$ implies $v \equiv R^{-1/2} z$). The outer product $1_n 1_m^T$, which is a matrix containing only entries of one, is the contribution of the first singular value to the rescaled matrix $\hat{A}$. Thus, the first eigenvalue contributes a constant background to both the gene and the experimental condition dimensions, and therefore its effect should be eliminated. Note that although our method is defined through a product of A and $A^T$, it does not imply that we multiply the noise, as is evident from the SVD interpretation.
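The equivalence between the SVD of the rescaled matrix and the eigenproblem for $M_1$ can be confirmed numerically. The sketch below uses a random positive matrix as a stand-in for expression data (size, value range, and seed are assumptions) and maps the singular vectors back through $u = C^{-1/2}w$ and $v = R^{-1/2}z$.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.uniform(0.5, 2.0, size=(8, 5))     # positive stand-in for expression data

r = A.sum(axis=1)                          # row sums (diagonal of R)
c = A.sum(axis=0)                          # column sums (diagonal of C)

# Rescaled matrix A_hat = R^{-1/2} A C^{-1/2}.
A_hat = A / np.sqrt(r)[:, None] / np.sqrt(c)[None, :]
Z, s, Wt = np.linalg.svd(A_hat, full_matrices=False)

# Map singular vectors back to the eigenvectors of M1 and M2.
U_cond = Wt.T / np.sqrt(c)[:, None]        # columns u = C^{-1/2} w
V_gene = Z / np.sqrt(r)[:, None]           # columns v = R^{-1/2} z

M1 = (A.T / c[:, None]) @ (A / r[:, None])   # C^{-1} A^T R^{-1} A
M2 = (A / r[:, None]) @ (A.T / c[:, None])   # R^{-1} A C^{-1} A^T

# The leading pair is trivial: singular value 1 and constant vectors.
# The second pair carries the first nontrivial partitioning information.
resid1 = M1 @ U_cond[:, 1] - s[1] ** 2 * U_cond[:, 1]
resid2 = M2 @ V_gene[:, 1] - s[1] ** 2 * V_gene[:, 1]
```

Here the eigenvalue $\lambda$ of $M_1$ and $M_2$ corresponds to the squared singular value of $\hat{A}$, which is why `s[1] ** 2` appears in the residuals.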

Simultaneous Normalization of Genes and Conditions

Because our spectral biclustering approach includes the normalization of rows and columns as an integral part of the algorithm, it is natural to attempt to simultaneously normalize both genes and conditions. As described below, this can be achieved by repeating the procedure described above for independent scaling of rows and columns iteratively until convergence.

This process, which we call bistochastization, results in a rectangular matrix B that has a doubly stochastic-like structure: all rows sum to a constant and all columns sum to a different constant. According to Sinkhorn's theorem, B can then be written as a product $B = D_1 A D_2$, where $D_1$ and $D_2$ are diagonal matrices (Bapat and Raghavan 1997). Such a matrix B exists under quite general conditions on A; for example, it is sufficient for all of the entries in A to be positive. In general, B can be computed by repeated normalization of rows and columns (with the normalizing matrices as $R^{-1}$ and $C^{-1}$ or $R^{-1/2}$ and $C^{-1/2}$). $D_1$ and $D_2$ then will represent the product of all these normalizations. Fast methods to find $D_1$ and $D_2$ include the deviation reduction and balancing algorithms (Bapat and Raghavan 1997). Once $D_1$ and $D_2$ are found, we apply SVD to B with no further normalization to reveal a block structure.
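A minimal sketch of bistochastization by repeated row and column normalization follows. It is the plain alternating iteration, not the faster deviation reduction or balancing algorithms mentioned above; the function name `bistochastize`, the target sums (rows to 1, columns to n/m), the tolerance, and the iteration cap are assumptions of this sketch, and the input is assumed to have strictly positive entries.

```python
import numpy as np

def bistochastize(A, max_iter=1000, tol=1e-10):
    """Alternately rescale rows and columns of a positive matrix until
    every row sums to 1 and every column sums to n/m."""
    B = np.asarray(A, dtype=float).copy()
    n, m = B.shape
    for _ in range(max_iter):
        B = B / B.sum(axis=1, keepdims=True)             # rows sum to 1
        B = B / B.sum(axis=0, keepdims=True) * (n / m)   # columns sum to n/m
        # Column sums are exact at this point; stop once row sums are too.
        if np.allclose(B.sum(axis=1), 1.0, rtol=0.0, atol=tol):
            break
    return B

rng = np.random.default_rng(3)
A = rng.uniform(0.5, 2.0, size=(6, 4))
B = bistochastize(A)

# B = D1 A D2 with diagonal D1, D2 means the elementwise ratio B / A
# is an outer product of the two diagonals.
ratio = B / A
```

Because every step only multiplies rows or columns by positive constants, the accumulated row and column factors play the role of $D_1$ and $D_2$.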

We have also investigated an alternative to bistochastization that we call the log-interactions normalization. A common and useful practice in microarray analysis is transforming the data by taking logarithms. The resulting transformed data typically have better distributional properties than the data on the original scale: distributions are closer to Normal, scatterplots are more informative, and so forth. The log-interactions normalization method begins by calculating the logarithm $L_{ij} = \log(A_{ij})$ of the given expression data and then extracting the interactions between the genes and the conditions, where the term "interaction" is used as in the analysis of variance (ANOVA).

As above, the log-interactions normalization is motivated by the idea that two genes whose expression profiles differ only by a multiplicative constant of proportionality are really behaving in the same way, and we would like these genes to cluster together. In other words, after taking logs, we would like to consider two genes whose expression profiles differ by an additive constant to be equivalent. This suggests subtracting a constant from each row so that the row means each become 0, in which case the expression profiles of two genes that we would like to consider equivalent actually become the same. Likewise, the same idea holds for the conditions (columns of the matrix). Constant differences in the log expression profiles between two conditions are considered unimportant, and we subtract a constant from each column so that the column means become 0. It turns out that these adjustments to the rows and columns of the matrix to achieve row and column means of zero can all be done simultaneously by a simple formula. Defining $L_{i\cdot} = (1/m)\sum_{j=1}^{m} L_{ij}$ to be the average of the ith row, $L_{\cdot j} = (1/n)\sum_{i=1}^{n} L_{ij}$ to be the average of the jth column, and $L_{\cdot\cdot} = (1/mn)\sum_{i=1}^{n}\sum_{j=1}^{m} L_{ij}$ to be the average of the whole matrix, the result of these adjustments is a matrix of interactions $K = (K_{ij})$, calculated by the formula $K_{ij} = L_{ij} - L_{i\cdot} - L_{\cdot j} + L_{\cdot\cdot}$. This formula is familiar from the study of two-way ANOVA, from which the terminology of "interactions" is adopted. The interaction $K_{ij}$ between gene i and condition j captures the extra (log) expression of gene i in condition j that is not explained simply by an overall difference between gene i and other genes or between condition j and other conditions, but rather is special to the combination of gene i with condition j. Again, as described before, we apply the SVD to the matrix K to reveal block structure in the interactions.
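The interaction formula translates into a few lines of NumPy. In this sketch the function name `log_interactions` and the test data are assumptions; the computation itself is the double-centering $K_{ij} = L_{ij} - L_{i\cdot} - L_{\cdot j} + L_{\cdot\cdot}$ given above, and it requires strictly positive entries so that the logarithm is defined.

```python
import numpy as np

def log_interactions(A):
    """Return K_ij = L_ij - L_i. - L_.j + L_.. for L = log(A), A > 0."""
    L = np.log(A)
    row_mean = L.mean(axis=1, keepdims=True)   # L_i. : mean of each row
    col_mean = L.mean(axis=0, keepdims=True)   # L_.j : mean of each column
    return L - row_mean - col_mean + L.mean()  # L.mean() is L_..

rng = np.random.default_rng(4)
A = rng.uniform(0.5, 2.0, size=(5, 4))
K = log_interactions(A)

# Rescaling genes and conditions by arbitrary positive factors leaves K unchanged.
rho = rng.uniform(0.5, 2.0, size=5)
chi = rng.uniform(0.5, 2.0, size=4)
K_scaled = log_interactions(A * rho[:, None] * chi[None, :])

# The SVD is then applied to K; no singular pair is discarded a priori.
U, s, Vt = np.linalg.svd(K, full_matrices=False)
```

The invariance of `K` under row and column rescaling is the property that motivates this normalization: multiplicative gene and condition effects become additive after the logarithm and are removed by the centering.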

The calculations to obtain the interactions are simpler than bistochastization, as they are done by a simple formula with no iteration. In addition, in this normalization the first singular eigenvectors $u_1$ and $v_1$ may carry important partitioning information. Therefore we do not automatically discard them as was done in the previously discussed normalizations. Finally, we note another connection between matrices of interactions and matrices resulting from bistochastization. Starting with a matrix of interactions K, we can produce a bistochastic matrix simply by adding a constant to K.

Postprocessing the Eigenvectors to Find Partitions

Each of the above normalization approaches (independent scaling, bistochastization, or log interactions) gives rise, after the SVD, to a set of gene and condition eigenvectors (which in the context of microarray analysis are sometimes termed eigengenes and eigenarrays; Hastie et al. 1999; Alter et al. 2000). In this section, we deal with how to interpret these vectors. First recall that in the case of the first two normalizations we discussed (the independent and bistochastic rescaling), we discard the largest eigenvalue, which is trivial in the sense that its eigenvectors make a trivial constant contribution to the matrix, and therefore carry no partitioning information. In the case of the log-interactions normalization, there is no eigenvalue that is trivial in this sense. We will use the terminology "largest eigenvalue" to mean the largest nontrivial eigenvalue, which, for example, is the second largest eigenvalue for the independent and bistochastic normalizations, whereas it is the largest eigenvalue for the log-interactions normalization. If the dataset has an underlying "checkerboard" structure, there is at least one pair of piecewise constant eigenvectors u and v that correspond to the same eigenvalue. One would expect that the eigenvectors corresponding to the largest eigenvalue would provide the optimal partition, in analogy with related spectral approaches to clustering (e.g., Shi and Malik 1997). In principle, the classification eigenvectors may not belong to the largest eigenvalue, and we closely inspect a few eigenvectors that correspond to the first few largest eigenvalues. We observed that for various synthetic data with near-perfect checkerboard-like block structure, the partitioning eigenvectors are commonly associated with one of the largest eigenvalues, but in a few cases an eigenvector with a small eigenvalue could be the partitioning one. (This occurs typically when the separation between blocks in E is smaller than the standard deviation within a block.) In order to extract partitioning information from these eigensystems, we examine all the eigenvectors by fitting them to piecewise constant vectors. This is done by sorting the entries of each eigenvector, testing all possible thresholds, and choosing the eigenvector with a partition that is well approximated by a piecewise constant vector. (Selecting one threshold partitions the entries in the sorted eigenvector into two subsets, two thresholds into three subsets, and so forth.) Note that to partition the eigenvector into two, one needs to consider $n - 1$ different thresholds; to partition it into three requires inspection of $(n - 1)(n - 2)/2$ different thresholds, and so on. This procedure is similar to application of the k-means algorithm to the one-dimensional eigenvectors. (In particular, in the experiments below we applied this procedure automatically to the six most dominant eigenvectors.) A common practice in spectral clustering is to perform a final clustering step on the data projected onto a small number of eigenvectors, instead of clustering each eigenvector individually (Shi and Malik 1997). In our experiments we too perform a final clustering step by applying both the k-means and the normalized cuts algorithms to the data projected onto the best two or three eigenvectors.
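The threshold search for a two-way split can be sketched as follows. This is an illustrative sketch only: the function names `best_two_way_split` and `most_stepwise` are hypothetical, and the selection score (residual sum of squares relative to total variation) is an assumption, since the text does not specify how "well approximated" is measured.

```python
import numpy as np

def best_two_way_split(vec):
    """Fit a two-level piecewise constant vector to the sorted entries of vec.

    Returns (sse, k, order): order sorts vec, and the best split places the
    sorted entries [0:k] in one group and [k:] in the other.
    """
    v = np.asarray(vec, dtype=float)
    order = np.argsort(v)
    s = v[order]
    best_sse, best_k = np.inf, None
    for k in range(1, s.size):          # the n - 1 candidate thresholds
        left, right = s[:k], s[k:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_k = sse, k
    return best_sse, best_k, order

def most_stepwise(vectors):
    """Index of the vector best approximated by a two-level step (assumed score)."""
    scores = []
    for v in vectors:
        v = np.asarray(v, dtype=float)
        total = ((v - v.mean()) ** 2).sum()
        sse, _, _ = best_two_way_split(v)
        scores.append(sse / total if total > 0 else np.inf)
    return int(np.argmin(scores))

# Hypothetical eigenvectors: a smooth ramp and a noisy two-level step.
ramp = np.linspace(-1.0, 1.0, 6)
step = np.array([0.50, -0.40, 0.52, -0.41, 0.49, -0.42])
sse, k, order = best_two_way_split(step)
chosen = most_stepwise([ramp, step])
```

Extending the loop to two thresholds gives the three-way split, at the cost of the $(n - 1)(n - 2)/2$ candidate pairs mentioned above.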

Our clustering method not only provides a division into clusters but also ranks the degree of membership of genes (and conditions) in the respective cluster according to the actual values in the partitioning-sorted eigenvectors. Each partitioning-sorted eigenvector can be approximated by a step-like (piecewise constant) structure, but the values of the sorted eigenvector within each step are monotonically decreasing. These values can be used to rank or represent gradual transitions within clusters. Such rankings may also be useful, for example, for revealing genes related to premalignant conditions, and for studying the ranking of patients within a disease cluster in relation to prognosis.

In addition to the uses of biclustering as a tool for data visualization and interpretation, it is natural to ask how to assess the quality of biclusters, in terms of statistical significance or stability. In general, this type of problem is far from settled; in fact, even in the simpler setting of ordinary clustering, new efforts to address these questions regularly continue to appear. One type of approach attempts to quantify the "stability" of suspected structure observed in the given data. This is done by mimicking the operation of collecting repeated independent data samples from the same data-generating distribution, repeating the analysis on those artificial samples, and seeing how frequently the suspected structure is observed in the artificial data. If the observed data contain sufficient replication, then the bootstrap approach of Kerr and Churchill (2001) may be applied to generate the artificial replicated data sets. However, most experiments still lack the sort of replication required to carry this out. For such experiments, one could generate artificial data sets by adding random noise to (Bittner et al. 2000) or subsampling (Ben-Hur et al. 2002) the given data.

We took an alternative approach to assess the quality of a biclustering by testing a null hypothesis of no structure in the data matrix. We first normalized the data and used the best partitioning pair of eigenvectors (among the six leading eigenvectors) to determine an approximate 2×2 block solution. We then calculated the sum of squared errors (SSE) for the least-squares fit of these blocks to the normalized data matrix. Finally, to assess the quality of this fit we randomly shuffled the data matrix and applied the same process to the shuffled matrix. For example, in the breast cell oncogene data set described below, fitting the normalized dataset to a 2×2 matrix obtained by division according to the second largest pair of eigenvectors of the original matrix is compared to fitting of 10,000 shuffled matrices (after bistochastization) to their corresponding best 2×2 block approximations. The SSE for this dataset is more than 100 standard deviations smaller than the mean of the SSE scores obtained from the shuffled matrices, leading to a correspondingly tiny P value for the hypothesis test of randomness in the data matrix.
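The shuffling test can be sketched as below. Several simplifications are assumed and are not the procedure used in the paper: the data are double-centered rather than bistochastized, the 2×2 partition is taken from the signs of the leading singular vectors instead of a threshold search over six eigenvectors, and only 200 shuffles are drawn. All function names and the synthetic matrix are hypothetical.

```python
import numpy as np

def double_center(M):
    """Subtract row and column means and add back the grand mean."""
    return M - M.mean(axis=1, keepdims=True) - M.mean(axis=0, keepdims=True) + M.mean()

def two_by_two_sse(M):
    """SSE of a 2x2 block fit; partition from the signs of the leading singular vectors."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    rows = U[:, 0] > 0            # simplification: sign split, not a threshold search
    cols = Vt[0] > 0
    sse = 0.0
    for r in (rows, ~rows):
        for c in (cols, ~cols):
            block = M[np.ix_(r, c)]
            if block.size:        # skip empty blocks
                sse += ((block - block.mean()) ** 2).sum()
    return sse

def shuffle_z_score(M, n_shuffles=200, seed=0):
    """Number of null standard deviations between the observed SSE and the shuffled mean."""
    rng = np.random.default_rng(seed)
    observed = two_by_two_sse(double_center(M))
    null = np.empty(n_shuffles)
    for t in range(n_shuffles):
        shuffled = rng.permutation(M.ravel()).reshape(M.shape)  # destroy structure
        null[t] = two_by_two_sse(double_center(shuffled))
    return (observed - null.mean()) / null.std()

# Synthetic checkerboard with noise (hypothetical data, 20 genes x 12 conditions).
rng = np.random.default_rng(1)
signal = np.outer(np.repeat([1.0, -1.0], 10), np.repeat([1.0, -1.0], 6))
M = signal + 0.3 * rng.standard_normal(signal.shape)
z = shuffle_z_score(M)
```

A strongly negative `z` indicates that the block fit to the real matrix is far tighter than the fits to shuffled matrices, which is the sense in which the SSE is reported in standard deviations above.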

Probabilistic Interpretation

In the biclustering approach, the normalization procedure, obtained by constraining the row sums to be equal to one constant and the column sums to be equal to another constant, is an integral part of the modeling that allows us to discern bidirectional structures. This normalization can be cast in probabilistic terms by imagining first choosing a random RNA transcript from all RNA in all samples (conditions), and then choosing one more RNA transcript randomly from the same sample. Here, when we speak of choosing "randomly" we mean that each possible RNA is equally likely to be chosen. Having chosen these two RNAs, we take note of which sample they come from and which genes they express. The matrix entry $(R^{-1}A)_{ij}$ may be interpreted as the conditional probability $p_{s|g}(j|i)$ that the sample is j, given that the first RNA chosen was transcribed from gene i. Similarly, $(C^{-1}A^{T})_{jk}$ may be interpreted as the conditional probability that the gene corresponding to the first transcript is k, given that the sample is j. Moreover, the product of the row-normalized matrix and the column-normalized matrix approximates the conditional probability $p_{g|g}(i|k)$ of choosing a transcript from gene i, given that we also chose one from gene k. This is so because, under the assumption that k and i are approximately conditionally independent given j (which amounts to saying that the probability of drawing a transcript from gene k, conditional on having chosen sample j, does not depend on whether or not the other RNA that we drew happened to be from gene i), we have

$$p_{g|g}(k|i) = \sum_{j} p_{s|g}(j|i)\, p_{g|sg}(k|j,i) \approx \sum_{j} p_{s|g}(j|i)\, p_{g|s}(k|j) = [(R^{-1}A)(C^{-1}A^{T})]_{ik}.$$

This expression reflects the tendency of genes i and k to coexpress, averaged over the different samples. Similarly, the product of the column- and row-normalized matrices approximates the conditional probability $p_{s|s}(j|l)$ that reflects the similarity between the expression profiles of samples j and l. Note that the probabilities $p_{g|g}(i|k)$ and $p_{s|s}(j|l)$ define asymmetrical affinity measures between any pair (i,k) of genes and any pair (j,l) of samples, respectively. This is very different from the usual symmetrical affinity measures, for example, correlation, used to describe the relationship between genes. However, for bistochastization, the matrices $B^{T}B$ and $BB^{T}$ represent symmetrical affinities, $p_{g|g}(i|k) = p_{g|g}(k|i)$ and $p_{s|s}(j|l) = p_{s|s}(l|j)$, respectively.
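A small numerical sketch of these conditional-probability matrices follows. The matrix `A` is a hypothetical nonnegative expression matrix (genes in rows, samples in columns), and the variable names are illustrative only.

```python
import numpy as np

# Hypothetical expression matrix: 4 genes (rows) x 3 samples (columns).
A = np.array([[4.0, 1.0, 1.0],
              [3.0, 2.0, 1.0],
              [1.0, 1.0, 6.0],
              [1.0, 2.0, 5.0]])

R_inv = np.diag(1.0 / A.sum(axis=1))    # R^{-1}: inverse row sums
C_inv = np.diag(1.0 / A.sum(axis=0))    # C^{-1}: inverse column sums

p_s_given_g = R_inv @ A                 # entry (i, j): p(sample j | gene i)
p_g_given_s = C_inv @ A.T               # entry (j, k): p(gene k | sample j)
p_g_given_g = p_s_given_g @ p_g_given_s # entry (i, k): gene-gene affinity

row_sums = p_g_given_g.sum(axis=1)      # each row is a probability distribution
is_symmetric = np.allclose(p_g_given_g, p_g_given_g.T)
```

Each row of the product sums to one, as a conditional distribution should, and for this toy matrix the affinity is not symmetric, in line with the remark that independent rescaling yields asymmetrical affinities.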

	RESULTS

Overall Format of the Results

We have performed a study in which we applied the above spectral biclustering methods to five groups of cancer microarray data sets: lymphoma (microarray and Affymetrix), leukemia, breast cancer, and central nervous system embryonal tumors. As explained above, we utilized SVD to find pairs of piecewise constant eigenvectors of genes and conditions that reflect the degree to which the data can be rearranged in a checkerboard structure. Our methods employ specific normalization schemes that highlight the similarity of both gene and condition eigenvectors to piecewise constant vectors, and this similarity, in turn, directly reflects the degree of biclustering. To assess our procedure, it is useful to see how well it compares to several benchmarks, with respect to achieving the goal of piecewise constant eigenvectors.

Our main results are presented in Figures 3-7. These show consistently formatted graphs of the projection of each dataset onto the best two eigenvectors. Each figure is laid out in six panels, with the first two panels associated with our biclustering methods and the next four panels showing the benchmarks. In particular:

   Panel a   Bistochastization shows biclustering using the bistochastic normalization.

   Panel b   Biclustering shows standard biclustering with independent rescaling of rows and columns.

   Panel c   SVD shows SVD applied to the raw data matrix A.

   Panel d   Binormalization shows SVD applied to a transformed matrix obtained by first rescaling its columns by their means and then standardizing the rows of the rescaled matrix, as proposed in Getz et al. (2000).

   Panel e   Normalized cuts shows a normalized cuts benchmark. Here we apply the normalized cuts algorithm using an affinity matrix obtained from a distance matrix, which, in turn, was derived by calculating the norms of the differences between the standardized columns of A, as proposed in Xing and Karp (2001). (See caption of Fig. 3 for more details.) Moreover, we applied the normalized cuts algorithm to an affinity matrix constructed from the column-rescaled row-standardized matrix (Getz et al. 2000), as in panel (d). We then examined whether a partition is visible in the eigenvectors that correspond to the second largest eigenvalue (which in the normalized cuts case are supposed to provide an approximation of the optimal partition) and in the subspace spanned by two or three eigenvectors with the best proximity to piecewise constant vectors.

   Panel f   Log-interaction shows SVD applied to a matrix where the raw expression data are substituted by the matrix K described above.
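The transformation behind the panel d benchmark (column rescaling followed by row standardization) can be sketched as follows. The function name is hypothetical, the standard deviation is taken with NumPy's default population convention (an assumption), and the input is assumed to have nonzero column means and no constant rows.

```python
import numpy as np

def column_rescale_row_standardize(A):
    """Divide each column by its mean, then give each row zero mean and unit variance."""
    A = np.asarray(A, dtype=float)
    scaled = A / A.mean(axis=0, keepdims=True)             # column rescaling
    centered = scaled - scaled.mean(axis=1, keepdims=True)
    return centered / scaled.std(axis=1, keepdims=True)    # row standardization

# Hypothetical expression matrix: 4 genes x 3 samples.
A = np.array([[4.0, 1.0, 1.0],
              [3.0, 2.0, 1.0],
              [1.0, 1.0, 6.0],
              [1.0, 2.0, 5.0]])
Z = column_rescale_row_standardize(A)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)           # benchmark SVD on Z
```

Unlike bistochastization or the log-interaction formula, the two steps here are applied one after the other, so the column rescaling is not preserved once the rows are standardized.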

Overall, by comparing the six panels in each of the five different figures, we see that in the bistochastization method (panel a) the distributions of the different samples have no or minimal overlap between clusters, and the clusters tend to be more compact. The biclustering method (panel b) results in slightly less separable clusters, but it tends to separate the clusters along a single eigenvector. Straight SVD of the different raw data (panel c) underperforms in comparison to our spectral methods, as can be seen from the intermingled distributions of tumors of different types or the less distinct clusters. Performing SVD on the log-interaction matrix of the raw expression data instead tends to produce results that are similar to those obtained with bistochastization (panel f). SVD of the column-rescaled row-standardized matrix (Getz et al. 2000) and the normalized cut method result in better partitioning than SVD of the raw data (panels d and e). However, in general, our spectral methods consistently perform well.

In the following sections we discuss each of the five datasets in detail.

Lymphoma Microarray Dataset

We first applied the methods to publicly available lymphoma microarray data: chronic lymphocytic leukemia (CLL), diffuse large B-cell lymphoma (DLCL), and follicular lymphoma (FL). The clustering results are shown in Figures 2 and 3. Whether we used the doubly stochastic-like matrix B or the biclustering method ($C^{-1}A^{T}R^{-1}A$) on the lymphoma dataset, we obtained the desired partitioning of patients in the eigenvectors of the second largest eigenvalue. The sorted eigenvectors give not only a partition of patients, but also an internal ranking of patients within a given disease. In addition, the outer product of the gene and tumor (sorted) eigenvectors allows us to observe which genes induce a partition of patients and vice versa. This can be seen in Figure 2. Dividing the eigenvector that corresponds to the second largest eigenvalue (in both methods) using the k-means algorithm (which is equivalent to fitting a piecewise constant vector to each of the eigenvectors) led to a clean partition between the DLCL patients and the patients with other diseases. This is highlighted in the header of Figure 2 and the x-axis of Figure 3a,b. The published analysis did not cluster two of the DLCL cases correctly (Alizadeh et al. 2000). Further partitioning of the CLL and the FL patients is obtained by using both the second- and third-largest eigenvectors. To divide the data we applied a recursive, two-way clustering using the normalized cuts algorithm to a two-column matrix composed of the 2nd and 3rd eigenvectors of both matrices. (Performing a final clustering step on the data projected onto a small number of eigenvectors is a common practice in spectral clustering.) Using the biclustering matrix with independent row and column normalizations, the patients were correctly divided, with the exception of two of the CLL patients, who were clustered together with the FL patients. The best partition was obtained using our doubly stochastic matrix, which divided the patients perfectly according to the three types of diseases.
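How the second eigenvector yields a partition under independent row and column rescaling can be sketched on synthetic data. The sketch uses the SVD of $R^{-1/2}AC^{-1/2}$, whose singular vectors, rescaled by $R^{-1/2}$ and $C^{-1/2}$, are eigenvectors of $R^{-1}AC^{-1}A^{T}$ and $C^{-1}A^{T}R^{-1}A$. The planted checkerboard, the noise level, and the sign-based split (standing in for the threshold search or k-means step) are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Planted checkerboard: 6 genes x 5 samples, two gene groups and two sample groups.
base = np.array([[5.0, 1.0],
                 [1.0, 4.0]])
row_group = np.array([0, 0, 0, 1, 1, 1])
col_group = np.array([0, 0, 1, 1, 1])
A = base[np.ix_(row_group, col_group)] + 0.05 * rng.random((6, 5))

r = A.sum(axis=1)                       # row sums (diagonal of R)
c = A.sum(axis=0)                       # column sums (diagonal of C)
An = A / np.sqrt(np.outer(r, c))        # R^{-1/2} A C^{-1/2}
U, s, Vt = np.linalg.svd(An, full_matrices=False)

# s[0] equals 1 and belongs to the trivial pair, which is discarded.
u2 = U[:, 1] / np.sqrt(r)               # gene eigenvector of R^{-1} A C^{-1} A^T
v2 = Vt[1] / np.sqrt(c)                 # sample eigenvector of C^{-1} A^T R^{-1} A
gene_labels = (u2 > 0).astype(int)      # simplification: split at zero
sample_labels = (v2 > 0).astype(int)
```

On real data the split point would be chosen by fitting a piecewise constant vector rather than by sign, and further eigenvectors would be combined for more than two groups, as described for the CLL and FL patients.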

Figure 2 (a) The outer product of the sorted eigenvectors u and v of the 2nd eigenvalue of the equal row- and column-sum bistochastic-like matrix B applied to a dataset with three types of lymphoma: CLL (C), FL (F), and DLCL (D). Sorting of v orders the patients according to the different diseases. (b) As in (a), the 2nd singular value contribution to the biclustering method ($C^{-1}A^{T}R^{-1}A$) of lymphoma CLL (C), FL (F), DLCL (D) partitioned the patients according to their disease, with one exception. We preselected all genes that had complete data along all experimental conditions (samples).

Figure 3 Lymphoma: Scatter plot of experimental conditions of the two best class partitioning eigenvectors $v_i$, $v_j$. The subscripts (i,j) of these eigenvectors indicate their corresponding singular values. CLL samples are denoted by red dots, DLCL by blue dots, and FL by green dots. (a) Bistochastization: the 2nd and 3rd eigenvectors of $BB^{T}$. (b) Biclustering: the 2nd and 3rd eigenvectors of $R^{-1}AC^{-1}A^{T}$. (c) SVD: the 2nd and 3rd eigenvectors of $AA^{T}$. (d) Normalization and SVD: the 1st and 2nd eigenvectors of the product of the transformed matrix with its transpose, where the transformed matrix is obtained by first dividing each column of A by its mean and then standardizing each row of the column-normalized matrix. (e) Normalized cut algorithm: 2nd and 3rd eigenvectors of the row-stochastic matrix P. P is obtained by first creating a distance matrix S using the Euclidean distance between the standardized columns of A, transforming it to an affinity matrix with zero diagonal elements and off-diagonal elements defined as $W_{ij} = \exp(-\alpha S_{ij})/\max(S_{ij})$, and finally normalizing each row sum of the affinity matrix to one. (f) As in (c) but with an SVD analysis of the log-interaction matrix K instead of A.

Lymphoma Affymetrix Dataset

The above lymphoma data were generated by microarray technology that provides relative measurements of expression data. We repeated the lymphoma analysis using data from a study relating B-CLL to memory B cells (Klein et al. 2001). These data were generated using Affymetrix U95A gene chips, which presumably allow measurements proportional to absolute mRNA levels. We selected samples taken from CLL, FL, and DLCL patients, but in addition we also included samples from DLCL cell lines. As can be seen in Figure 4a,b, the bistochastization method cleanly separates the four different sample types, and the biclustering separates these samples except for one DLCL sample that slightly overlaps with the FL distribution. We note that the DLCL patient expression patterns are closer to those of the FL patients than to the expression profiles of the DLCL cell lines (and $p_{g|g}(\mathrm{DLCL}|\mathrm{FL}) > p_{g|g}(\mathrm{DLCL}|\text{DLCL cell lines})$).

Figure 4 Scatter plots as in Fig. 3 with another lymphoma dataset generated using Affymetrix chips (Klein et al. 2001) instead of microarrays. DLCL samples are denoted by green dots, CLL by blue dots, FL by yellow dots, and DLCL cell lines by magenta dots.

Leukemia Dataset

We applied our methods to public microarray data of acute leukemia (B- and T-cell acute lymphocytic leukemia [ALL] and acute myelogenous leukemia [AML]). The patient distributions of the different diseases of the leukemia dataset become separated in the two-dimensional graphs generated by projecting the patient expression profiles onto the 2nd and 3rd gene class partition vectors of the biclustering method (Fig. 5b). The bistochastic method also partitions the patients well, with only one ambiguous case that is close to the boundary between ALL and AML (Fig. 5a). Application of k-means to a matrix composed of the 2nd and 3rd biclustering eigenvectors results in three misclassifications, which is a slight improvement over the four misclassifications reported by Golub et al. (1999). Further partitioning of the ALL cases is obtained by applying a normalized cuts clustering method to the biclustering eigenvectors, and produces a clear separation between T- and B-cell ALL. This is a slight improvement over published results (two misclassifications; Golub et al. 1999; Getz et al. 2000). Another advantage over their methods is that biclustering does not require specification of the number of desired clusters or lengthy searches for subsets of genes.

Figure 5 Leukemia data presented in the same format as in Fig. 3. B-cell ALL samples are denoted by red dots, T-cell ALL by blue dots, and AML by green dots. In this analysis we preselected all genes that had positive Affymetrix average difference expression levels.

Dataset From Breast Cell Lines Transfected With the CSF1R Oncogene

In another microarray experiment study (Kluger et al. 2001), an oncogene encoding a transmembrane tyrosine kinase receptor was mutated at two different phosphorylation sites. Benign breast cells were transfected with the wild-type oncogene, creating a phenotype that invades and metastasizes. The benign cell line was then transfected with the two mutated oncogenes, creating one phenotype that invades and another one that metastasizes. RNA expression levels were measured eight times for each phenotype. Transfection with a single oncogene is expected to generate similar expression profiles, presumably because only a few genes are biologically influenced. Therefore, it was desirable to see whether profiles of the different phenotypes can be partitioned.

Figure 8 allows us to examine the extent to which the data can be arranged in a checkerboard pattern. This is done by taking the outer product of the cell type-sorted eigenvector that has the most stepwise-like structure (and is associated with the first largest singular value) with the corresponding gene-sorted eigenvector. Due to noise in the data and similarity between the different samples, common clustering techniques such as hierarchical, k-means, and medoids did not succeed in cleanly partitioning the data, but the relevant eigen-array obtained following bistochastization or log-interaction normalization partitioned the samples perfectly. Expression levels of the four cell lines were measured in two separate sets of four measurements. We chose to measure the ratio of three of the cell lines, benign (a), invasive (c), and metastatic (d), with respect to the cell line that invades and metastasizes (b) in the first batch, and the corresponding ratios were similarly derived for the second batch. In Figure 8, the ratios from the first and second batches are denoted by (a, c, d) and (A, C, D), respectively. As can be seen, the simultaneous normalization methods partition the data such that all the phenotypes are separated into clusters; that is, "a"s were clustered with "A"s in one group, "c"s with "C"s in another group, and "d"s with "D"s in yet another group, as expected. Further exploration is required in order to relate those gene clusters to biological pathways that are relevant to these conditions.

Figure 6 Breast cell lines transfected with the CSF1R oncogene: Scatter plots as in Fig. 3 for mRNA ratios of benign breast cells and wild-type cells transfected with the CSF1R oncogene causing them to invade and metastasize (red), ratios of cells transfected with a mutated oncogene causing an invasive phenotype and cells transfected with the wild-type oncogene (blue), and ratios of cells transfected with a mutated oncogene causing a metastatic phenotype and cells transfected with the wild-type oncogene (green). In this case we preselected differentially expressed genes such that for at least one pair of samples, the genes had a twofold ratio.

Figure 7 Central nervous system embryonal tumor: Data generated using Affymetrix chips (Pomeroy et al. 2002) of medulloblastoma (blue), malignant glioma (pink), normal cerebella (cyan), rhabdoid (green), and primitive neuro-ectodermal (red) tumors. Scatter plots of experimental conditions projected onto the three best class partitioning eigenvectors using the same format as in Fig. 3.

Figure 8 Optimal array partitioning obtained by the 1st singular vectors of the log-interaction matrix. The data consist of eight measurements of mRNA ratios for three pairs of cell types: (A,a) benign breast cells and the wild-type cells transfected with the CSF1R oncogene causing them to invade and metastasize; (C,c) cells transfected with a mutated oncogene causing an invasive phenotype and cells transfected with the wild-type oncogene; and (D,d) cells transfected with a mutated oncogene causing a metastatic phenotype and cells transfected with the wild-type oncogene. In this case we preselected differentially expressed genes such that for at least one pair of samples, the genes had a threefold ratio. The sorted eigen-gene v₁ and eigen-array u₁ have gaps indicating partitioning of patients and genes, respectively. As a result, the outer product matrix sort(u₁) sort(v₁)^T has a "soft" block structure. The block structure is hardly seen when the raw data are sorted but not normalized. However, it is more noticeable when the data are both sorted and normalized. Also shown are the conditions projected onto the first two partitioning eigenvectors u₁ and u₂. Obviously, using the extra dimension gives a clearer separation.

Central Nervous System Embryonal Tumor Dataset

Finally, we analyzed the recently published CNS embryonal tumor dataset (Pomeroy et al. 2002). Pomeroy et al. partitioned these five tumor types using standard principal component analysis, but did so after employing a preselection of genes exhibiting variation across the data set (see Fig. 1b in Pomeroy et al. 2002). Using all genes, we find that the bistochastization method, and to a lesser degree the biclustering method, partitioned the medulloblastoma, malignant glioma, and normal cerebella tumors. As can be seen in Figure 7, the remaining rhabdoid tumors are more widely scattered in the subspace obtained by projecting the tumors onto the 2nd-4th gene partitioning eigenvectors of the biclustering and bistochastization methods. Nonetheless, the rhabdoid tumor distribution does not overlap with the other tumor distributions if we use the bistochastization method. The primitive neuro-ectodermal tumors (PNETs) did not cluster and were difficult to classify even using supervised methods.

	DISCUSSION

Unsupervised clustering of genes and experimental conditions in microarray data can potentially reveal genes that participate in cellular mechanisms that are involved in various diseases. In this paper we present a spectral biclustering method that utilizes the information gained by clustering the conditions to facilitate the clustering of genes, and vice versa. The method incorporates a closely integrated normalization. It also naturally discards the irrelevant constant background, such that no additional arguments are needed to ignore the contribution associated with the largest eigenvalue, as advocated in Alter et al. (2000). In particular, our method is designed to cluster populations of different tumors assuming that each tumor type has a subset of marker genes that exhibit overexpression and that typically are not overexpressed in other tumors. The main underlying assumption is that we can simultaneously obtain better tumor clusters and gene clusters by correlating genes averaged over different samples of the same tumors. Likewise, the correlation of two tumors is more apparent when averaged over sets of genes of similar expression profiles. In situations where the number of tumor types (the number of clusters of experimental conditions) happens to be equal to the number of typical gene profiles (the number of gene clusters), the biclustering algorithm is related to the modified normalized cuts objective function introduced by Dhillon (2001). In addition, in a situation where the data have approximately a checkerboard structure with more than two clusters on each side, there may be several eigenvectors indicating a partitioning. In this case we may be able to determine the number of clusters by identifying all of these eigenvectors, for example, using a pairwise measure such as mutual entropy between all pairs of eigenvectors.

The methods presented in this paper, particularly those incorporating simultaneous normalization of rows and columns, show consistent advantage over SVD spectral analysis of the raw data, the logarithm of the raw data, other forms of rescaling transformations of the raw data, and the normalized cuts partitioning of the raw or rescaled data. Nevertheless, our partitioning results are not perfect. Better results may be obtained by employing a generative model that better suits the data. It has been shown that removal of irrelevant genes that introduce noise can further improve clustering (as in Xing and Karp 2001). Furthermore, if partitioning in the gene dimension is sharper than partitioning in the condition dimension or vice versa, we can organize the conditions or genes of the blurrier dimension contiguously. Such arrangements perhaps give one a sense of the progression of disease states or the relevance of a gene to a particular disease.

	ACKNOWLEDGMENTS

Y.K. is supported by the Cancer Bioinformatics Fellowship from the Anna Fuller Fund, and M.G. acknowledges support from Human Genome array: Technology for Functional Analysis (NIH grant number P50 HG02357-01).

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.

	FOOTNOTES

⁶ Corresponding author.

E-MAIL genomeresearch@bioinfo.mbb.yale.edu; FAX (360) 838-7861.

Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.648603.

	REFERENCES

Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403: 503-511.
Alter, O., Brown, P.O., and Botstein, D. 2000. Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl. Acad. Sci. 97: 10101-10106.
Bapat, R.B. and Raghavan, T.E.S. 1997. Non-negative matrices and applications. Chapter 6. Cambridge University Press, Cambridge, UK.
Ben-Hur, A., Elisseeff, A., and Guyon, I. 2002. A stability based method for discovering structure in clustered data. Pac. Symp. Biocomput.: 6-17.
Bittner, M., Meltzer, P., Chen, Y., Jiang, Y., Seftor, E., Hendrix, M., Radmacher, M., Simon, R., Yakhini, Z., Ben-Dor, A. 2000. Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature 406: 536-540.
Brown, P.O. and Botstein, D. 1999. Exploring the new world of the genome with DNA microarrays. Nat. Genet. 21: 33-37.
Brown, M.P.S., Grundy, W.N., Lin, D., Sugnet, C., Ares, J.M., and Haussler, D. 2000. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. 97: 262-267.
Cheng, Y. and Church, G.M. 2000. Biclustering of expression data. In 8th International Conference on Intelligent Systems for Molecular Biology, August 2000. UC San Diego, La Jolla, CA.
Dhillon, I.S. 2001. Coclustering documents and words using bipartite spectral graph partitioning. In Proceedings of the Seventh Association for Computing Machinery, Special Interest Group on Knowledge Discovery in Data and Data Mining Conference, San Francisco, CA.
Eisen, M., Spellman, P.T., Brown, P.O., and Botstein, D. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. 95: 14863-14868.
Friedman, N., Linial, M., Nachman, I., and Pe'er, D. 2000. Using Bayesian networks to analyze expression data. J. Comp. Biol. 7: 601-620.
Getz, G., Levine, E., and Domany, E. 2000. Coupled two-way clustering analysis of gene microarray data. Proc. Natl. Acad. Sci. 97: 12079-12084.
Golub, G.H. and Van Loan, C.F. 1983. Matrix computations. Johns Hopkins University Press, Baltimore, MD.
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M., Downing, J.R., Caligiuri, M. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286: 531-537.
Hartigan, J.A. 1972. Direct clustering of a data matrix. J. Am. Stat. Assoc. 67: 123-129.
Hastie, T., Tibshirani, R., Eisen, M.B., Alizadeh, A., Levy, R., Staudt, L., Chan, W.C., Botstein, D., and Brown, P.O. 2000. "Gene shaving" as a method for identifying distinct sets of genes with similar expression patterns. Genome Biol. 1: research0003.0001-0003.0021.
Hastie, T., Tibshirani, R., Sherlock, G., Eisen, M., Brown, P.O., and Botstein, D. 1999. Imputing missing data for gene expression arrays. Stanford Statistics Department, Stanford, CA. http://www-stat.stanford.edu/~hastie/papers/missing.pdf
Hofmann, T. and Puzicha, J. 1999. Latent class models for collaborative filtering. In Proceedings of the International Joint Conference in Artificial Intelligence. IJCAI 1999, pp. 688-693. IJCAI Inc., Somerset, NJ.
Holter, N.S., Mitra, M., Maritan, A., Cieplak, M., Banavar, J.R., and Fedoroff, N.V. 2000. Fundamental patterns underlying gene expression profiles: Simplicity from complexity. Proc. Natl. Acad. Sci. 97: 8409-8414.
Kerr, M.K. and Churchill, G.A. 2001. Bootstrapping cluster analysis: Assessing the reliability of conclusions from microarray experiments. Proc. Natl. Acad. Sci. 98: 8961-8965.
Klein, U., Tu, Y., Stolovitzky, G.A., Mattioli, M., Cattoretti, G., Husson, H., Freedman, A., Inghirami, G., Cro, L., Baldini, L. 2001. Gene expression profiling of B cell chronic lymphocytic leukemia reveals a homogeneous phenotype related to memory B cells. J. Exp. Med. 194: 1625-1638.
Kluger, H., Kacinski, B., Kluger, Y., Mironenko, O., Gilmore-Hebert, M., Chang, J., Perkins, A.S., and Sapi, E. 2001. Microarray analysis of invasive and metastatic phenotypes in a breast cancer model. In Poster presented at the Gordon Conference on Cancer, Newport, RI.
Lazzeroni, L. and Owen, A. 2002. Plaid models for gene expression data. Statistica Sinica 12: 61-86.
Lian, Z., Wang, L., Yamaga, S., Bonds, W., Beazer-Barclay, Y., Kluger, Y., Gerstein, M., Newburger, P.E., Berliner, N., and Weissman, S.M. 2001. Genomic and proteomic analysis of the myeloid differentiation program. Blood 98: 513-524.
Lockhart, D.J. and Winzeler, E.A. 2000. Genomics, gene expression and DNA arrays. Nature 405: 827-836.
Mateos, A., Dopazo, J., Jansen, R., Tu, Y., Gerstein, M., and Stolovitzky, G. 2002. Systematic learning of gene functional classes from DNA array expression data by using multilayer perceptrons. Genome Res. 12: 1703-1715.
Pearson, K. 1901. On lines and planes of closest fit to systems of points in space. The London, Edinburgh and Dublin Philosophical Magazine and Journal of Science, Sixth Series 2: 559-572.
Pomeroy, S.L., Tamayo, P., Gaasenbeek, M., Sturla, L.M., Angelo, M., McLaughlin, M.E., Kim, J.Y., Goumnerova, L.C., Black, P.M., Lau, C. 2002. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 415: 436-442.
Raychaudhuri, S., Stuart, J.M., and Altman, R.B. 2000. Principal components analysis to summarize microarray experiments: Application to sporulation time series. In 2000 Pacific Symposium on Biocomputing, pp. 452-463.
Shi, J. and Malik, J. 1997. Normalized cuts and image segmentation. In IEEE Conf. Computer Vision and Pattern Recognition, pp. 731-737.
Stolovitzky, G., Califano, A., and Tu, Y. 2000. Analysis of gene expression microarrays for phenotype classification. In 8th International Conference on Intelligent Systems for Molecular Biology, August 2000. UC San Diego, La Jolla, CA.
Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E.S., and Golub, T.R. 1999. Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. 96: 2907-2912.
Ungar, L. and Foster, A. 1998. A formal statistical approach to collaborative filtering. In Conference on Automated Learning and Discovery CONALD '98, CMU.
van't Veer, L.J., Dai, H., van de Vijver, M.J., He, Y.D., Hart, A.A., Mao, M., Peterse, H.L., van der Kooy, K., Marton, M.J., Witteveen, A.T. 2002. Gene expression profiling predicts clinical outcome of breast cancer. Nature 415: 530-536.
Weiss, Y. 1999. Segmentation using eigenvectors: A unifying view. In Proceedings IEEE International Conference on Computer Vision, pp. 975-982.
Xing, E.P. and Karp, R.M. 2001. CLIFF: Clustering of high-dimensional microarray data via iterative feature filtering using normalized cuts. In 9th International Conference on Intelligent Systems for Molecular Biology, July 2001. Copenhagen, Denmark.

Received July 22, 2002; accepted in revised form January 28, 2003.

	Panel a Bistochastization shows biclustering using the bistochastic normalization.
	Panel b Biclustering shows standard biclustering with independent rescaling of rows and columns.
	Panel c SVD shows SVD applied to the raw data matrix A.
	Panel d Binormalization shows SVD applied to a transformed matrix obtained by first rescaling its columns by their means and then standardizing the rows of the rescaled matrix as proposed in Getz et al. (2000).
	Panel e Normalized cuts shows a normalized cuts benchmark. Here we apply the normalized cuts algorithm using an affinity matrix obtained from a distance matrix, which, in turn, was derived by calculating the norms of the differences between the standardized columns of A as proposed in Xing and Karp (2001). (See caption of Fig. 3 for more details.) Moreover, we applied the normalized cuts algorithm to an affinity matrix constructed from the column-rescaled row-standardized matrix (Getz et al. 2000), as in panel (d). We then examined whether a partition is visible in the eigenvectors that correspond to the second largest eigenvalue (which in the normalized cuts case are supposed to provide an approximation of the optimal partition) and in the subspace spanned by two or three eigenvectors with the best proximity to piecewise constant vectors.
	Panel f Log-interaction shows SVD applied to a matrix where the raw expression data is substituted by the matrix K described above.
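
As a rough illustration of the transformation described for panel d, the following NumPy sketch rescales each column of a matrix by its mean, standardizes each row of the rescaled matrix, and then takes the SVD of the result. The function name `binormalize`, the toy 4 x 4 checkerboard matrix, the use of the population standard deviation, and the sign-based split of the leading right singular vector are all assumptions made for this example; they are not details taken from this article or from Getz et al. (2000).

```python
import numpy as np


def binormalize(A):
    """Rescale columns by their means, then standardize rows.

    Hypothetical helper for illustration. Assumes every column mean and
    every row standard deviation of the rescaled matrix is nonzero.
    """
    A = np.asarray(A, dtype=float)
    # Step 1: divide each column (condition) by its mean.
    col_scaled = A / A.mean(axis=0, keepdims=True)
    # Step 2: give each row (gene) zero mean and unit standard deviation.
    centered = col_scaled - col_scaled.mean(axis=1, keepdims=True)
    return centered / col_scaled.std(axis=1, keepdims=True)


# Toy expression matrix with a 2 x 2 checkerboard of high and low blocks.
A = np.array(
    [
        [5.0, 5.0, 1.0, 1.0],
        [5.0, 5.0, 1.0, 1.0],
        [1.0, 1.0, 5.0, 5.0],
        [1.0, 1.0, 5.0, 5.0],
    ]
)

B = binormalize(A)

# SVD of the transformed matrix; rows of Vt are right singular vectors,
# i.e. patterns across the columns (conditions).
U, s, Vt = np.linalg.svd(B, full_matrices=False)
v = Vt[0]

# The leading right singular vector is piecewise constant on this toy
# input, so its sign separates the two groups of columns. The overall
# sign of a singular vector is arbitrary, so only the grouping matters.
column_groups = v > 0
```

On real expression data the singular vectors are only approximately piecewise constant, so a clustering step on the vector entries would normally replace the simple sign test used here.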
