
Kmeans inter and intra cluster ordering

Tags: r

I am wondering how other people handle the ordering of k-means clusters. I am making heatmaps (mainly of ChIP-Seq data) and getting nice-looking figures with a custom heatmap function (based on R's built-in heatmap function). However, I'd like two improvements. The first is to order my clusters by decreasing average value. For instance, the following code:

fit = kmeans(data, 8, iter.max=50, nstart=10)
d = data.frame(data, symbol)
d = data.frame(d, fit$cluster)
d = d[order(d$fit.cluster),]

gives me a data.frame ordered by the cluster column. What is the best way to order the rows so that the 8 clusters appear in order of their respective means?

Second, do you recommend sorting the rows WITHIN each cluster from highest mean value to lowest? This would give the data a more organized look, but it may lead an incautious observer to infer something that they should not. If you do recommend this, how would you do it most efficiently?

Ron Gejman asked Oct 14 '22 17:10


1 Answer

Not an exact answer to what you ask, but you might consider seriation instead of k-means clustering. It is closer to ordination than to clustering, but one end result is a heatmap of the seriated data, which sounds similar to what you are doing with k-means followed by a specifically ordered heatmap.

There is an R package for seriation, called seriation, and it has a vignette, which you can get directly from CRAN.
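As a rough illustration of what seriation looks like in practice, here is a minimal sketch. It assumes the seriation package is installed, uses made-up random data rather than real ChIP-Seq values, and relies on whatever default method `seriate()` picks for a distance object; the names `x`, `row_order` and `x_ser` are only for this example.

```r
## Sketch only: skipped entirely if the seriation package is not installed
if (requireNamespace("seriation", quietly = TRUE)) {
  set.seed(1)
  x <- matrix(rnorm(60), nrow = 20, ncol = 3)  # made-up data: 20 rows, 3 variables

  ## seriate the rows using Euclidean distances between them
  o <- seriation::seriate(dist(x))

  ## extract the row permutation and apply it
  row_order <- seriation::get_order(o)
  x_ser <- x[row_order, ]
}
```

The reordered matrix `x_ser` can then be drawn with `image()` or a heatmap function in the usual way. Unlike k-means, this gives a single linear ordering of the rows and no cluster labels.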

I'll answer the specifics of the Q once I've cooked up an example to try.

Ok - proper answer following on from your comment above. First some dummy data - 3 clusters of 10 samples each, on each of 3 variables.

set.seed(1)
dat <- data.frame(A = c(rnorm(10, 2), rnorm(10, -2), rnorm(10, -2)),
                  B = c(rnorm(10, 0), rnorm(10, 5), rnorm(10, -2)),
                  C = c(rnorm(10, 0), rnorm(10, 0), rnorm(10, -10)))

## randomise the rows
dat <- dat[sample(nrow(dat)),]
clus <- kmeans(scale(dat, scale = FALSE), centers = 3, iter.max = 50,
               nstart = 10)

## mean of all values (every row and column) in each cluster
mns <- sapply(split(dat, clus$cluster), function(x) mean(unlist(x)))

## order the data by cluster with clusters ordered by `mns`, low to high
dat2 <- do.call("rbind", split(dat, clus$cluster)[order(mns)])

## heatmaps
## original first, then reordered:
layout(matrix(1:2, ncol = 2))
image(1:3, 1:30, t(data.matrix(dat)), ylab = "Observations", 
      xlab = "Variables", xaxt = "n", main = "Original")
axis(1, at = 1:3)
image(1:3, 1:30, t(data.matrix(dat2)), ylab = "Observations", 
      xlab = "Variables", xaxt = "n", main = "Reordered")
axis(1, at = 1:3)
layout(1)

Yielding:

[Figure: original (left) and reordered (right) heatmaps]
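For the second part of the question, sorting rows within each cluster, one possible approach is to build a single ordering from two keys: the mean of the cluster a row belongs to, and the row's own mean. The sketch below reuses the dummy data from this answer and sorts both keys from high to low, as the question asks; `row_mean`, `clus_mean`, `ord`, `dat3` and `clus3` are names introduced here for illustration. Whether within-cluster sorting is advisable is a judgment call, since it adds visual structure that k-means itself did not find.

```r
set.seed(1)
dat <- data.frame(A = c(rnorm(10, 2), rnorm(10, -2), rnorm(10, -2)),
                  B = c(rnorm(10, 0), rnorm(10, 5), rnorm(10, -2)),
                  C = c(rnorm(10, 0), rnorm(10, 0), rnorm(10, -10)))
dat <- dat[sample(nrow(dat)), ]
clus <- kmeans(scale(dat, scale = FALSE), centers = 3, iter.max = 50,
               nstart = 10)

## mean of each row
row_mean <- rowMeans(dat)

## mean of the cluster each row belongs to; averaging row means equals
## averaging all values because every row has the same number of columns
clus_mean <- ave(row_mean, clus$cluster)

## clusters from highest to lowest mean, then rows within each cluster
## from highest to lowest mean (cluster id only breaks exact ties)
ord   <- order(-clus_mean, clus$cluster, -row_mean)
dat3  <- dat[ord, ]
clus3 <- clus$cluster[ord]
```

Note that `image()` draws the first row at the bottom of the plot, so to show the highest-mean cluster at the top you would plot `dat3[nrow(dat3):1, ]` instead.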

Gavin Simpson answered Oct 18 '22 03:10