Error with multiscale hierarchical clustering in R

I'm doing hierarchical clustering with an R package called pvclust, which builds on hclust by incorporating bootstrapping to calculate significance levels for the clusters obtained.

Consider the following data set with 3 dimensions and 10 observations:

mat <- as.matrix(data.frame("A"=c(9000,2,238),"B"=c(10000,6,224),"C"=c(1001,3,259),
                        "D"=c(9580,94,51),"E"=c(9328,5,248),"F"=c(10000,100,50),
                        "G"=c(1020,2,240),"H"=c(1012,3,260),"I"=c(1012,3,260),
                        "J"=c(984,98,49)))

When I use hclust alone, the clustering runs fine for both Euclidean measures and correlation measures:

# euclidean-based distance
dist1 <- dist(t(mat),method="euclidean")
mat.cl1 <- hclust(dist1,method="average")

# correlation-based distance
dist2 <- as.dist(1 - cor(mat))
mat.cl2 <- hclust(dist2, method="average")

However, when I use each setup with pvclust, as follows:

library(pvclust)

# euclidean-based distance
mat.pcl1 <- pvclust(mat, method.hclust="average", method.dist="euclidean", nboot=1000)

# correlation-based distance
mat.pcl2 <- pvclust(mat, method.hclust="average", method.dist="correlation", nboot=1000)

... I get the following errors:

  • Euclidean: Error in hclust(distance, method = method.hclust) : must have n >= 2 objects to cluster
  • Correlation: Error in cor(x, method = "pearson", use = use.cor) : supply both 'x' and 'y' or a matrix-like 'x'.

Note that the distance is calculated by pvclust so there is no need for a distance calculation beforehand. Also note that the hclust method (average, median, etc.) does not affect the problem.

When I increase the dimensionality of the data set to 4, pvclust runs fine. Why am I getting these errors from pvclust at 3 dimensions and below, but not from hclust? Furthermore, why do the errors disappear when I use a data set with 4 or more dimensions?

asked Nov 04 '22 14:11 by oisyutat


1 Answer

Near the end of the pvclust function there is this call:

mboot <- lapply(r, boot.hclust, data = data, object.hclust = data.hclust, 
    nboot = nboot, method.dist = method.dist, use.cor = use.cor, 
    method.hclust = method.hclust, store = store, weight = weight)

Digging deeper into boot.hclust, we find:

getAnywhere("boot.hclust")
function (r, data, object.hclust, method.dist, use.cor, method.hclust, 
    nboot, store, weight = F) 
{
    n <- nrow(data)
    size <- round(n * r, digits = 0)
    ....
            smpl <- sample(1:n, size, replace = TRUE)
            suppressWarnings(distance <- dist.pvclust(data[smpl, 
                ], method = method.dist, use.cor = use.cor))
    ....
}
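The data[smpl, ] subset in the snippet above is the fragile step. If size ever comes out as 1, R drops the single-row subset to a plain vector, and the distance code no longer receives a matrix. The sketch below uses base R only and a three-column slice of the question's data. It assumes that the Euclidean path boils down to hclust(dist(t(x))) and the correlation path to cor(x), which is inferred from the error messages in the question rather than quoted from the package.

```r
# Three columns of the question's matrix are enough to show the effect.
mat <- as.matrix(data.frame(A = c(9000, 2, 238),
                            B = c(10000, 6, 224),
                            C = c(1001, 3, 259)))

# A bootstrap sample of size 1: the subset drops from a matrix to a vector.
one_row <- mat[1, ]
is.matrix(one_row)

# Euclidean path (assumed shape): t() of a vector is a 1-row matrix, so dist()
# sees a single object and hclust() has nothing to cluster.
err_euclid <- tryCatch(hclust(dist(t(one_row)), method = "average"),
                       error = conditionMessage)

# Correlation path (assumed shape): cor() needs a matrix-like x when y is missing.
err_cor <- tryCatch(cor(one_row, method = "pearson"),
                    error = conditionMessage)

print(err_euclid)
print(err_cor)
```

Both messages should match the ones reported in the question, which suggests a single cause behind the two different errors.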

Also note that the default value of pvclust's r parameter is r=seq(.5,1.4,by=.1). However, the progress output shows that this value is adjusted internally:

Bootstrap (r = 0.33)... 
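The printed r = 0.33 is consistent with pvclust snapping each requested r to a whole number of rows, roughly floor(n * r) / n. That rule is an assumption inferred from the output, not something shown in the code above, and effective_size is a hypothetical helper introduced only for this illustration.

```r
# Default scales from the pvclust signature.
r_default <- seq(0.5, 1.4, by = 0.1)

# Hypothetical helper: number of rows drawn per bootstrap replicate,
# ASSUMING the package rounds n * r down to a whole row count.
effective_size <- function(n, r = r_default) floor(n * r)

size3 <- effective_size(3)  # the question's 3-row data
size4 <- effective_size(4)  # the 4-row case that runs fine

# Smallest bootstrap sample at each dimensionality.
print(min(size3))
print(min(size4))

# Rescaled r values for n = 3; the first ones print as 0.33.
print(round(size3 / 3, 2))
```

Under this assumption the smallest sample has 1 row when n = 3 but 2 rows when n = 4, which would explain why the errors disappear at 4 dimensions.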

So we get size <- round(3 * 0.33, digits = 0), which is 1, and data[smpl, ] then contains only 1 row, fewer than the 2 that are needed. After adjusting r so that every sample has at least 2 rows, the call finishes with a warning, which is possibly harmless, and still returns output:

mat.pcl1 <- pvclust(mat, method.hclust="average", method.dist="euclidean", 
                    nboot=1000, r=seq(0.7,1.4,by=.1))
Bootstrap (r = 0.67)... Done.
....
Bootstrap (r = 1.33)... Done.
Warning message:
In a$p[] <- c(1, bp[r == 1]) :
  number of items to replace is not a multiple of replacement length
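The output above shows several requested scales collapsing onto the same few effective values, from 0.67 to 1.33. A minimal sketch of picking r up front, so that every bootstrap sample has at least 2 rows and each sample size appears once, could look like this. usable_r is a hypothetical helper, not part of pvclust, and it relies on the same assumed floor(n * r) rounding.

```r
# Hypothetical helper: keep only scales whose sample size is at least
# min_size rows, and keep each distinct size once.
usable_r <- function(n, r = seq(0.5, 1.4, by = 0.1), min_size = 2) {
  size <- floor(n * r)                     # assumed rounding rule
  size <- unique(size[size >= min_size])   # drop too-small and repeated sizes
  size / n
}

r3 <- usable_r(3)
print(round(r3, 2))

# Possible usage (requires the pvclust package and the question's mat):
# mat.pcl1 <- pvclust(mat, method.hclust = "average",
#                     method.dist = "euclidean", nboot = 1000, r = r3)
```

Dropping the repeated sizes may also avoid the "number of items to replace" warning, though that is not verified here. With only three distinct sample sizes, the multiscale fit has very little to work with, so the resulting p-values should be read with caution.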

Let me know if the result is satisfactory.

answered Nov 15 '22 09:11 by Julius Vainora