I'm running some cluster analysis and I'm trying to figure out two main things:
1) How to best interpret the p-values in pvclust (what is the null hypothesis they are testing?)
2) How to translate these results to hclust
I'll use the mtcars data set as an example...
Using the pvclust function first (with Euclidean distance and complete linkage):
library(pvclust)
d.pv <- pvclust(t(mtcars), method.dist = "euclidean",
                method.hclust = "complete", nboot = 10)
I then produce my dendrogram and ask for red boxes to be placed around significant clusters (alpha = 0.95):
plot(d.pv)
pvrect(d.pv, alpha = 0.95)
I get this figure:
I can also call up the observations that were significant:
pvpick(d.pv, alpha = 0.95)
But what is the significance of these findings? We see two clusters in the dendrogram, so does this significant result (p = 0.02) mean that these two clusters differ? Is it that simple?
Since cluster analysis is a descriptive/exploratory technique, what if I built this using the hclust function and specified that I felt there were 3 clusters of interest?
d <- dist(mtcars, method = "euclidean")
hc <- hclust(d, method = "complete")
plot(hc)
rect.hclust(hc, k = 3, border = "red")
Now my dendrogram looks like this:
Because I am interested in 3 clusters, is there a way to get pvclust to make the comparison between these 3 clusters? Or maybe that isn't what pvclust is comparing? What is the null hypothesis here?
If I look at the dendrogram produced by pvclust, it seems that within the red box on the right the two other clusters I am interested in specifying (with hclust) also have significant p-values. How would I report or explain this?
Maybe this is more a question for Cross Validated than for Stack Overflow?
I'm not certain that the way you use pvclust here - on the transposed dataset - is appropriate. This becomes clearer once you understand how pvclust works and how to interpret its output.
Please take this reply with a grain of salt, as I'm not a specialist in this kind of method.
NB: The mtcars dataset has the different car models on the rows and the different car characteristics on the columns. You perform a clustering of the rows (car models) based on their characteristics.
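For reference, a quick check of that orientation in R:
dim(mtcars)              # 32 rows (car models) x 11 columns (characteristics)
head(rownames(mtcars))   # "Mazda RX4" "Mazda RX4 Wag" ...
names(mtcars)            # "mpg" "cyl" "disp" "hp" ...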
pvclust output

pvclust provides 2 types of results, displayed at each node of the dendrogram as red and green numbers (the gray numbers are just the node numbers) or via the print method:
1) Bootstrap Probability * 100 (BP), which has the most straightforward interpretation
2) Approximately Unbiased p-value * 100 (AU), which is a corrected version of BP intended to limit its bias

The interpretation of BP is quite straightforward. In your example pvclust will bootstrap the columns of the dataset - here the characteristics of the cars - (i.e. some columns will be dropped and others will appear more than once), then recompute the Euclidean distances between the rows and rebuild the dendrogram. It will repeat this nboot * length(r) times (length(r) = 10 by default) and, for each run, it will check for each cluster whether the same set of cars is present together in a cluster built from the bootstrapped dataset.
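To make that concrete, here is a rough sketch of what a single bootstrap replicate amounts to, in terms of the original mtcars orientation used above; this is not pvclust's internal code, and the object names are just for illustration:
set.seed(1)
boot.cols <- sample(ncol(mtcars), ncol(mtcars), replace = TRUE)   # resample the 11 characteristics with replacement
d.boot  <- dist(mtcars[, boot.cols], method = "euclidean")        # distances between car models on the resampled characteristics
hc.boot <- hclust(d.boot, method = "complete")                    # dendrogram for this replicate
# The BP of a cluster is the proportion of such replicates in which the same
# set of cars ends up together in one cluster of hc.boot.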
In your example the first cluster on the left has a BP = 0.31, meaning that these 9 car models (Maserati Bora, Chrysler Imperial, ..., Pontiac Firebird) ended up in the same cluster in 31 runs out of the 100 bootstraps.
This means that if you used another random sample of car characteristics (but does this make sense??) this cluster would appear in only 31% of the cases. NB: the topology inside the cluster might differ between the bootstraps. The only criterion checked is whether these cars are in the same cluster or not, whatever their relationships inside the cluster.
As for your second question, you have a BP number for each node, so you can interpret any of the clusters in the same way. For example, the middle cluster you show with hclust (with 7 cars: Hornet 4 Drive, ..., AMC Javelin) has a BP = 0.39, which can be interpreted as explained above.
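If it helps, those per-node values can also be read directly from the fitted object, and pvrect/pvpick can select clusters on BP rather than AU. A short sketch using the d.pv object from your question (the $edges component should hold the AU/BP columns):
print(d.pv)                              # AU/BP (and related) values, one row per node
round(d.pv$edges[, c("au", "bp")], 2)    # just the AU and BP columns
plot(d.pv)
pvrect(d.pv, alpha = 0.95, pv = "bp")    # boxes based on BP instead of AU
pvpick(d.pv, alpha = 0.95, pv = "bp")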
Using pvclust to test clustering on the observations (rows)?

The bootstrap is typically used to estimate the variability of a statistic if you could resample new datasets. In your example, if you could resample a new dataset, you would probably have the same car characteristics but a different set of car models. The car models are the samples/observations here, and it makes more sense to resample the observations than the features/variables. pvclust is intended to be used to test the clustering of the variables (columns), not of the observations (rows). This is probably why it performs the clustering on the columns, while hclust performs the clustering on the rows.
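So if the goal is the kind of test pvclust was designed for, you would pass mtcars without transposing it, so that the 11 characteristics (columns) are clustered and the 32 car models (rows) are what gets resampled; something like (object name is just for illustration):
library(pvclust)
pv.vars <- pvclust(mtcars, method.dist = "euclidean",
                   method.hclust = "complete", nboot = 1000)
plot(pv.vars)
pvrect(pv.vars, alpha = 0.95)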
The methods used by pvclust have their origin in phylogenetic analysis, where you try to cluster individuals from different species based on their DNA or other characteristics.
The phylogenetic or genomic case is, however, a little peculiar. In this situation, the observations/samples are the individuals (the source of the DNA) and the variables/features/descriptors are, for example, the genes or their expression levels. In this case it makes sense to cluster the individuals based on their genes and to test the stability of the clusters by bootstrapping the genes, because the genes are manifold and it is reasonable to resample at random within the gene pool.
In ecology you typically work on sites (= observations) x species (= variables) matrices, and it is common to perform a clustering analysis on the sites based on their species composition. In this case it makes no sense to bootstrap the species to test the stability of the site clusters, because the whole community should be used to characterize the sites and you cannot just drop some species at random... But it does make sense to perform a clustering analysis of the species and to test the stability of the species associations if you could resample a new set of sites.
The AU should be interpreted in a similar way to BP but is considered an unbiased version of it. However, the theory behind it is much more technical and beyond my reach... So, a priori, you should use this value rather than BP for your interpretation of the clusters, but I suppose you should be even more cautious when the resampling is done on the variables, as in your example, because the bias correction has probably not been developed in that context (this is just a guess; I don't know exactly how risky this is).
BP is often presented as a downward-biased statistic. For example, Paradis (2011) explains that if BP is high then the cluster is supported by the data, but if it is low, it could be either because the cluster is not supported by the data or because of the bias. Efron et al. (1996), however, argue that this is a misinterpretation of what the bootstrap method is actually testing and that BP is not systematically downward biased. They propose an alternative way to compute this p-value, "to better agree with standard ideas of confidence levels and hypothesis testing", using a "two-level bootstrap algorithm". Suzuki & Shimodaira have extended this approach with a "multistep-multiscale bootstrap algorithm", which is what pvclust uses to compute AU.
In the first step of this algorithm you redo the bootstrapping several times, but with different sample sizes. In your example there are 11 columns, so a classical bootstrap would resample these columns with replacement 11 times and repeat this nboot times (default = 1000; you chose nboot = 10). With the "multistep-multiscale bootstrap algorithm" you repeat this process 1000 * 10 times (by default), but with sample sizes of 5, 6, 7, 8, 9, 11, 12, 13, 14 and 15. This is controlled by the r argument of the function, which gives the ratio between the bootstrap sample size and the original sample size (here = 11; sizes 5 to 15 by default). This is displayed on the console when you run pvclust: Bootstrap (r = 0.45)... Done. (i.e. the first set of bootstraps, with sample size = 0.45 * 11 ≈ 5).
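Those sample sizes can be reproduced from the default r values; a quick check, assuming pvclust takes the integer part of r times the original sample size (11 here):
r.default <- seq(0.5, 1.4, by = 0.1)   # pvclust's default r values
floor(r.default * 11)                  # 5 6 7 8 9 11 12 13 14 15 -- the sample sizes listed above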
Efron, B. et al. (1996) Bootstrap confidence levels for phylogenetic trees. Proc. Natl Acad. Sci. 93: 13429-13434.
Paradis, E. (2011) Analysis of Phylogenetics and Evolution with R, 1st ed. Springer.
Shimodaira, H. (2004) Approximately unbiased tests of regions using multistep-multiscale bootstrap resampling. Ann. Stat. 32: 2616-2641.
Suzuki, R. & Shimodaira, H. (2006) Pvclust: an R package for assessing the uncertainty in hierarchical clustering. Bioinformatics 22: 1540-1542.