I'm trying to use hierarchical clustering (specifically hclust) to cluster a data set into 10 groups with sizes of 100 members or fewer, and with no group having more than 40% of the total population. The only method I currently know is to repeatedly use cut() and select continually lower levels of h until I'm happy with the dispersion of the cuts. However, this forces me to then go back and re-cluster the groups I pruned in order to aggregate them into 100-member groups, which can be very time consuming.
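For reference, a minimal sketch of that trial-and-error workflow, assuming cutree() on the hclust object is what "repeatedly use cut()" amounts to (USArrests is just a stand-in dataset here, and the 0.95 shrink factor is arbitrary):
## Illustrative only: lower the cut height until every cluster has <= 100
## members and no cluster holds more than 40% of the observations
hc <- hclust(dist(USArrests))           # any hclust object stands in here
h  <- max(hc$height)
repeat {
  groups <- cutree(hc, h = h)
  sizes  <- table(groups)
  if (max(sizes) <= 100 && max(sizes) / length(groups) <= 0.40) break
  h <- h * 0.95                         # cut a little lower and try again
}
table(groups)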
I've experimented with the dynamicTreeCut package, but can't figure out how to enter these (relatively simple) limitations. I'm using deepSplit as the way to designate the number of groupings, but following the documentation, this limits the maximum number to 4. For the exercise below, all I'm looking to do is to get the clusters into 5 groups of 3 or more individuals (I can deal with the maximum size limitation on my own, but if you want to try to tackle this too, it would be helpful!).
Here's my example, using the Orange dataset.
library(dynamicTreeCut)
library(reshape2)
##creating 14 individuals from Orange's original 5
Orange1<-Orange
Orange1$Tree<-as.numeric(as.character(Orange1$Tree))
Orange2<-Orange1
Orange3<-Orange1
Orange2$Tree=Orange2$Tree+6
Orange3$Tree=Orange3$Tree+11
combOr<-rbind(Orange1, Orange2[1:28,], Orange3)
####casting the data to make a correlation matrix, and then running
#### a hierarchical cluster
castOrange<-dcast(combOr, age~Tree, mean, fill=0)
castOrange[,16]<-c(1,34,5,35,34,35,21)
castOrange[,17]<-c(1,34,5,35,34,35,21)
orangeCorr<-cor(castOrange[, -1])
orangeClust<-hclust(dist(orangeCorr))
###running the dynamic tree cut
dynamicCut<-cutreeDynamic(orangeClust, minClusterSize=3, method="tree", deepSplit=4)
dynamicCut
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0
As you can see, it only designates a single cluster (plus two unassigned points labelled 0). For my exercise, I want to shy away from using an explicit height term to cut the trees, as I want a fixed number k of clusters instead.
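For what it's worth, base R already offers a plain k-based cut; a minimal sketch (note it enforces no minimum or maximum cluster size, which is why dynamicTreeCut is attractive here):
## Cut the same tree into exactly 5 groups by k rather than by height
kCut <- cutree(orangeClust, k = 5)
table(kCut)   # cluster sizes; nothing guarantees 3 or more members per group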
1- Figure out the most appropriate dissimilarity measure (e.g., "euclidean", "maximum", "manhattan", "canberra", "binary", or "minkowski") and linkage method (e.g., "ward", "single", "complete", "average", "mcquitty", "median", or "centroid") based on the nature of your data and the objective(s) of clustering. See ?dist and ?hclust for more details. (A small sketch for comparing linkage methods follows this list.)
2- Plot the dendrogram before starting the cutting step. See ?hclust for more details.
3- Use the hybrid adaptive tree cut method in the dynamicTreeCut package, and tune the shape parameters (maxCoreScatter and minGap / maxAbsCoreScatter and minAbsGap). See Langfelder et al. 2009 (http://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/BranchCutting/Supplement.pdf).
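As a rough aid for point 1 above, one option (not part of the dynamicTreeCut workflow itself) is to compare linkage methods by how well their cophenetic distances correlate with the input distances; the method names below are just the common ones listed in ?hclust:
## Illustrative only: cophenetic correlation for a few linkage methods
d <- dist(orangeCorr, method = "euclidean")
sapply(c("single", "complete", "average", "ward.D2"), function(m) {
  hc <- hclust(d, method = m)
  cor(d, cophenetic(hc))   # higher = dendrogram preserves original distances better
})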
For your example,
1- Change the "euclidean" and/or "complete" methods as appropriate,
orangeClust <- hclust(dist(orangeCorr, method="euclidean"), method="complete")
2- Plot the dendrogram,
plot(orangeClust)
3- Use the hybrid tree cut method and tune shape parameters,
dynamicCut <- cutreeDynamic(orangeClust, minClusterSize=3, method="hybrid", distM=as.matrix(dist(orangeCorr, method="euclidean")), deepSplit=4, maxCoreScatter=NULL, minGap=NULL, maxAbsCoreScatter=NULL, minAbsGap=NULL)
dynamicCut
..cutHeight not given, setting it to 1.8 ===> 99% of the (truncated) height range in dendro.
..done.
2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0
As a guide for tuning the shape parameters, the default values are
deepSplit=0: maxCoreScatter = 0.64 & minGap = (1 - maxCoreScatter) * 3/4
deepSplit=1: maxCoreScatter = 0.73 & minGap = (1 - maxCoreScatter) * 3/4
deepSplit=2: maxCoreScatter = 0.82 & minGap = (1 - maxCoreScatter) * 3/4
deepSplit=3: maxCoreScatter = 0.91 & minGap = (1 - maxCoreScatter) * 3/4
deepSplit=4: maxCoreScatter = 0.95 & minGap = (1 - maxCoreScatter) * 3/4
As you can see, both maxCoreScatter and minGap should be between 0 and 1, and increasing maxCoreScatter (or decreasing minGap) increases the number of clusters (with smaller sizes). The meaning of these parameters is described in Langfelder et al. 2009.
For example, to get a larger number of smaller clusters,
maxCoreScatter <- 0.99
minGap <- (1 - maxCoreScatter) * 3/4
dynamicCut <- cutreeDynamic(orangeClust, minClusterSize=3, method="hybrid", distM=as.matrix(dist(orangeCorr, method="euclidean")), deepSplit=4, maxCoreScatter=maxCoreScatter, minGap=minGap, maxAbsCoreScatter=NULL, minAbsGap=NULL)
dynamicCut
..cutHeight not given, setting it to 1.8 ===> 99% of the (truncated) height range in dendro.
..done.
2 3 2 2 2 3 3 2 2 3 3 2 2 2 1 2 1 1 1 2 2 1 1 2 2 1 1 1 0 0
Finally, your clustering constraints (size, height, number, etc.) should be reasonable and interpretable, and the generated clusters should agree with the data. This leads you to the important step of cluster validation and interpretation.
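To relate this back to the size constraints in the question, a quick sanity check of any labelling against the "at most 100 members and at most 40% of the population" rules could look like this hedged sketch (the thresholds come from the question, not from dynamicTreeCut):
## Check cluster sizes against the stated constraints (illustrative only)
sizes <- table(dynamicCut[dynamicCut != 0])   # drop unassigned objects (label 0)
n     <- length(dynamicCut)
all(sizes <= 100)          # no cluster larger than 100 members?
all(sizes / n <= 0.40)     # no cluster with more than 40% of the population?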
Good Luck!