Using plot(hclust(dist(x))), I was able to draw a cluster tree. It works, but I would like to get a list of all clusters rather than a tree diagram, because I have a huge amount of data (around 150K nodes) and the plot gets messy.
In other words, if a, b, c form one cluster and d, e, f, g form another, then I would like to get something like this:
1 a,b,c
2 d,e,f,g
Please note that this is not exactly what I want as "output"; it is just an example. I would simply like to get a list of clusters instead of a tree plot. It could be a vector, a matrix, or just numbers that show which group each element belongs to.
How is this possible?
The hclust function in R uses the complete linkage method for hierarchical clustering by default. This particular clustering method defines the cluster distance between two clusters to be the maximum distance between their individual components.
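To make the complete-linkage definition concrete, here is a minimal sketch on four made-up points (the data and the names x, dm and complete_ab_cd are mine, not from the thread). It checks the default method and computes the distance between two clusters by hand as the largest pairwise distance between their members.

```r
# Toy data: four points on a line, so distances are easy to check by hand.
x <- matrix(c(0, 1, 5, 6), ncol = 1,
            dimnames = list(c("a", "b", "c", "d"), NULL))
d <- dist(x)

hc <- hclust(d)   # no method argument, so the default is used
hc$method         # "complete"

# Complete-linkage distance between clusters {a, b} and {c, d}:
# the largest pairwise distance between a member of one and a member of the other.
dm <- as.matrix(d)
complete_ab_cd <- max(dm[c("a", "b"), c("c", "d")])
complete_ab_cd    # 6, the distance from a to d

# The final merge joins {a, b} with {c, d}, so its height is that same distance.
max(hc$height)
```

With single linkage the same two clusters would be 4 apart (b to c), which is why the choice of method can change the tree.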
This can be done with the R function cutree. It cuts a tree (or dendrogram), as resulting from hclust (or diana/agnes), into several groups, either by specifying the desired number of groups (k) or the cut height (h).
There are two types of hierarchical clustering: divisive (top-down) and agglomerative (bottom-up).
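As a rough illustration of the two approaches, the sketch below uses agnes (agglomerative) and diana (divisive) from the cluster package, which I am assuming is installed (it is a recommended package in standard R distributions). The choice of USArrests, average linkage for agnes and k = 3 is mine and only for demonstration. Both results are converted with as.hclust before being passed to cutree.

```r
library(cluster)  # assumed to be installed; provides agnes() and diana()

x <- USArrests    # built-in example data, used here only for illustration

ag <- agnes(x, method = "average")  # agglomerative: starts from single points and merges
di <- diana(x)                      # divisive: starts from one cluster and splits

# Convert to hclust objects, then cut each tree into three groups.
groups_ag <- cutree(as.hclust(ag), k = 3)
groups_di <- cutree(as.hclust(di), k = 3)

# Cross-tabulate the two assignments to see where they agree.
table(agglomerative = groups_ag, divisive = groups_di)
```

The two methods need not produce the same groups, so the cross-table is a quick way to compare them.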
We consider cost functions for cluster trees that capture the quality of the hierarchical clustering produced by $T$. The Axiom.
I will use a dataset that ships with R (USArrests) to demonstrate how to cut a tree into the desired number of pieces. The result is a named vector giving the cluster number of each observation.
Construct an hclust object.
hc <- hclust(dist(USArrests), "ave")
# plot(hc)
You can now cut the tree into as many branches as you want. For my next trick, I will split the tree into two groups. You set the number of groups with the k parameter. See ?cutree for the parameter h (the cut height), which may be more useful to you (compare cutree(hc, k = 2) == cutree(hc, h = 110)).
cutree(hc, k = 2)
       Alabama         Alaska        Arizona       Arkansas     California
             1              1              1              2              1
      Colorado    Connecticut       Delaware        Florida        Georgia
             2              2              1              1              2
        Hawaii          Idaho       Illinois        Indiana           Iowa
             2              2              1              2              2
        Kansas       Kentucky      Louisiana          Maine       Maryland
             2              2              1              2              1
 Massachusetts       Michigan      Minnesota    Mississippi       Missouri
             2              1              2              1              2
       Montana       Nebraska         Nevada  New Hampshire     New Jersey
             2              2              1              2              2
    New Mexico       New York North Carolina   North Dakota           Ohio
             1              1              1              2              2
      Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina
             2              2              2              2              1
  South Dakota      Tennessee          Texas           Utah        Vermont
             2              2              2              2              2
      Virginia     Washington  West Virginia      Wisconsin        Wyoming
             2              2              2              2              2
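If you want the "1 a,b,c / 2 d,e,f,g" shape from the question rather than one number per observation, one possible approach is to split the observation names by their cluster number. This is a small sketch using base R only; the variable names groups and clusters are mine.

```r
hc <- hclust(dist(USArrests), "ave")
groups <- cutree(hc, k = 2)          # named integer vector: state -> cluster number

# One character vector of member names per cluster.
clusters <- split(names(groups), groups)

# Collapse each cluster into a single comma-separated string.
sapply(clusters, paste, collapse = ",")

# Cluster sizes, which are often more useful than the full listing for large data.
table(groups)
```

Here clusters is a plain list, so clusters[["1"]] gives the members of the first group.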
Let's say:
y <- dist(x)
clust <- hclust(y)
groups <- cutree(clust, k = 3)
x <- cbind(x, groups)
Now each record has its cluster group attached as an extra column. You can subset the dataset by group as well:
x1 <- subset(x, groups == 1)
x2 <- subset(x, groups == 2)
x3 <- subset(x, groups == 3)
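When the number of groups is not fixed at three, writing one subset call per group gets tedious. A possible alternative, sketched here with USArrests standing in for x (my assumption, since the thread does not say what x is), is to let split build all the subsets at once.

```r
x <- USArrests                        # stand-in for the poster's data
groups <- cutree(hclust(dist(x)), k = 3)

# A list with one data frame per cluster, named "1", "2", "3".
by_cluster <- split(x, groups)

sapply(by_cluster, nrow)              # number of records in each cluster

# Per-cluster column means as a quick summary of each group.
aggregate(x, by = list(cluster = groups), FUN = mean)
```

by_cluster[["1"]] then plays the role of x1 above, and the same code works unchanged for any k.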