Using plot(hclust(dist(x))), I was able to draw a cluster tree (dendrogram). It works, but I would like to get a list of all clusters rather than a tree diagram, because I have a huge amount of data (around 150K nodes) and the plot gets messy.
In other words, if a, b, c form one cluster and d, e, f, g form another, I would like to get something like this:
    1  a,b,c
    2  d,e,f,g
Please note that this is not exactly the output I need; it is just an example. I simply want a list of clusters instead of a tree plot. It could be a vector, a matrix, or plain numbers that show which group each element belongs to.
How is this possible?
The hclust function in R uses the complete linkage method for hierarchical clustering by default. This particular clustering method defines the cluster distance between two clusters to be the maximum distance between their individual components.
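To make the "maximum distance" definition concrete, here is a minimal sketch on made-up data (three points on a line, chosen purely for illustration). It compares the default complete linkage with single linkage, which takes the minimum instead of the maximum.

```r
# Three hypothetical points on a line, chosen so the linkage methods disagree.
x <- matrix(c(0, 1, 5), ncol = 1, dimnames = list(c("a", "b", "c"), NULL))
d <- dist(x)

hc_complete <- hclust(d)                     # default: method = "complete"
hc_single   <- hclust(d, method = "single")

# a and b merge first at height 1 under both methods.
# The final merge joins {a, b} with c:
#   complete linkage uses max(d(a, c), d(b, c)) = max(5, 4) = 5
#   single linkage uses   min(d(a, c), d(b, c)) = min(5, 4) = 4
hc_complete$height
hc_single$height
```

The merge heights are what cutree later compares against when you cut by height.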
This can be done with the R function cutree. It cuts a tree (or dendrogram), as produced by hclust (or diana/agnes), into several groups, given either the desired number of groups (k) or the cut height (h).
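As a rough illustration of the two ways to cut, the sketch below uses five made-up points (not data from the question) whose merge heights are easy to follow by hand. The names by_k2, by_h10 and by_k3 are just illustrative.

```r
# Five hypothetical points on a line, labelled a..e.
x <- matrix(c(0, 1, 5, 6, 20), ncol = 1, dimnames = list(letters[1:5], NULL))
hc <- hclust(dist(x))    # complete linkage by default

# Merge heights are 1, 1, 6, 20, so any cut height between 6 and 20
# leaves exactly two groups.
by_k2  <- cutree(hc, k = 2)    # ask for two groups
by_h10 <- cutree(hc, h = 10)   # cut at height 10

all(by_k2 == by_h10)           # both cuts give the same membership

by_k3 <- cutree(hc, k = 3)     # {a, b}, {c, d}, {e}
by_k3
```

Cutting by k is convenient when you know how many groups you want; cutting by h is convenient when the distance scale itself is meaningful.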
There are two types of hierarchical clustering: divisive (top-down) and agglomerative (bottom-up).
We consider cost functions for cluster trees that capture the quality of the hierarchical clustering produced by $T$.
I will use the USArrests dataset that ships with R to demonstrate how to cut a tree into a desired number of pieces. The result is a named vector of group numbers, one per observation.
Construct an hclust object.
    hc <- hclust(dist(USArrests), "average")
    # plot(hc)
You can now cut the tree into as many branches as you want. For my next trick, I will split the tree into two groups. You set the number of groups with the k parameter. See ?cutree and the h parameter, which may be more useful to you (compare cutree(hc, k = 2) == cutree(hc, h = 110)).
    cutree(hc, k = 2)
           Alabama         Alaska        Arizona       Arkansas     California
                 1              1              1              2              1
          Colorado    Connecticut       Delaware        Florida        Georgia
                 2              2              1              1              2
            Hawaii          Idaho       Illinois        Indiana           Iowa
                 2              2              1              2              2
            Kansas       Kentucky      Louisiana          Maine       Maryland
                 2              2              1              2              1
     Massachusetts       Michigan      Minnesota    Mississippi       Missouri
                 2              1              2              1              2
           Montana       Nebraska         Nevada  New Hampshire     New Jersey
                 2              2              1              2              2
        New Mexico       New York North Carolina   North Dakota           Ohio
                 1              1              1              2              2
          Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina
                 2              2              2              2              1
      South Dakota      Tennessee          Texas           Utah        Vermont
                 2              2              2              2              2
          Virginia     Washington  West Virginia      Wisconsin        Wyoming
                 2              2              2              2              2
Let's say,
    y <- dist(x)
    clust <- hclust(y)
    groups <- cutree(clust, k = 3)
    x <- cbind(x, groups)
Now each record has its cluster group attached. You can subset the dataset as well:
    x1 <- subset(x, groups == 1)
    x2 <- subset(x, groups == 2)
    x3 <- subset(x, groups == 3)
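If what you want is literally one entry per cluster listing its members (the "1 a,b,c 2 d,e,f,g" shape from the question), one possible approach is to split the observation names by the vector that cutree returns. This is a sketch on USArrests with k = 2; the variable names clusters and one_line are illustrative, and it assumes the data has row names (otherwise you could split seq_along(groups) instead).

```r
hc <- hclust(dist(USArrests), "average")
groups <- cutree(hc, k = 2)          # named integer vector: state -> group

# One list element per cluster, each holding the member names.
clusters <- split(names(groups), groups)

# Cluster sizes, handy when there are too many members to print.
lengths(clusters)

# Collapse each cluster into a single comma-separated string.
one_line <- vapply(clusters, paste, character(1), collapse = ",")
one_line[["1"]]
```

For large data, printing lengths(clusters) or table(groups) first is usually more readable than printing every member.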