
Clustering list for hclust function

Tags:

r

hclust

Using plot(hclust(dist(x))), I was able to draw a cluster dendrogram. It works, but I would like to get a list of all clusters rather than a tree diagram, because I have a huge amount of data (about 150K nodes) and the plot gets messy.

In other words, let's say a b c is a cluster and d e f g is a cluster; then I would like to get something like this:

1 a,b,c
2 d,e,f,g

Please note that this is not exactly what I want to get as an "output". It is just an example. I would just like to be able to get a list of clusters instead of a tree plot. It could be a vector, a matrix, or just simple numbers that show which groups the elements belong to.

How is this possible?

265

asked Jun 29 '11 09:06

dave

2 Answers

I will use a dataset that ships with R (USArrests) to demonstrate how to cut a tree into a desired number of pieces. The result is a named vector of cluster numbers.

Construct an hclust object.

hc <- hclust(dist(USArrests), "ave")
#plot(hc)

You can now cut the tree into as many branches as you want. For my next trick, I will split the tree into two groups. You set the number of groups with the k parameter. See ?cutree and the use of the parameter h, which cuts the tree at a given height and may be more useful to you (see cutree(hc, k = 2) == cutree(hc, h = 110)).
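The relationship between k and h can be checked directly. The following is a minimal sketch, assuming the same USArrests tree as in this answer; the names top_two and h_cut are illustrative, and the cut height is computed from the tree rather than taken from the answer.

```r
# Rebuild the tree so this block runs on its own.
hc <- hclust(dist(USArrests), "ave")

# hc$height stores the merge heights; for average linkage they are increasing.
top_two <- tail(hc$height, 2)

# Any height strictly between the last two merges leaves exactly two branches.
h_cut <- mean(top_two)  # illustrative choice of cut height

by_k <- cutree(hc, k = 2)      # ask for two groups
by_h <- cutree(hc, h = h_cut)  # ask for a cut at a given height

# Compare the two membership vectors element by element.
all(by_k == by_h)
```

Cutting by h is handy when a distance threshold is more meaningful for the data than a fixed number of groups.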

cutree(hc, k = 2)
       Alabama         Alaska        Arizona       Arkansas     California
             1              1              1              2              1
      Colorado    Connecticut       Delaware        Florida        Georgia
             2              2              1              1              2
        Hawaii          Idaho       Illinois        Indiana           Iowa
             2              2              1              2              2
        Kansas       Kentucky      Louisiana          Maine       Maryland
             2              2              1              2              1
 Massachusetts       Michigan      Minnesota    Mississippi       Missouri
             2              1              2              1              2
       Montana       Nebraska         Nevada  New Hampshire     New Jersey
             2              2              1              2              2
    New Mexico       New York North Carolina   North Dakota           Ohio
             1              1              1              2              2
      Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina
             2              2              2              2              1
  South Dakota      Tennessee          Texas           Utah        Vermont
             2              2              2              2              2
      Virginia     Washington  West Virginia      Wisconsin        Wyoming
             2              2              2              2              2
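To get from this membership vector to the "one line per cluster" shape sketched in the question, one option is base R's split(). This is a minimal sketch under the same USArrests setup; the names groups, clusters and cluster_lines are illustrative and not part of the original answer.

```r
# Rebuild the tree so this block runs on its own.
hc <- hclust(dist(USArrests), "ave")

# Named integer vector: one cluster id per state.
groups <- cutree(hc, k = 2)

# List with one character vector of member names per cluster id.
clusters <- split(names(groups), groups)

# One string per cluster, e.g. "1 Alabama,Alaska,...".
cluster_lines <- paste(names(clusters),
                       sapply(clusters, paste, collapse = ","))

# Print the listing and the cluster sizes.
cat(cluster_lines, sep = "\n")
table(groups)
```

Because no plot is drawn, this form stays readable however many observations there are; only the list elements get longer.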

147

answered Oct 02 '22 18:10

Roman Luštrik

Let's say:

y <- dist(x)
clust <- hclust(y)
groups <- cutree(clust, k = 3)
x <- cbind(x, groups)

Now you have the cluster group for each record. You can subset the dataset as well:

x1 <- subset(x, groups == 1)
x2 <- subset(x, groups == 2)
x3 <- subset(x, groups == 3)
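When the number of groups grows, writing one subset() call per group gets tedious. A possible alternative is split(), which returns all the pieces at once. This is a minimal sketch that assumes x is a data frame and uses USArrests as a stand-in for the real data; by_cluster is an illustrative name.

```r
# Stand-in data; replace with your own data frame.
x <- USArrests

# Same steps as above: distances, tree, then three groups.
groups <- cutree(hclust(dist(x)), k = 3)

# List of data frames, one per cluster, named "1", "2", "3".
by_cluster <- split(x, groups)

# Number of records in each cluster.
sapply(by_cluster, nrow)

# Records of the first cluster, equivalent to subset(x, groups == 1).
head(by_cluster[["1"]])
```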

answered Oct 02 '22 20:10

user2783711
