

How to draw hierarchical clustering?

Tags:

r

I have the following dataset:

data<-data.frame(X=c(1,2,3,4),Y=c(1,3,2,1))
for(i in 1:nrow(data)){ data[i,i]<-NA}
colnames(data) <- c("A","B","C","D")
rownames(data) <- c("A","B","C","D")
plot(hclust(dist(data)))

and the result is the image below:

[Image: https://i.stack.imgur.com/ydsK1.png]

But I am wondering how this plot is drawn. Here, I am trying to obtain the dendrogram step by step. We know that the distance matrix at the beginning is as follows:

[Image: https://i.stack.imgur.com/qs4da.png]

At each step, we find the two points with the minimum distance and merge them into a single cluster:

[Image: https://i.stack.imgur.com/sVgTF.png]

So, the first merge is B and C. We then update the distance matrix:

[Image: https://i.stack.imgur.com/TAiNi.png]

Again we find the two points with the minimum distance, which are D and the cluster B,C:

[Image: https://i.stack.imgur.com/N3yhd.png]

Again we update the distance matrix

[Image: https://i.stack.imgur.com/3a4j3.png]

As a result, I should have the following merges:

1. B and C
2. B,C and D
3. B,C,D and A

But this contradicts the plot that R produced. How can the difference be explained?


asked Apr 28 '17 14:04

Sal-laS

1 Answer

Updated Response - Using `single` linkage rather than the default `complete` linkage.

I'll do my best to explain how I see this working. I believe it comes down to the method argument of hclust. The default method does not follow the algorithm that you laid out, but we can change the method so that it does.

But first, I am getting an error on the plot you are trying to make:

> data<-data.frame(X=c(1,2,3,4),Y=c(1,3,2,1))
> for(i in 1:nrow(data)){ data[i,i]<-NA}
> colnames(data) <- c("A","B","C","D")
> rownames(data) <- c("A","B","C","D")
> plot(hclust(dist(data)))
Error in hclust(dist(data)) : 
  NA/NaN/Inf in foreign function call (arg 11)

What is your intention with the for(i in 1:nrow(data)){ data[i,i]<-NA} line? After that line, your data object looks like this:

   X  Y V3 V4
1 NA  1 NA NA
2  2 NA NA NA
3  3  2 NA NA
4  4  1 NA NA
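If the loop was meant to blank out the diagonal of a distance matrix (an assumption on my part, since the question does not say), one possible sketch is to build the square matrix first and convert it with as.dist, which only reads the lower triangle. The names pts, m, d and hc are just illustrative:

```r
# Assumption: the NA loop was intended to blank the diagonal of a
# 4 x 4 distance matrix, not to modify the raw coordinates.
pts <- data.frame(X = c(1, 2, 3, 4), Y = c(1, 3, 2, 1),
                  row.names = c("A", "B", "C", "D"))

m <- as.matrix(dist(pts))  # full symmetric matrix of pairwise distances
diag(m) <- NA              # blank the diagonal

d <- as.dist(m)            # as.dist() keeps only the lower triangle
hc <- hclust(d)            # no NA values reach hclust
plot(hc)
```

Because the diagonal is never read, the NA values there do not trigger the error shown above.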

However, if we start with the following code, we can generate the desired tree:

dt<-data.frame(X = c(1, 2, 3, 4), Y = c(1, 3, 2, 1))
rownames(dt) <- c("A", "B", "C", "D")
dt<-dist(dt)
plot(hclust(dt, method = "single"))

[Image: https://i.stack.imgur.com/vdLNO.png]

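Note the change on the hclust call to method = "single". The default is method = "complete". Both methods merge, at every step, the two clusters with the smallest intercluster distance; they differ in how that distance is defined. Single linkage uses the smallest pairwise distance between members of the two clusters, which is the algorithm you laid out, while complete linkage uses the largest pairwise distance. Here, after B and C merge, D is closest to the cluster B,C under single linkage (1.414), but A is closest under complete linkage (2.236, versus 2.828 for D). The following excerpt from the fantastic Introduction to Statistical Learning with Applications in R describes the various linkage methods available: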

[Image: https://i.stack.imgur.com/npJob.png]

This text, by James, Witten, Hastie, and Tibshirani, is available as a free download at the link above. The section on hierarchical clustering starts on page 390. Please let me know if this helps clear things up.
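To make the difference between the two linkage rules concrete, here is a rough sketch that redoes the second step by hand for the same four points and then compares it with what hclust reports. The variable names (pts, m, to_single, to_complete, h_single, h_complete) are my own and only for illustration:

```r
pts <- data.frame(X = c(1, 2, 3, 4), Y = c(1, 3, 2, 1),
                  row.names = c("A", "B", "C", "D"))
d <- dist(pts)
m <- as.matrix(d)

# After the first merge of B and C, the distance from that cluster to each
# remaining point depends on the linkage rule.
to_single   <- pmin(m["B", c("A", "D")], m["C", c("A", "D")])  # smallest pairwise distance
to_complete <- pmax(m["B", c("A", "D")], m["C", c("A", "D")])  # largest pairwise distance

names(which.min(to_single))    # point nearest to the cluster under single linkage
names(which.min(to_complete))  # point nearest to the cluster under complete linkage

# Merge heights reported by hclust for each method.
h_single   <- hclust(d, method = "single")$height
h_complete <- hclust(d, method = "complete")$height
h_single
h_complete
```

Under single linkage D is nearest to the merged cluster, while under complete linkage A is, which is why the two dendrograms attach A and D in a different order.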

Original Response

I think you are calling the dist function incorrectly, and perhaps more often than needed. Try this:

dt<-data.frame(X=c(1,2,3,4),Y=c(1,3,2,1))
rownames(dt) <- c("A","B","C","D")
dt<-dist(dt)
plot(hclust(dt))

[Image: https://i.stack.imgur.com/iIMKC.png]

Effectively, you started from an object that was already of class dist, turned it into a matrix, and then called dist on it again inside your call to plot.

We can examine just the distance object as follows:

> dt
         A        B        C
B 2.236068                  
C 2.236068 1.414214         
D 3.000000 2.828427 1.414214

There is no need to call dist on this object again before passing it to the hclust function.
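As a small sketch of why the second call matters, assuming the same four points as above: applying dist to the matrix form of dt treats each row of distances as a new set of coordinates, so the result no longer matches the original distances. The name dt2 is only for illustration:

```r
dt <- dist(data.frame(X = c(1, 2, 3, 4), Y = c(1, 3, 2, 1),
                      row.names = c("A", "B", "C", "D")))

# Calling dist() again treats each row of the distance matrix as coordinates.
dt2 <- dist(as.matrix(dt))

# Compare the two sets of pairwise values; they are not the same distances.
isTRUE(all.equal(as.vector(dt), as.vector(dt2)))
```

A tree built from dt2 therefore clusters rows of the distance matrix rather than the original points.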


answered Oct 01 '22 04:10

Nick Criswell
