I have the following dataset:
data<-data.frame(X=c(1,2,3,4),Y=c(1,3,2,1))
for(i in 1:nrow(data)){ data[i,i]<-NA}
colnames(data) <- c("A","B","C","D")
rownames(data) <- c("A","B","C","D")
plot(hclust(dist(data)))
and the result is the image below:
But I am wondering how this plot is drawn. Here, I am trying to obtain the dendrogram step by step. We know that the distance matrix at the beginning is as follows:
Every time, we find the two points with the minimum distance and merge them into a single cluster.
So the first merge is B and C, and we update the distance matrix.
Again we find the two points with the minimum distance, which are D and the cluster of B and C.
Again we update the distance matrix.
As a result, I should have the following merges:
But this contradicts what the R plot produced. How do you explain it?
Your steps describe single linkage rather than the default complete linkage. I'll do my best to explain how I see this working. I believe this is as simple as the method argument used in hclust. The default method for hclust does not follow the algorithm that you laid out, but we can adjust the method so it does.
But first, I am getting an error on the plot you are trying to make:
> data<-data.frame(X=c(1,2,3,4),Y=c(1,3,2,1))
> for(i in 1:nrow(data)){ data[i,i]<-NA}
> colnames(data) <- c("A","B","C","D")
> rownames(data) <- c("A","B","C","D")
> plot(hclust(dist(data)))
Error in hclust(dist(data)) :
NA/NaN/Inf in foreign function call (arg 11)
What is your intention with the for(i in 1:nrow(data)){ data[i,i]<-NA} line? After that line, your data object looks like this:
   X  Y V3 V4
1 NA  1 NA NA
2  2 NA NA NA
3  3  2 NA NA
4  4  1 NA NA
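If the loop was only meant to blank the diagonal of the distance matrix for display (that is my assumption about the intent, not something the question states), a minimal sketch would do it on a separate copy, so the object passed to hclust stays free of NA values. The name shown is hypothetical:

```r
dt <- data.frame(X = c(1, 2, 3, 4), Y = c(1, 3, 2, 1))
rownames(dt) <- c("A", "B", "C", "D")

d <- dist(dt)            # keep this object for hclust()
shown <- as.matrix(d)    # square copy, used only for printing
diag(shown) <- NA        # blank the zero self-distances on the copy
print(round(shown, 3))
```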
However, if we can just start with the following code, we can generate the desired tree as follows:
dt<-data.frame(X = c(1, 2, 3, 4), Y = c(1, 3, 2, 1))
rownames(dt) <- c("A", "B", "C", "D")
dt<-dist(dt)
plot(hclust(dt, method = "single"))
Note the change in the hclust call to method = "single". The default is method = "complete". Complete linkage does not measure the distance between two clusters by their closest members but by their farthest members (the largest intercluster distance), so the merge order can differ from the one you worked out. Here is some material from the fantastic Introduction to Statistical Learning with Applications in R, which describes the various linkage methods available:
This text, by James, Witten, Hastie, and Tibshirani, is available as a free download at the link above. The section on hierarchical clustering starts on page 390. Please let me know if this helps clear things up.
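To make the difference concrete, here is a small sketch that recomputes the second step by hand. It assumes, as in the question, that B and C have already been merged, and then measures how far A and D are from that cluster under each linkage. The variable names (m, bc, single_A and so on) are only illustrative:

```r
# Coordinates from the question
dt <- data.frame(X = c(1, 2, 3, 4), Y = c(1, 3, 2, 1))
rownames(dt) <- c("A", "B", "C", "D")
m <- as.matrix(dist(dt))

# Assumed first merge, as in the question: the cluster {B, C}
bc <- c("B", "C")

# Single linkage: distance to the closest member of the cluster
single_A <- min(m["A", bc])
single_D <- min(m["D", bc])

# Complete linkage: distance to the farthest member of the cluster
complete_A <- max(m["A", bc])
complete_D <- max(m["D", bc])

print(c(single_A = single_A, single_D = single_D))
print(c(complete_A = complete_A, complete_D = complete_D))

# Merge heights reported by hclust for each method
h_single <- hclust(dist(dt), method = "single")$height
h_complete <- hclust(dist(dt), method = "complete")$height
print(h_single)
print(h_complete)
```

Under single linkage D is the closer of the two to {B, C}, which matches the hand calculation in the question; under complete linkage A is closer, which is why the default plot looks different.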
I think you are calling the dist function in the wrong manner and perhaps too many times. Try this:
dt<-data.frame(X=c(1,2,3,4),Y=c(1,3,2,1))
rownames(dt) <- c("A","B","C","D")
dt<-dist(dt)
plot(hclust(dt))
Effectively, you were calling dist on an object that was already of class dist, which you then turned into a matrix and called dist on again within your call to plot.
We can examine just the distance object as follows:
> dt
         A        B        C
B 2.236068
C 2.236068 1.414214
D 3.000000 2.828427 1.414214
There is no need to call dist on this object again before passing it to the hclust function.
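As a rough illustration of why the second call matters, the sketch below compares the distances between the points with what dist returns when it is applied to the distance matrix itself. In that case each row of the matrix is treated as a new point, so the numbers change. The names d_once, d_twice and d_back are made up for this example; as.dist is the base R function for turning a square distance matrix back into a dist object without recomputing anything:

```r
dt <- data.frame(X = c(1, 2, 3, 4), Y = c(1, 3, 2, 1))
rownames(dt) <- c("A", "B", "C", "D")

d_once <- dist(dt)                   # distances between the four points
d_twice <- dist(as.matrix(d_once))   # distances between rows of the distance matrix

print(d_once)
print(d_twice)

# If you already hold a square distance matrix, convert it instead of recomputing
d_back <- as.dist(as.matrix(d_once))
print(d_back)
```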