I have been experimenting with hierarchical clustering. In R it is very simple: hclust(as.dist(X), method="average"). I found a method in Python that is pretty simple as well, except I'm a little confused about what's going on with my input distance matrix.
I have a similarity matrix (DF_c93tom, with a smaller test version called DF_sim) that I convert into a dissimilarity matrix: DF_dissm = 1 - DF_sim.
I use this as input to linkage from scipy, but the documentation says it takes a square or triangular matrix. I get a different clustering for each of the lower triangle, the upper triangle, and the square matrix. Why is this? The documentation asks for an upper triangle, but the clustering from the lower triangle looks really similar.
My question: why are all the clusterings different, and which one is correct?
This is the documentation for the input distance matrix of linkage:
y : ndarray
A condensed or redundant distance matrix. A condensed distance matrix is a flat array containing the upper triangular of the distance matrix.
Here is my code:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage
%matplotlib inline
#Test Data
DF_sim = DF_c93tom.iloc[:10,:10] #Similarity Matrix
DF_sim.columns = DF_sim.index = range(10)
#print(DF_sim)
# 0 1 2 3 4 5 6 7 8 9
# 0 1.000000 0 0.395833 0.083333 0 0 0 0 0 0
# 1 0.000000 1 0.000000 0.000000 0 0 0 0 0 0
# 2 0.395833 0 1.000000 0.883792 0 0 0 0 0 0
# 3 0.083333 0 0.883792 1.000000 0 0 0 0 0 0
# 4 0.000000 0 0.000000 0.000000 1 0 0 0 0 0
# 5 0.000000 0 0.000000 0.000000 0 1 0 0 0 0
# 6 0.000000 0 0.000000 0.000000 0 0 1 0 0 0
# 7 0.000000 0 0.000000 0.000000 0 0 0 1 0 0
# 8 0.000000 0 0.000000 0.000000 0 0 0 0 1 0
# 9 0.000000 0 0.000000 0.000000 0 0 0 0 0 1
#Dissimilarity Matrix
DF_dissm = 1 - DF_sim
#Redundant Matrix
#np.tril(DF_dissm).T == np.triu(DF_dissm)
#True for all values
#Hierarchical Clustering for square and triangle matrices
fig_1 = plt.figure(1)
plt.title("Square")
Z_square = linkage((DF_dissm.values),method="average")
dendrogram(Z_square)
fig_2 = plt.figure(2)
plt.title("Triangle Upper")
Z_triu = linkage(np.triu(DF_dissm.values),method="average")
dendrogram(Z_triu)
fig_3 = plt.figure(3)
plt.title("Triangle Lower")
Z_tril = linkage(np.tril(DF_dissm.values),method="average")
dendrogram(Z_tril)
plt.show()
When a 2D array is passed as the first argument to scipy.cluster.hierarchy.linkage, it is treated as a sequence of observations, and scipy.spatial.distance.pdist is used to convert it to a sequence of pairwise distances between those observations. There is a GitHub issue regarding this behavior, since it means that passing a "distance matrix" such as DF_dissm.values (silently) produces an incorrect result: each row of the matrix is clustered as if it were a point.
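The behavior described above can be checked directly. The following is a minimal sketch, assuming a small random point set (the names points and dist are illustrative, not from the question): it compares linkage on a square matrix with linkage on pdist of that same matrix, and with linkage on the properly condensed form.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

# Hypothetical test data: 6 random 2D points and their square distance matrix.
rng = np.random.default_rng(0)
points = rng.random((6, 2))
dist = squareform(pdist(points))  # square, symmetric, zero diagonal

# Passing the square matrix: its rows are treated as observations.
# (Some SciPy versions emit a ClusterWarning here.)
Z_square = linkage(dist, method="average")

# Explicitly treating the rows as observations gives the same linkage.
Z_rows = linkage(pdist(dist), method="average")

# Passing the condensed (1D) form clusters on the actual distances.
Z_condensed = linkage(squareform(dist), method="average")

same_as_rows = np.allclose(Z_square, Z_rows)
same_as_condensed = np.allclose(Z_square, Z_condensed)
print(same_as_rows, same_as_condensed)
```

The first comparison should agree and the second should not, which is why the square, upper-triangle, and lower-triangle inputs in the question each give a different tree: they are three different sets of "observations".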
So the upshot is that none of these
Z_square = linkage((DF_dissm.values),method="average")
Z_triu = linkage(np.triu(DF_dissm.values),method="average")
Z_tril = linkage(np.tril(DF_dissm.values),method="average")
produces the desired result. Instead, extract the condensed upper triangle with np.triu_indices (here arr is the square distance matrix):
n = arr.shape[0]
Z = linkage(arr[np.triu_indices(n, 1)], method="average")
or use spatial.distance.squareform:
from scipy.spatial import distance as ssd
Z = linkage(ssd.squareform(arr), method="average")
or apply spatial.distance.pdist to the original points:
Z = linkage(ssd.pdist(points), method="average")
or pass the 2D array of points directly:
Z = linkage(points, method="average")
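Applied to the question's data, the squareform route might look like the sketch below. It rebuilds only the upper-left 4x4 corner of the similarity matrix printed in the question (an assumption made to keep the example small), converts it to a dissimilarity matrix, and condenses it before calling linkage.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Upper-left 4x4 corner of the similarity matrix shown in the question.
sim = np.eye(4)
sim[0, 2] = sim[2, 0] = 0.395833
sim[0, 3] = sim[3, 0] = 0.083333
sim[2, 3] = sim[3, 2] = 0.883792

# Dissimilarity matrix: symmetric with a zero diagonal, as squareform requires.
dissm = 1 - sim

# Condensed form: the n*(n-1)/2 entries above the diagonal, as a flat array.
condensed = squareform(dissm)

Z = linkage(condensed, method="average")
print(Z)
```

Items 2 and 3 are the most similar pair (similarity 0.883792), so they are merged first, at height 1 - 0.883792.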
A complete example comparing the four approaches:
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster import hierarchy as hier
from scipy.spatial import distance as ssd
np.random.seed(2016)
points = np.random.random((10, 2))
arr = ssd.cdist(points, points)
fig, ax = plt.subplots(nrows=4)
ax[0].set_title("condensed upper triangular")
Z = hier.linkage(arr[np.triu_indices(arr.shape[0], 1)], method="average")
hier.dendrogram(Z, ax=ax[0])
ax[1].set_title("squareform")
Z = hier.linkage(ssd.squareform(arr), method="average")
hier.dendrogram(Z, ax=ax[1])
ax[2].set_title("pdist")
Z = hier.linkage(ssd.pdist(points), method="average")
hier.dendrogram(Z, ax=ax[2])
ax[3].set_title("sequence of observations")
Z = hier.linkage(points, method="average")
hier.dendrogram(Z, ax=ax[3])
plt.show()
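The comparison can also be made numerically rather than by eye. This sketch reuses the same seed and point set as the example above and checks that the four linkage matrices agree; the names results and all_equal are illustrative.

```python
import numpy as np
from scipy.cluster import hierarchy as hier
from scipy.spatial import distance as ssd

np.random.seed(2016)
points = np.random.random((10, 2))
arr = ssd.cdist(points, points)  # square distance matrix

results = [
    # condensed upper triangle extracted by index
    hier.linkage(arr[np.triu_indices(arr.shape[0], 1)], method="average"),
    # condensed form via squareform
    hier.linkage(ssd.squareform(arr), method="average"),
    # condensed distances computed from the points
    hier.linkage(ssd.pdist(points), method="average"),
    # raw observations; linkage calls pdist internally
    hier.linkage(points, method="average"),
]

# Compare every linkage matrix against the first one.
all_equal = all(np.allclose(results[0], Z) for Z in results[1:])
print(all_equal)
```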