
Swap leaves of Python scipy's dendrogram/linkage

I generated a dendrogram plot for my dataset and I am not happy with how the splits at some levels have been ordered. I am therefore looking for a way to swap the two branches (or leaves) of a single split.

In the code and dendrogram plot at the bottom, the two labels 11 and 25 are split away from the rest of the big cluster. I would like the branch with 11 and 25 to be the right branch of that split and the rest of the cluster to be the left branch. The displayed distances would stay the same, so the data would not change, only the aesthetics.

Can this be done, and how? I am specifically looking for a manual intervention because the optimal leaf ordering algorithm supposedly does not work in this case.

import numpy as np
import matplotlib.pyplot as plt

# random data set with two clusters
np.random.seed(65)  # for repeatability of this tutorial
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[10,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[20,])
X = np.concatenate((a, b),)

# create linkage and plot dendrogram    
from scipy.cluster.hierarchy import dendrogram, linkage
Z = linkage(X, 'ward')

plt.figure(figsize=(15, 5))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
    Z,
    leaf_rotation=90.,  # rotates the x axis labels
    leaf_font_size=12.,  # font size for the x axis labels
)
plt.show()

[Figure: the resulting dendrogram plot]

dmeu asked Nov 08 '22 11:11



1 Answer

I had a similar problem and solved it by using the optimal_ordering option of linkage. I attach the code and result for your case; it might not be exactly what you want, but it seems much improved to me.

import numpy as np
import matplotlib.pyplot as plt

# random data set with two clusters
np.random.seed(65)  # for repeatability of this tutorial
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[10,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[20,])
X = np.concatenate((a, b),)

# create linkage and plot dendrogram    
from scipy.cluster.hierarchy import dendrogram, linkage
Z = linkage(X, 'ward', optimal_ordering=True)

plt.figure(figsize=(15, 5))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
    Z,
    leaf_rotation=90.,  # rotates the x axis labels
    leaf_font_size=12.,  # font size for the x axis labels
    distance_sort=False,
    show_leaf_counts=True,
    count_sort=False
)
plt.show()

[Figure: result of using optimal_ordering in linkage]
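
If you still want a purely manual swap of one split, the following is a minimal sketch rather than an official scipy feature. It relies on the observation that, with count_sort and distance_sort left at False, dendrogram draws the cluster in column 0 of a linkage row as the left branch and the one in column 1 as the right branch, so exchanging those two entries in a row flips that split without touching distances or counts. The helpers swap_children and find_parent_row are hypothetical names introduced here, not part of scipy.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage, to_tree

# same data as in the question
np.random.seed(65)
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[10,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[20,])
X = np.concatenate((a, b),)
Z = linkage(X, 'ward')


def swap_children(Z, row):
    """Return a copy of Z with the two child ids of merge `row` exchanged.

    Hypothetical helper (not part of scipy). Columns 2 and 3 (distance
    and observation count) are left untouched.
    """
    Z_swapped = Z.copy()
    Z_swapped[row, [0, 1]] = Z[row, [1, 0]]
    return Z_swapped


def find_parent_row(Z, members):
    """Return the row of Z in which the cluster consisting of exactly
    `members` (original observation indices) is merged with its sibling.

    Hypothetical helper (not part of scipy).
    """
    members = sorted(members)
    # nodes[i] is the ClusterNode with id i (leaves first, then merges)
    _, nodes = to_tree(Z, rd=True)
    for row in range(Z.shape[0]):
        for child in Z[row, :2].astype(int):
            if sorted(nodes[child].pre_order()) == members:
                return row
    raise ValueError("no cluster consists of exactly these observations")


# Flip the top-level split (the last row of Z) as a demonstration.
root = Z.shape[0] - 1
Z_flipped = swap_children(Z, root)

# Compare the leaf order without drawing anything.
before = dendrogram(Z, no_plot=True)['leaves']
after = dendrogram(Z_flipped, no_plot=True)['leaves']
print(before)
print(after)
```

To flip the split from the question instead of the root, something like swap_children(Z, find_parent_row(Z, [11, 25])) should work, assuming 11 and 25 really form a cluster of their own in Z as the plot suggests. Pass the resulting matrix to dendrogram as usual (without no_plot) to draw it, and keep count_sort and distance_sort at False, otherwise dendrogram reorders the branches again.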

rtomas answered Nov 28 '22 12:11
