I'm using hierarchical clustering to cluster word vectors, and I want the user to be able to display a dendrogram showing the clusters. However, since there can be thousands of words, I want this dendrogram to be truncated to some reasonable size, with the label for each leaf being a string of the most significant words in that cluster.
My problem is that, according to the docs, "The labels[i] value is the text to put under the ith leaf node only if it corresponds to an original observation and not a non-singleton cluster." I take this to mean I can't label clusters, only singular points?
To illustrate, here is a short Python script that generates a simple labeled dendrogram:
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt
randomMatrix = np.random.uniform(-10,10,size=(20,3))
linked = linkage(randomMatrix, 'ward')
labelList = ["foo" for i in range(0, 20)]
plt.figure(figsize=(15, 12))
dendrogram(
    linked,
    orientation='right',
    labels=labelList,
    distance_sort='descending',
    show_leaf_counts=False
)
plt.show()
Now let's say I want to truncate to just 5 leaves and label each leaf like "foo, foo, foo...", i.e. with the words that make up that cluster. (Note: generating these labels is not the issue here.) I truncate it and supply a label list to match:
labelList = ["foo, foo, foo..." for i in range(0, 5)]
dendrogram(
    linked,
    orientation='right',
    p=5,
    truncate_mode='lastp',
    labels=labelList,
    distance_sort='descending',
    show_leaf_counts=False
)
And here's the problem: the resulting dendrogram has no labels.
I'm thinking there might be a use here for the leaf_label_func parameter, but I'm not sure how to use it.
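One way leaf_label_func could be used here is sketched below. It relies on SciPy's documented convention that the function receives a cluster id, where ids below n are original observations and id n + k is the cluster formed by row k of the linkage matrix. The `words` list, the `members` mapping, and the `leaf_label` helper are hypothetical names introduced for illustration, and "first three members" stands in for whatever "most significant words" means in the real application.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)
points = rng.uniform(-10, 10, size=(20, 3))
words = [f"word{i}" for i in range(len(points))]  # hypothetical stand-in labels
linked = linkage(points, "ward")
n = len(words)

# Map every cluster id to the original observations it contains.
# Ids below n are single observations; id n + k is built by row k of `linked`.
members = {i: [i] for i in range(n)}
for k, (left, right, _dist, _size) in enumerate(linked):
    members[n + k] = members[int(left)] + members[int(right)]


def leaf_label(cluster_id, max_words=3):
    """Return a short comma-separated label for one (possibly merged) leaf."""
    names = [words[i] for i in members[cluster_id]]
    suffix = "..." if len(names) > max_words else ""
    return ", ".join(names[:max_words]) + suffix


result = dendrogram(
    linked,
    orientation="right",
    p=5,
    truncate_mode="lastp",
    leaf_label_func=leaf_label,
    distance_sort="descending",
    no_plot=True,  # remove this and call plt.show() to draw the figure
)
print(result["ivl"])  # the five leaf labels, in plotting order
```

With no_plot=True the call only returns the layout dictionary, which makes it easy to inspect the labels before drawing. Note that labels is not passed at all in this sketch; the callback alone supplies the text for every leaf.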
To get the optimal number of clusters for hierarchical clustering, we make use of a dendrogram, which is a tree-like chart that shows the sequence of merges or splits of clusters. If two clusters are merged, the dendrogram joins them in the graph, and the height of the join is the distance between those clusters.
A dendrogram is a type of tree diagram showing hierarchical clustering, that is, relationships between similar sets of data. Dendrograms are frequently used in biology to show clustering between genes or samples, but they can represent any type of grouped data.
In the dendrogram, locate the largest vertical gap between successive merge heights and draw a horizontal line through the middle of it. The number of vertical lines this line intersects is taken as the optimal number of clusters (when distances are calculated using the method set in linkage).
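The largest-gap rule can also be applied numerically rather than by eye. The following is a minimal sketch, assuming SciPy's linkage and fcluster and a made-up data set of two well-separated blobs; it is a heuristic, not a guarantee of the "right" number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Hypothetical data: two well-separated blobs of 10 points each.
data = np.vstack([
    rng.normal(0, 1, size=(10, 2)),
    rng.normal(50, 1, size=(10, 2)),
])
Z = linkage(data, "ward")

heights = Z[:, 2]              # merge distances, in the order merges happen
gaps = np.diff(heights)        # vertical gap between successive merges
i = int(np.argmax(gaps))       # index of the largest gap
cut = (heights[i] + heights[i + 1]) / 2  # "horizontal line" through its middle

# Cutting at `cut` keeps merges 0..i and undoes every later one.
assignments = fcluster(Z, t=cut, criterion="distance")
n_clusters = len(np.unique(assignments))
print(n_clusters)
```

Cutting in the middle of the gap after merge i leaves len(data) - (i + 1) clusters, which matches the count of lines the horizontal cut would cross in the plot.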
The agglomerative hierarchical clustering algorithms available in this program module build a cluster hierarchy that is commonly displayed as a tree diagram called a dendrogram. They begin with each object in a separate cluster. At each step, the two clusters that are most similar are joined into a single new cluster.
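The merge sequence described above can be read directly from a linkage matrix. The sketch below uses SciPy's linkage as a stand-in for whichever program module the paragraph refers to (an assumption), with four made-up one-dimensional points. Each row of the matrix records one step: the two cluster ids joined, the distance between them, and the size of the new cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four hypothetical 1-D observations, chosen so no two merges tie.
points = np.array([[0.0], [1.0], [5.0], [7.0]])
Z = linkage(points, "single")
n = len(points)

# Row k joins two existing clusters into a new cluster with id n + k.
for step, (a, b, dist, size) in enumerate(Z):
    print(
        f"step {step}: merge {int(a)} and {int(b)} at distance {dist:g} "
        f"-> cluster {n + step} ({int(size)} points)"
    )
```

Every observation starts as its own cluster (ids 0 to 3), and after n - 1 steps a single cluster containing all points remains.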
To label the leaves of an untruncated dendrogram, you can simply write:
hierarchy.dendrogram(Z, labels=label_list)
Here is a good example using a pandas DataFrame:
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy
import matplotlib.pyplot as plt
data = [[24, 16], [13, 4], [24, 11], [34, 18], [41, 6], [35, 13]]
frame = pd.DataFrame(
    np.array(data),
    columns=["Rape", "Murder"],
    index=["Atlanta", "Boston", "Chicago", "Dallas", "Denver", "Detroit"],
)
Z = hierarchy.linkage(frame, 'single')
plt.figure()
# Pass the index as a plain list so each leaf is labeled with its city name.
dn = hierarchy.dendrogram(Z, labels=frame.index.tolist())
plt.show()