Display cluster labels for a scipy dendrogram

I'm using hierarchical clustering to cluster word vectors, and I want the user to be able to display a dendrogram showing the clusters. However, since there can be thousands of words, I want this dendrogram to be truncated to some reasonable size, with the label for each leaf being a string of the most significant words in that cluster.

My problem is that, according to the docs, "The labels[i] value is the text to put under the ith leaf node only if it corresponds to an original observation and not a non-singleton cluster." I take this to mean I can't label clusters, only individual points?

To illustrate, here is a short Python script which generates a simple labeled dendrogram:

import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt

randomMatrix = np.random.uniform(-10,10,size=(20,3))
linked = linkage(randomMatrix, 'ward')

labelList = ["foo" for i in range(0, 20)]

plt.figure(figsize=(15, 12))
dendrogram(
            linked,
            orientation='right',
            labels=labelList,
            distance_sort='descending',
            show_leaf_counts=False
          )
plt.show()

[Image: a dendrogram of randomly generated points]

Now let's say I want to truncate to just 5 leaves, and label each leaf like "foo, foo, foo...", i.e. with the words that make up that cluster. (Note: generating these labels is not the issue here.) I truncate it, and supply a label list to match:

labelList = ["foo, foo, foo..." for i in range(0, 5)]
dendrogram(
            linked,
            orientation='right',
            p=5,
            truncate_mode='lastp',
            labels=labelList,
            distance_sort='descending',
            show_leaf_counts=False
          )

And here's the problem: no labels appear.

[Image: the truncated dendrogram, with no leaf labels]

I'm thinking there might be a use here for the parameter 'leaf_label_func' but I'm not sure how to use it.
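For reference, here is a minimal sketch of how leaf_label_func might be used for this. Per the SciPy docs, the callable receives a cluster index k and returns the label string; indices below n are original observations, and indices n and above are merged clusters. The sketch uses to_tree(..., rd=True) to look up which observations belong to each cluster. The words list, the cluster_label helper, and the max_words cutoff are hypothetical stand-ins, and no_plot=True is used only so the labels can be inspected without drawing anything.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage, to_tree

# Hypothetical stand-in data: 20 random points, one placeholder word each
rng = np.random.default_rng(0)
points = rng.uniform(-10, 10, size=(20, 3))
words = ["word%d" % i for i in range(20)]

linked = linkage(points, 'ward')

# With rd=True, to_tree also returns a list where nodes[k] is the
# ClusterNode whose id is k (leaves first, then merged clusters)
_, nodes = to_tree(linked, rd=True)


def cluster_label(k, max_words=3):
    """Build a label from the first few words under cluster index k."""
    members = nodes[k].pre_order()  # ids of the original observations in k
    shown = ", ".join(words[i] for i in members[:max_words])
    return shown + ("..." if len(members) > max_words else "")


result = dendrogram(
    linked,
    orientation='right',
    p=5,
    truncate_mode='lastp',
    leaf_label_func=cluster_label,  # used instead of labels=
    distance_sort='descending',
    no_plot=True,  # only compute; 'ivl' holds the leaf label strings
)
print(result['ivl'])
```

Dropping no_plot=True and calling plt.show() afterwards should draw the truncated dendrogram with these strings as the leaf labels. Picking the "most significant" words, rather than the first few, would go inside cluster_label.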

asked Mar 08 '16 at 16:03 by EmmetOT


1 Answer

You can simply write:

hierarchy.dendrogram(Z, labels=label_list)

Here is a good example, using a pandas DataFrame:

import numpy as np
import pandas as pd
from scipy.cluster import hierarchy
import matplotlib.pyplot as plt

data = [[24, 16], [13, 4], [24, 11], [34, 18], [41, 6], [35, 13]]
frame = pd.DataFrame(
    np.array(data),
    columns=["Rape", "Murder"],
    index=["Atlanta", "Boston", "Chicago", "Dallas", "Denver", "Detroit"],
)

# Single-linkage clustering of the six rows
Z = hierarchy.linkage(frame, 'single')
plt.figure()
# Label each leaf with its row index (the city name), passed as a plain list
dn = hierarchy.dendrogram(Z, labels=frame.index.tolist())
plt.show()
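Note that this labels a full, untruncated dendrogram. As a rough sketch of what the labels argument does once truncation is turned on, the 'ivl' list returned by dendrogram can be inspected with no_plot=True. This uses the same six points without pandas; p=3 is an arbitrary choice. Leaves that are still original observations should keep their label, while contracted clusters fall back to a leaf count in parentheses.

```python
import numpy as np
from scipy.cluster import hierarchy

data = np.array([[24, 16], [13, 4], [24, 11], [34, 18], [41, 6], [35, 13]])
cities = ["Atlanta", "Boston", "Chicago", "Dallas", "Denver", "Detroit"]

Z = hierarchy.linkage(data, 'single')

# Truncate to 3 leaves; labels must still have one entry per observation
truncated = hierarchy.dendrogram(
    Z,
    p=3,
    truncate_mode='lastp',
    labels=cities,
    no_plot=True,  # compute the layout and labels without drawing
)
leaf_labels = truncated['ivl']
print(leaf_labels)
```

To put custom text on the contracted clusters as well, leaf_label_func is the parameter to look at, since labels only applies to original observations.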
answered Sep 21 '22 at 00:09 by Mohammad Forouhesh