 

Alternative to scipy.cluster.hierarchy.cut_tree()


I was doing an agglomerative hierarchical clustering experiment in Python 3 and I found that scipy.cluster.hierarchy.cut_tree() does not return the requested number of clusters for some input linkage matrices. So by now I know there is a bug in the cut_tree() function (as described here).

However, I need to be able to get a flat clustering with an assignment of k different labels to my data points. Do you know an algorithm for getting a flat clustering with k labels from an arbitrary input linkage matrix Z? My question boils down to: how can I compute what cut_tree() is computing, from scratch and without bugs?

You can test your code with this dataset.

import numpy as np
from scipy.cluster.hierarchy import linkage, is_valid_linkage
from scipy.spatial.distance import pdist

## Load dataset
X = np.load("dataset.npy")

## Hierarchical clustering
dists = pdist(X)
Z = linkage(dists, method='centroid', metric='euclidean')

print(is_valid_linkage(Z))

## Now let's say we want the flat cluster assignment with 10 clusters.
#  If cut_tree() were working we would do
from scipy.cluster.hierarchy import cut_tree
cut = cut_tree(Z, 10)
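
A quick way to see the problem on this dataset is to count the distinct labels that cut_tree() actually returned; it can differ from the 10 that were requested.

## Sanity check: number of distinct labels actually returned
print(np.unique(cut).size)  # may not equal 10 because of the bug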

Side note: an alternative approach might be to use rpy2's cutree() as a substitute for scipy's cut_tree(), but I have never used it. What do you think?
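
For reference, here is an untested sketch of how that might look with rpy2. It assumes rpy2 plus a working R installation, and it redoes the clustering with R's built-in stats functions instead of reusing scipy's Z; the exact calls are my assumption, not something I have run.

import numpy as np
from rpy2.robjects import numpy2ri
from rpy2.robjects.packages import importr

numpy2ri.activate()                     # let rpy2 convert numpy arrays to R objects
stats = importr('stats')                # R's built-in stats package

X = np.load("dataset.npy")
d = stats.dist(X)                       # Euclidean distances, like pdist(X)
hc = stats.hclust(d, method="centroid")
labels = np.asarray(stats.cutree(hc, k=10))  # flat assignment with 10 clusters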

asked Oct 22 '17 by PDRX

1 Answer

One way to obtain k flat clusters is to use scipy.cluster.hierarchy.fcluster with criterion='maxclust':

from scipy.cluster.hierarchy import fcluster

k = 10  # desired number of flat clusters
clust = fcluster(Z, k, criterion='maxclust')
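
Note that fcluster labels run from 1 to k, and criterion='maxclust' guarantees at most k flat clusters, so in some cases it can return fewer than the requested number.

If you want exactly k clusters straight from the linkage matrix (which is essentially what cut_tree() is meant to compute), a minimal sketch is below. The function name cut_tree_k is mine, not part of scipy, and its label numbering may differ from cut_tree()'s even when the partition is the same: each of the n observations starts in its own cluster, the first n - k merges recorded in Z are replayed in row order, and the k clusters that remain get flat labels.

import numpy as np

def cut_tree_k(Z, k):
    ## Replay the first n - k merges of the linkage matrix Z,
    ## then label the k clusters that remain.
    n = Z.shape[0] + 1                      # number of original observations
    members = {i: [i] for i in range(n)}    # cluster id -> observation indices
    for i in range(n - k):
        a, b = int(Z[i, 0]), int(Z[i, 1])
        members[n + i] = members.pop(a) + members.pop(b)  # cluster formed at row i gets id n + i
    labels = np.empty(n, dtype=int)
    for label, obs in enumerate(members.values()):
        labels[obs] = label
    return labels

labels = cut_tree_k(Z, 10)
print(np.unique(labels).size)               # always exactly 10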
answered Sep 27 '22 by σηγ