I was running an agglomerative hierarchical clustering experiment in Python 3 and found that scipy.cluster.hierarchy.cut_tree() does not return the requested number of clusters for some input linkage matrices, so by now I know there is a bug in the cut_tree() function (as described here).
However, I need a flat clustering that assigns k different labels to my data points. What is the algorithm to get a flat clustering with k labels from an arbitrary input linkage matrix Z? My question boils down to: how can I compute what cut_tree() is computing, from scratch and without the bug?
You can test your code with this dataset.
import numpy as np
from scipy.cluster.hierarchy import linkage, is_valid_linkage
from scipy.spatial.distance import pdist
## Load dataset
X = np.load("dataset.npy")
## Hierarchical clustering
dists = pdist(X)
Z = linkage(dists, method='centroid', metric='euclidean')
print(is_valid_linkage(Z))
## Now let's say we want the flat cluster assignment with 10 clusters.
# If cut_tree() were working, we would do:
from scipy.cluster.hierarchy import cut_tree
cut = cut_tree(Z, 10)
Side note: an alternative approach could be to use rpy2's cutree() as a substitute for scipy's cut_tree(), but I have never used it. What do you think?
One way to obtain k flat clusters is to use scipy.cluster.hierarchy.fcluster with criterion='maxclust':
from scipy.cluster.hierarchy import fcluster

k = 10  # desired number of flat clusters (10 for the example above)
clust = fcluster(Z, k, criterion='maxclust')
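Because criterion='maxclust' forms at most k flat clusters (it may produce fewer in some cases), it is worth checking how many distinct labels actually came back, for example with np.unique (numpy imported as np, as in the question):

print(len(np.unique(clust)))  # number of distinct cluster labels actually assigned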
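If you specifically want to reproduce what cut_tree() computes from scratch, one option is to replay the merges recorded in the linkage matrix: in scipy's format, row i of Z merges the two clusters whose ids are stored in Z[i, 0] and Z[i, 1] into a new cluster with id n + i, so applying only the first n - k merges leaves exactly k clusters. Below is a minimal sketch under that assumption (the helper name cut_linkage is made up for illustration, it is not part of scipy):

import numpy as np

def cut_linkage(Z, k):
    # A scipy linkage matrix has n - 1 rows for n original observations.
    n = Z.shape[0] + 1
    # Start with every observation in its own singleton cluster (ids 0..n-1).
    clusters = {i: [i] for i in range(n)}
    # Row i of Z merges clusters Z[i, 0] and Z[i, 1] into a new cluster with id n + i.
    # Replaying only the first n - k merges leaves exactly k clusters.
    for i in range(n - k):
        a, b = int(Z[i, 0]), int(Z[i, 1])
        clusters[n + i] = clusters.pop(a) + clusters.pop(b)
    # Relabel the surviving clusters 0..k-1 and assign a label to each observation.
    labels = np.empty(n, dtype=int)
    for label, members in enumerate(clusters.values()):
        labels[members] = label
    return labels

labels = cut_linkage(Z, 10)
print(len(np.unique(labels)))  # exactly 10, regardless of the linkage method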