
HDBSCAN Python choose number of clusters

Is it possible to select the number of clusters in the HDBSCAN algorithm in Python? Or is the only way to play around with input parameters such as alpha and min_cluster_size?

Thanks

UPDATE: here is code that uses fcluster with hdbscan:

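import hdbscan
from scipy.cluster.hierarchy import fcluster

# X is the data to cluster, an array of shape (n_samples, n_features)
clusterer = hdbscan.HDBSCAN()
clusterer.fit(X)
# Export the single-linkage tree as a SciPy-style linkage matrix
Z = clusterer.single_linkage_tree_.to_numpy()
# Cut the tree so that at most 2 flat clusters remain
labels = fcluster(Z, 2, criterion='maxclust')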
user1571823 asked Jan 15 '18 18:01



2 Answers

Thankfully, in June 2020 a contributor on GitHub added a module for flat clustering to hdbscan, which lets us choose the number of resulting clusters.

To do so:

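from hdbscan import flat

# train_df: training data, n_clusters: the desired number of clusters
clusterer = flat.HDBSCAN_flat(train_df, n_clusters, prediction_data=True)
# Assign new points to the flat clusters found above
flat.approximate_predict_flat(clusterer, points_to_predict, n_clusters)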

You can find the code in flat.py. You should be able to choose the number of clusters for test points using approximate_predict_flat.

In addition, a Jupyter notebook explaining how to use it has also been written.

Lib101 answered Nov 10 '22 10:11



If you explicitly need to get a fixed number of clusters then the closest thing to managing that would be to use the cluster hierarchy and perform a flat cut through the hierarchy at the level that gives you the desired number of clusters. That does involve working with one of the tree objects that HDBSCAN exposes and getting your hands a little dirty, but it can be done.
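As a rough illustration of what such a flat cut does, here is a minimal sketch in plain Python. It assumes a SciPy-style linkage matrix (rows of left id, right id, distance, size, sorted by distance), which is the format that clusterer.single_linkage_tree_.to_numpy() in the question produces. The function name cut_linkage and the toy matrix Z are made up for this example and are not part of hdbscan.

```python
def cut_linkage(Z, n_clusters):
    """Cut a SciPy-style linkage matrix into n_clusters flat clusters.

    Z has n - 1 rows of (left, right, distance, size), sorted by distance.
    Row i creates a new cluster whose id is n + i.
    """
    n_points = len(Z) + 1
    if not 1 <= n_clusters <= n_points:
        raise ValueError("n_clusters must be between 1 and the number of points")

    # Union-find over the original points and the merged nodes.
    parent = list(range(2 * n_points - 1))

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Applying only the first n_points - n_clusters merges (the closest ones)
    # leaves exactly n_clusters groups.
    for i in range(n_points - n_clusters):
        left, right = int(Z[i][0]), int(Z[i][1])
        new_node = n_points + i
        parent[find(left)] = new_node
        parent[find(right)] = new_node

    # Relabel the roots as 0, 1, 2, ... in order of first appearance.
    root_to_label = {}
    return [root_to_label.setdefault(find(p), len(root_to_label))
            for p in range(n_points)]


# Hypothetical toy tree for five 1-D points at 0.0, 0.1, 0.3, 5.0, 5.4.
Z = [
    [0, 1, 0.1, 2],  # points 0 and 1 merge into node 5
    [2, 5, 0.2, 3],  # point 2 joins node 5, giving node 6
    [3, 4, 0.4, 2],  # points 3 and 4 merge into node 7
    [6, 7, 4.7, 5],  # the two groups merge into node 8
]
labels = cut_linkage(Z, 2)
print(labels)
```

Note that a cut like this has no notion of noise: isolated points simply end up as their own tiny clusters, so when cutting a real HDBSCAN tree you may want to treat very small clusters as outliers afterwards.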

Leland McInnes answered Nov 10 '22 10:11
