Is it possible to select the number of clusters in the HDBSCAN algorithm in Python? Or is the only way to play around with input parameters such as alpha and min_cluster_size?
Thanks
UPDATE: here is the code that combines fcluster with hdbscan:
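import hdbscan
from scipy.cluster.hierarchy import fcluster

# X is the data to cluster, an array of shape (n_samples, n_features)
clusterer = hdbscan.HDBSCAN()
clusterer.fit(X)
# Export the single-linkage tree as a SciPy-style linkage matrix
Z = clusterer.single_linkage_tree_.to_numpy()
# Cut the tree so that at most 2 flat clusters remain
labels = fcluster(Z, 2, criterion='maxclust')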
Thankfully, in June 2020 a contributor on GitHub provided a commit ("Module for flat clustering") that adds code to hdbscan allowing us to choose the number of resulting clusters.
To do so:
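from hdbscan import flat
# Fit on the training data and select a flat clustering with n_clusters clusters
clusterer = flat.HDBSCAN_flat(train_df, n_clusters, prediction_data=True)
# Assign new points to clusters, again requesting n_clusters clusters
flat.approximate_predict_flat(clusterer, points_to_predict, n_clusters)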
You can find the code in flat.py. You should be able to choose the number of clusters for test points using approximate_predict_flat.
In addition, a Jupyter notebook has been written that explains how to use it.
If you explicitly need to get a fixed number of clusters then the closest thing to managing that would be to use the cluster hierarchy and perform a flat cut through the hierarchy at the level that gives you the desired number of clusters. That does involve working with one of the tree objects that HDBSCAN exposes and getting your hands a little dirty, but it can be done.
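To make the idea of a flat cut concrete, here is a minimal, dependency-free sketch. It is not how hdbscan does it internally: it assumes plain Euclidean distances instead of HDBSCAN's mutual reachability distance, it never labels points as noise, and the function name flat_cut_single_linkage is made up for this illustration. It builds the single-linkage hierarchy by merging the closest clusters first and simply stops once the requested number of clusters remains.

```python
from itertools import combinations
from math import dist


def flat_cut_single_linkage(points, n_clusters):
    """Merge closest clusters (single linkage) until n_clusters remain.

    Illustrative only: uses Euclidean distance, not mutual reachability.
    """
    n = len(points)
    if not 1 <= n_clusters <= n:
        raise ValueError("n_clusters must be between 1 and len(points)")
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, shortest first. Merging along them in this order
    # reproduces the single-linkage hierarchy (Kruskal's algorithm).
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    remaining = n
    for _, i, j in edges:
        if remaining == n_clusters:
            break  # the flat cut: stop at the requested level
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i
            remaining -= 1

    # Relabel cluster roots as consecutive integers starting at 0.
    label_of_root = {}
    return [
        label_of_root.setdefault(find(i), len(label_of_root))
        for i in range(n)
    ]


points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 5.2), (9.0, 0.0)]
labels = flat_cut_single_linkage(points, 3)
print(labels)  # [0, 0, 0, 1, 1, 2]
```

With a fitted clusterer, the same cut is what fcluster with criterion='maxclust' performs on the linkage matrix from single_linkage_tree_.to_numpy(). Note that a cut like this assigns every point to a cluster, so isolated points end up as tiny clusters rather than noise.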