
HDBSCAN won't utilize all available CPUs. Processes just sleep

For the past few weeks I've been attempting to perform a fairly large clustering analysis using the HDBSCAN algorithm in Python 3.7. The data in question is roughly 4 million rows by 40 columns, about 1.5 GB in CSV format. It's a mixture of ints, bools, and floats of up to 9 digits.

During this period, each time I've gotten the data to cluster it has taken three-plus days, which seems strange given that HDBSCAN is revered for its speed and that I'm running this on a Google Cloud Compute instance with 96 CPUs. I've spent days trying to get it to utilize the instance's processing power, but to no avail.

With algorithm='best', HDBSCAN's auto-detection selects boruvka_kdtree as the algorithm to use. I've tried passing all sorts of values to the core_dist_n_jobs parameter: -2, -1, 1, 96, multiprocessing.cpu_count(), even 300. They all have a similar effect: four main Python processes each utilize a full core, while many more processes are spawned that just sleep.

I refuse to believe that this is how long it's supposed to take on this hardware. I'm convinced I must be missing something, whether it's running JupyterHub on the same machine causing some sort of GIL lock, or some HDBSCAN parameter I've overlooked.

Here is my current call to HDBSCAN:

hdbscan.HDBSCAN(min_cluster_size=100000, min_samples=500, algorithm='best', alpha=1.0, memory=mem, core_dist_n_jobs=multiprocessing.cpu_count())
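
For reference, here is a minimal self-contained version of that call; the CSV path and the cache directory passed as memory are placeholders, not my real ones:

    import multiprocessing

    import hdbscan
    import pandas as pd

    # Placeholder paths -- substitute the real CSV and cache directory.
    data = pd.read_csv('data.csv')
    mem = '/tmp/hdbscan_cache'  # joblib cache dir for the memory parameter

    clusterer = hdbscan.HDBSCAN(
        min_cluster_size=100000,
        min_samples=500,
        algorithm='best',
        alpha=1.0,
        memory=mem,  # caches intermediate computations between runs
        core_dist_n_jobs=multiprocessing.cpu_count(),  # 96 on this instance
    )
    clusterer.fit(data)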

I've followed every existing issue and post related to this problem that I could find, and nothing has worked so far. I'm always open to trying even radical ideas, because this isn't even the full data I want to cluster, and at this rate it would take 4 years to cluster all of it!

asked Oct 31 '25 by Marc Frankel


1 Answer

According to the library's author:

Only the core distance computation can use all the cores, sadly that is apparently the first few seconds. The rest of the computation is quite challenging to parallelise unfortunately and will run on a single thread.

You can read the related GitHub issues below:

Not using all available CPUs?

core_dist_n_jobs =1 or -1 -> no difference at all and computation time extremely high
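
If the single-threaded part is the bottleneck, one mitigation worth trying (my own suggestion, not taken from those issues) is to fit the clusterer on a subsample with prediction_data=True and then assign the remaining rows with hdbscan.approximate_predict, which is fast. A sketch, assuming a pandas DataFrame like the one in the question (path and sample size are placeholders):

    import hdbscan
    import pandas as pd

    data = pd.read_csv('data.csv')  # placeholder path

    # Fit on a subsample so the single-threaded tree construction
    # runs over far fewer points.
    sample = data.sample(n=500_000, random_state=42)
    clusterer = hdbscan.HDBSCAN(
        min_cluster_size=12500,  # scaled down roughly with the sample size
        min_samples=500,
        prediction_data=True,    # required by approximate_predict
    ).fit(sample)

    # Assign every remaining row to the nearest fitted cluster.
    rest = data.drop(sample.index)
    labels, strengths = hdbscan.approximate_predict(clusterer, rest)

The trade-off is that approximate_predict can only assign points to clusters found in the sample, but with a min_cluster_size of 100000 on 4 million rows, clusters that large should still be well represented in a 500k sample.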

answered Nov 03 '25 by EmreAydin


