
Python Clustering Algorithms

I've been looking through scipy and sklearn for clustering algorithms for a particular problem I have. I need some way of partitioning a population of N particles into k groups, where k is not necessarily known and, in addition, no a priori linking lengths are known (similar to this question).

I've tried kmeans, which works well if you know how many clusters you want. I've tried dbscan, which does poorly unless you give it a characteristic length scale on which to look for clusters. The problem is that I have potentially thousands of these clusters of particles, and I cannot spend the time hand-tuning the cluster count or length scale that kmeans/dbscan need for each one.

Here is an example of what dbscan finds: dbscanfail

You can see that there really are two separate populations here, but however I adjust the epsilon parameter (the maximum distance between two points for them to count as neighbors), I simply cannot get it to see those two populations of particles.
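To make the epsilon sensitivity concrete, here is a minimal sketch of how the number of clusters dbscan reports changes with `eps`. The two-blob data from `make_blobs` is a synthetic stand-in for my particles, and the helper `count_clusters`, the `min_samples=5` setting, and the list of `eps` values are all illustrative assumptions rather than tuned choices.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic stand-in for the particle positions (assumption: two 2-D blobs).
X, _ = make_blobs(
    n_samples=400, centers=[[0, 0], [3, 3]], cluster_std=0.8, random_state=0
)


def count_clusters(points, eps, min_samples=5):
    """Run DBSCAN once and return the number of clusters found (noise excluded)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    # DBSCAN labels noise points as -1; they do not form a cluster.
    return len(set(labels) - {-1})


# Sweep a few eps values (hypothetical choices) to see how the result shifts.
counts = {eps: count_clusters(X, eps) for eps in (0.1, 0.3, 0.5, 1.0, 5.0)}

for eps, n_found in counts.items():
    print(f"eps={eps}: {n_found} cluster(s)")
```

A very large `eps` merges everything into a single cluster, while a very small one fragments the data or marks most of it as noise, so the answer depends entirely on the length scale I supply.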

Are there any other algorithms that would work here? I'm looking to supply minimal information up front. In other words, I'd like the algorithm to be able to make "smart" decisions about what could constitute a separate cluster.

asked Nov 13 '13 at 14:11 by astromax




1 Answer

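I've found one that requires no a priori information or guesses and does very well for what I'm asking it to do. It's called Mean Shift and is available in scikit-learn as sklearn.cluster.MeanShift. (It does have a bandwidth parameter, but scikit-learn estimates it from the data when none is given.) It's also relatively quick compared to other algorithms such as Affinity Propagation.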

Here's an example of what it gives:

MeanShiftResults

I also want to point out that the documentation states that it may not scale well.
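For anyone who wants to try it, here is a minimal sketch of running Mean Shift without specifying the number of clusters. The two-blob data from `make_blobs` is a synthetic stand-in (not my actual particles), and `quantile=0.2` and `bin_seeding=True` are illustrative assumptions, not tuned values.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic stand-in for the particle positions (assumption: two well-separated blobs).
X, y_true = make_blobs(
    n_samples=600, centers=[[0, 0], [10, 10]], cluster_std=0.7, random_state=0
)

# Derive a kernel bandwidth from the data instead of supplying one by hand.
# quantile=0.2 is an illustrative choice; larger values give a wider kernel.
bandwidth = estimate_bandwidth(X, quantile=0.2, random_state=0)

# bin_seeding=True seeds the search from a coarse grid of the data,
# which cuts down the number of starting points.
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
labels = ms.fit_predict(X)

n_clusters = len(np.unique(labels))
print(f"estimated bandwidth: {bandwidth:.3f}")
print(f"clusters found: {n_clusters}")
print("cluster centers:")
print(ms.cluster_centers_)
```

The number of clusters falls out of the bandwidth rather than being passed in. The estimated bandwidth still acts as an implicit length scale, so it is worth checking the result on a few representative clusters before running it on thousands.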

answered Sep 28 '22 at 17:09 by astromax