 

large scale clustering library possibly with python bindings

I've been trying to cluster a fairly large dataset consisting of 50000 measurement vectors of dimension 7. I'm trying to generate about 30 to 300 clusters for further processing.

I've been trying the following clustering implementations with no luck:

  • Pycluster.kcluster (gives only 1-2 non-empty clusters on my dataset)
  • scipy.cluster.hierarchy.fclusterdata (runs too long)
  • scipy.cluster.vq.kmeans (runs out of memory)
  • sklearn.cluster.hierarchical.Ward (runs too long)

Are there any other implementations that I might have missed?
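To give an idea of the scale, here is a minimal sketch of the sort of call involved, on random data of the same shape (scipy k-means shown as one example; the real data is of course not random):

    import numpy as np
    from scipy.cluster.vq import kmeans, whiten

    # Random stand-in for the real data: 50000 measurement vectors, 7 dimensions.
    X = np.random.rand(50000, 7)

    # whiten() rescales each feature to unit variance, as the scipy docs recommend,
    # then kmeans() is asked for ~100 centroids (somewhere in the 30-300 range).
    centroids, distortion = kmeans(whiten(X), 100)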

asked Jun 18 '12 by tisch

People also ask

Which library is used for clustering in Python?

PyCaret's clustering module (pycaret.clustering) is an unsupervised machine learning module that groups a set of objects in such a way that those in the same group (called a cluster) are more similar to each other than to those in other groups.
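As a rough, version-dependent sketch of that workflow (the function names below follow PyCaret's documented clustering API; exact setup arguments differ between PyCaret versions):

    import numpy as np
    import pandas as pd
    from pycaret.clustering import setup, create_model, assign_model

    # Toy data frame standing in for a real dataset.
    df = pd.DataFrame(np.random.rand(1000, 7), columns=[f"f{i}" for i in range(7)])

    setup(data=df)                                    # initialise the clustering experiment
    kmeans = create_model("kmeans", num_clusters=10)  # fit a k-means model
    labeled = assign_model(kmeans)                    # data frame with an added cluster label column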

Which clustering can handle big data?

According to this research, the k-means method is regarded as a viable approach for certain applications of big data clustering and has attracted more researchers than any other technique.

Is HDBSCAN better than DBSCAN?

The main disadvantage of DBSCAN is that it is much more prone to noise, which may lead to false clustering. HDBSCAN, on the other hand, focuses on high-density clustering, which reduces this noise problem and allows a hierarchical clustering based on a decision-tree approach.
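As a rough sketch of the two APIs side by side (scikit-learn's DBSCAN and the hdbscan package; the parameter values below are arbitrary):

    import numpy as np
    from sklearn.cluster import DBSCAN
    import hdbscan

    X = np.random.rand(5000, 7)  # toy data

    # DBSCAN needs a single global density threshold (eps); points in sparse
    # regions are labelled -1 (noise).
    db_labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)

    # HDBSCAN builds a cluster hierarchy over varying densities and extracts the
    # most stable clusters from it, so no single eps has to be chosen up front.
    hdb_labels = hdbscan.HDBSCAN(min_cluster_size=25).fit_predict(X)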

Is K means clustering good for large datasets?

K-means is one of the most widely used clustering methods, and K-means based on MapReduce is considered an advanced solution for clustering very large datasets. However, execution time remains an obstacle, since the number of iterations grows as the dataset size and the number of clusters increase.


1 Answer

50000 instances and 7 dimensions isn't really big, and should not kill an implementation.

Although it doesn't have Python bindings, give ELKI a try. The benchmark set they use on their homepage has 110250 instances in 8 dimensions, and apparently they run k-means on it in 60 seconds and the much more advanced OPTICS in 350 seconds.

Avoid hierarchical clustering. It's really only for small data sets. The way it is commonly implemented, on matrix operations, it is O(n^3), which is really bad for large data sets. So I'm not surprised these two timed out for you.
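A quick back-of-the-envelope check of why the matrix-based approach falls over at this size (not part of the original answer):

    n = 50000
    # A full pairwise distance matrix in double precision, before any of the
    # O(n^3) merging work even starts:
    print(n * n * 8 / 1e9, "GB")  # 20.0 GB (even a condensed half matrix is ~10 GB)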

DBSCAN and OPTICS are O(n log n) when implemented with index support; implemented naively, they are O(n^2). K-means is really fast, but the results are often not satisfactory (because it always splits in the middle). It should run in O(n * k * iter), and it usually converges in not too many iterations (iter << 100). But it only works with Euclidean distance, and it just doesn't work well with some data (high-dimensional, discrete, binary, clusters of different sizes, ...).
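Not part of the original answer, but a minimal scikit-learn sketch of the fast option described above, on random data of the stated shape (the number of clusters and other parameters are arbitrary):

    import numpy as np
    from sklearn.cluster import KMeans

    # Random stand-in for the 50000 x 7 measurements.
    X = np.random.rand(50000, 7)

    # k-means: roughly O(n * k * iter); Euclidean distance only.
    km = KMeans(n_clusters=100, n_init=3, max_iter=100)
    labels = km.fit_predict(X)
    print(np.bincount(labels))  # cluster sizes -- expect ~100 non-empty clusters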

answered Sep 26 '22 by Has QUIT--Anony-Mousse