 

Clustering a billion items (or which clustering methods run in linear time?)

I have a billion feature vectors and I would like to put them into approximate clusters. Looking, for example, at the methods listed at http://scikit-learn.org/stable/modules/clustering.html#clustering, it is not at all clear to me how their running times scale with the data size (except for Affinity Propagation, which is clearly too slow).

What methods are suitable for clustering such a large data set? I assume any method will have to run in O(n) time.

asked Sep 15 '15 by graffe

People also ask

Is k-means clustering linear?

k-means clustering is not invariant to linear transformations of the data. The optimal linear transformation for clustering will stretch the distribution so that the primary direction of variability aligns with actual differences in the clusters.
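
A quick way to see this non-invariance is to rescale one coordinate and watch the k-means partition change. A minimal sketch, using synthetic data (the cluster geometry and the scaling factor are made-up assumptions):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(1)
    # Two groups separated along y, but with much larger spread along x.
    a = rng.normal(loc=[0, 0], scale=[5.0, 0.5], size=(200, 2))
    b = rng.normal(loc=[0, 3], scale=[5.0, 0.5], size=(200, 2))
    X = np.vstack([a, b])

    # On the raw data, k-means tends to split along the high-variance x axis.
    labels_raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # After a linear transformation (stretch y by 10), the true y-separation
    # dominates the Euclidean distances and the partition changes.
    X_stretched = X * np.array([1.0, 10.0])
    labels_stretched = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_stretched)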

What are the main clustering methods?

In cluster analysis, objects are first grouped by similarity, and labels are then assigned to the groups. The main clustering methods are partitioning-based, hierarchical, density-based, grid-based, and model-based clustering.
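
As a rough mapping (my own assumption, not part of the original answer), scikit-learn ships at least one estimator for most of these families; grid-based methods such as STING or CLIQUE are a notable absence:

    from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
    from sklearn.mixture import GaussianMixture

    methods = {
        "partitioning-based": KMeans(n_clusters=8),
        "hierarchical": AgglomerativeClustering(n_clusters=8),
        "density-based": DBSCAN(eps=0.5, min_samples=5),
        "model-based": GaussianMixture(n_components=8),
        # grid-based (e.g. STING, CLIQUE): no scikit-learn implementation
    }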

Which type of clustering is used for big data?

According to this research, the k-means method is regarded as a viable approach for certain applications of big-data clustering and has attracted more researchers than any other technique.
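
For data at this scale, scikit-learn's MiniBatchKMeans is the usual near-linear-time workhorse. A minimal sketch on synthetic stand-in data (the array shape, cluster count, and batch size are assumptions):

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1_000_000, 4))  # stand-in for the real feature vectors

    # Each iteration touches only a small random batch, so the cost per pass
    # is O(batch_size * k * d) and the full fit scales roughly linearly in n.
    km = MiniBatchKMeans(n_clusters=10, batch_size=10_000, n_init=3, random_state=0)
    labels = km.fit_predict(X)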


1 Answer

The k-means complexity sounds reasonable for your data (only 4 components). The tricky parts are the initialization and the choice of the number of clusters. You can try different random initializations, but this can be time-consuming. An alternative is to sub-sample your data and run a more expensive clustering algorithm such as Affinity Propagation on the sample, then use its solution as the init for k-means and run it on all your data.
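
A minimal sketch of that pipeline in scikit-learn, assuming synthetic 4-dimensional data and an arbitrary sample size (both assumptions; on a real billion-point set you would also stream or memory-map X rather than hold it in RAM):

    import numpy as np
    from sklearn.cluster import AffinityPropagation, MiniBatchKMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1_000_000, 4))  # stand-in for the real data

    # 1. Sub-sample a set small enough for the quadratic-cost algorithm.
    sample = X[rng.choice(len(X), size=2_000, replace=False)]

    # 2. Affinity Propagation on the sample picks both the number of
    #    clusters and a set of exemplar centers.
    ap = AffinityPropagation(random_state=0).fit(sample)
    centers = ap.cluster_centers_

    # 3. Those centers initialize k-means over the full data; the mini-batch
    #    variant keeps the big pass roughly linear in n.
    km = MiniBatchKMeans(
        n_clusters=len(centers),
        init=centers,
        n_init=1,  # the initialization is already informed
        batch_size=10_000,
        random_state=0,
    ).fit(X)

    labels = km.labels_

A side benefit of this design: len(centers) gives you the number of clusters "for free", which addresses the choice-of-k problem mentioned above.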

answered Sep 28 '22 by Mikael Rousson