I have a billion feature vectors and I would like to put them into approximate clusters. Looking at the methods from http://scikit-learn.org/stable/modules/clustering.html#clustering for example, it is not at all clear to me how their running times scale with the data size (except for Affinity Propagation, which is clearly too slow).
What methods are suitable for clustering such a large data set? I assume any method will have to run in O(n) time.
k-means clustering is not invariant to linear transformations of the data: rescaling or rotating the features changes which clusters it finds. A good linear transformation for clustering stretches the distribution so that the primary directions of variability align with the actual differences between the clusters.
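As a minimal sketch of this effect (the synthetic data and parameters here are made up purely for illustration), scaling the features to unit variance can completely change which split k-means prefers:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two clusters separated along the first axis, but with a second axis
# whose variance dominates the Euclidean distances.
X = np.vstack([
    rng.normal([0, 0], [1, 50], size=(500, 2)),
    rng.normal([5, 0], [1, 50], size=(500, 2)),
])

# Raw data: the high-variance axis dominates, so k-means tends to split on it.
labels_raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Rescaled data: each feature has unit variance, so the real separation
# along the first axis drives the clustering instead.
X_scaled = StandardScaler().fit_transform(X)
labels_scaled = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
```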
In cluster analysis, objects are first grouped by similarity and then labels are assigned to the groups. The main families of clustering methods are partitioning-based, hierarchical, density-based, grid-based, and model-based clustering.
According to this research, the k-means method is regarded as a viable approach for certain applications of big data clustering and has attracted more researchers than any other technique.
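For a data set of this size, a streaming variant is the usual way to keep k-means practical. A rough sketch using scikit-learn's MiniBatchKMeans (the synthetic data, cluster count, and chunk size are assumptions for illustration, not a definitive setup):

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the real feature vectors; replace with your own data,
# e.g. a memory-mapped array loaded chunk by chunk.
X, _ = make_blobs(n_samples=1_000_000, n_features=4, centers=100, random_state=0)

mbk = MiniBatchKMeans(n_clusters=100, batch_size=10_000, n_init=3, random_state=0)

# Stream the data in chunks so the full matrix never has to sit in memory at once;
# each partial_fit call updates the centroids with one batch.
chunk = 100_000
for start in range(0, X.shape[0], chunk):
    mbk.partial_fit(X[start:start + chunk])

labels = mbk.predict(X)  # assign every vector to its nearest centroid
```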
The K-means complexity sounds reasonable for your data (only 4 components). The tricky parts are the initialization and the choice of the number of clusters. You can try different random initializations, but this can be time consuming. An alternative is to sub-sample your data and run a more expensive clustering algorithm such as Affinity Propagation on the sample, then use that solution as the initialization for k-means and run it on all of your data.
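A minimal sketch of that sub-sample-then-initialize idea, assuming scikit-learn; the sample size, synthetic data, and other parameters are placeholders you would tune for your own data:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation, KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the real data; swap in your own feature matrix.
X, _ = make_blobs(n_samples=200_000, n_features=4, centers=20, random_state=0)

# 1. Sub-sample so the expensive O(n^2) Affinity Propagation stays tractable.
rng = np.random.default_rng(0)
sample = X[rng.choice(X.shape[0], size=2_000, replace=False)]

# 2. Run Affinity Propagation on the sample to discover exemplars.
ap = AffinityPropagation(random_state=0).fit(sample)
centers = ap.cluster_centers_

# 3. Use those exemplars as the initial centroids for k-means on the full data.
km = KMeans(n_clusters=centers.shape[0], init=centers, n_init=1).fit(X)
labels = km.labels_
```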