I have a billion feature vectors and I would like to put them into approximate clusters. Looking at the methods from http://scikit-learn.org/stable/modules/clustering.html#clustering for example, it is not at all clear to me how their running times scale with the data size (except for Affinity Propagation, which is clearly too slow).
What methods are suitable for clustering such a large data set? I assume any method will have to run in O(n) time.
k-means clustering is not invariant to linear transformations of the data: rescaling or rotating the features changes which clusters it finds. A good linear transformation for clustering stretches the distribution so that the primary directions of variability align with the actual differences between the clusters.
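As a minimal sketch of this effect (the synthetic data and parameters here are made up purely for illustration), scaling the features to unit variance can completely change which split k-means prefers:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two clusters separated along the first axis, but with a second axis
# whose variance dominates the Euclidean distances.
X = np.vstack([
    rng.normal([0, 0], [1, 50], size=(500, 2)),
    rng.normal([5, 0], [1, 50], size=(500, 2)),
])

# Raw data: the high-variance axis dominates, so k-means tends to split on it.
labels_raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Rescaled data: each feature has unit variance, so the real separation
# along the first axis drives the clustering instead.
X_scaled = StandardScaler().fit_transform(X)
labels_scaled = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
```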
In cluster analysis, objects are first grouped by similarity and then labels are assigned to the groups. The main families of clustering methods are partitioning-based, hierarchical, density-based, grid-based, and model-based clustering.
According to this research, the k-means method is regarded as a viable approach for certain applications of big data clustering and has attracted more researchers than any other technique.
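For a data set of this size, a streaming variant is the usual way to keep k-means practical. A rough sketch using scikit-learn's MiniBatchKMeans (the synthetic data, cluster count, and chunk size are assumptions for illustration, not a definitive setup):

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the real feature vectors; replace with your own data,
# e.g. a memory-mapped array loaded chunk by chunk.
X, _ = make_blobs(n_samples=1_000_000, n_features=4, centers=100, random_state=0)

mbk = MiniBatchKMeans(n_clusters=100, batch_size=10_000, n_init=3, random_state=0)

# Stream the data in chunks so the full matrix never has to sit in memory at once;
# each partial_fit call updates the centroids with one batch.
chunk = 100_000
for start in range(0, X.shape[0], chunk):
    mbk.partial_fit(X[start:start + chunk])

labels = mbk.predict(X)  # assign every vector to its nearest centroid
```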
The K-means complexity sounds reasonable for your data (only 4 components). The tricky parts are the initialization and the choice of the number of clusters. You can try different random initializations, but this can be time consuming. An alternative is to sub-sample your data and run a more expensive clustering algorithm such as Affinity Propagation on the sample, then use that solution as the initialization for k-means and run it on all of your data.
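A minimal sketch of that sub-sample-then-initialize idea, assuming scikit-learn; the sample size, synthetic data, and other parameters are placeholders you would tune for your own data:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation, KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the real data; swap in your own feature matrix.
X, _ = make_blobs(n_samples=200_000, n_features=4, centers=20, random_state=0)

# 1. Sub-sample so the expensive O(n^2) Affinity Propagation stays tractable.
rng = np.random.default_rng(0)
sample = X[rng.choice(X.shape[0], size=2_000, replace=False)]

# 2. Run Affinity Propagation on the sample to discover exemplars.
ap = AffinityPropagation(random_state=0).fit(sample)
centers = ap.cluster_centers_

# 3. Use those exemplars as the initial centroids for k-means on the full data.
km = KMeans(n_clusters=centers.shape[0], init=centers, n_init=1).fit(X)
labels = km.labels_
```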