 

large scale clustering library possibly with python bindings

I've been trying to cluster a fairly large dataset consisting of 50000 measurement vectors of dimension 7. I'm trying to generate about 30 to 300 clusters for further processing.

I've been trying the following clustering implementations with no luck:

  • Pycluster.kcluster (gives only 1-2 non-empty clusters on my dataset)
  • scipy.cluster.hierarchy.fclusterdata (runs too long)
  • scipy.cluster.vq.kmeans (runs out of memory)
  • sklearn.cluster.hierarchical.Ward (runs too long)

Are there any other implementations that I might have missed?
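To give an idea of the scale, here is a minimal sketch of the sort of call involved, on random data of the same shape (scipy k-means shown as one example; the real data is of course not random):

    import numpy as np
    from scipy.cluster.vq import kmeans, whiten

    # Random stand-in for the real data: 50000 measurement vectors, 7 dimensions.
    X = np.random.rand(50000, 7)

    # whiten() rescales each feature to unit variance, as the scipy docs recommend,
    # then kmeans() is asked for ~100 centroids (somewhere in the 30-300 range).
    centroids, distortion = kmeans(whiten(X), 100)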

asked Jun 18 '12 by tisch

People also ask

Which library is used for clustering in Python?

PyCaret's clustering module (pycaret.clustering) is an unsupervised machine learning module that groups a set of objects in such a way that those in the same group (called a cluster) are more similar to each other than to those in other groups.
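As a rough, version-dependent sketch of that workflow (the function names below follow PyCaret's documented clustering API; exact setup arguments differ between PyCaret versions):

    import numpy as np
    import pandas as pd
    from pycaret.clustering import setup, create_model, assign_model

    # Toy data frame standing in for a real dataset.
    df = pd.DataFrame(np.random.rand(1000, 7), columns=[f"f{i}" for i in range(7)])

    setup(data=df)                                    # initialise the clustering experiment
    kmeans = create_model("kmeans", num_clusters=10)  # fit a k-means model
    labeled = assign_model(kmeans)                    # data frame with an added cluster label column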

Which clustering can handle big data?

According to this research, the k-means method is regarded as a viable approach for certain applications of big data clustering and has attracted more researchers than any other technique.

Is HDBSCAN better than DBSCAN?

The main disadvantage of DBSCAN is that it is much more prone to noise, which may lead to false clustering. HDBSCAN, on the other hand, focuses on high-density clustering, which reduces this noise problem and allows a hierarchical clustering based on a decision-tree approach.
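As a rough sketch of the two APIs side by side (scikit-learn's DBSCAN and the hdbscan package; the parameter values below are arbitrary):

    import numpy as np
    from sklearn.cluster import DBSCAN
    import hdbscan

    X = np.random.rand(5000, 7)  # toy data

    # DBSCAN needs a single global density threshold (eps); points in sparse
    # regions are labelled -1 (noise).
    db_labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)

    # HDBSCAN builds a cluster hierarchy over varying densities and extracts the
    # most stable clusters from it, so no single eps has to be chosen up front.
    hdb_labels = hdbscan.HDBSCAN(min_cluster_size=25).fit_predict(X)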

Is K means clustering good for large datasets?

K-means is one of the most widely used clustering methods, and K-means based on MapReduce is considered an advanced solution for clustering very large datasets. However, execution time remains an obstacle, since the number of iterations grows as the dataset size and the number of clusters increase.


1 Answer

50000 instances and 7 dimensions isn't really big, and should not kill an implementation.

Although it doesn't have Python bindings, give ELKI a try. The benchmark set they use on their homepage has 110250 instances in 8 dimensions, and apparently they run k-means on it in 60 seconds and the much more advanced OPTICS in 350 seconds.

Avoid hierarchical clustering. It's really only for small data sets. The way it is commonly implemented, on matrix operations, it is O(n^3), which is really bad for large data sets. So I'm not surprised these two timed out for you.
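A quick back-of-the-envelope check of why the matrix-based approach falls over at this size (not part of the original answer):

    n = 50000
    # A full pairwise distance matrix in double precision, before any of the
    # O(n^3) merging work even starts:
    print(n * n * 8 / 1e9, "GB")  # 20.0 GB (even a condensed half matrix is ~10 GB)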

DBSCAN and OPTICS are O(n log n) when implemented with index support; implemented naively, they are O(n^2). K-means is really fast, but the results are often not satisfactory (because it always splits in the middle). It should run in O(n * k * iter), and it usually converges in not too many iterations (iter << 100). But it only works with Euclidean distance, and it just doesn't work well with some data (high-dimensional, discrete, binary, clusters of different sizes, ...).
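Not part of the original answer, but a minimal scikit-learn sketch of the fast option described above, on random data of the stated shape (the number of clusters and other parameters are arbitrary):

    import numpy as np
    from sklearn.cluster import KMeans

    # Random stand-in for the 50000 x 7 measurements.
    X = np.random.rand(50000, 7)

    # k-means: roughly O(n * k * iter); Euclidean distance only.
    km = KMeans(n_clusters=100, n_init=3, max_iter=100)
    labels = km.fit_predict(X)
    print(np.bincount(labels))  # cluster sizes -- expect ~100 non-empty clusters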

answered Sep 26 '22 by Has QUIT--Anony-Mousse