In scikit-learn, can DBSCAN use sparse matrix?

3 Answers

The scikit-learn implementation of DBSCAN is, unfortunately, very naive. It needs to be rewritten to take indexing (ball trees etc.) into account.

As of now, it apparently insists on computing a complete distance matrix, which wastes a lot of memory.

May I suggest that you just reimplement DBSCAN yourself? It's fairly easy; good pseudocode exists, e.g. on Wikipedia and in the original publication. It should be just a few lines, and you can then easily take advantage of your data representation. E.g. if you already have a similarity graph in a sparse representation, it's usually fairly trivial to do a "range query" (i.e. use only the edges that satisfy your distance threshold).
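A minimal sketch of that idea, assuming the sparse graph is a plain dict of dicts holding distances (a hypothetical layout, not something scikit-learn provides) and that every point appears as a key with symmetric edges. The function name `dbscan_from_graph` and the sample graph are illustrative only:

```python
from collections import deque

NOISE = -1

def dbscan_from_graph(neighbors, eps, min_samples):
    """Label points given a sparse neighbor graph.

    neighbors: dict mapping point id -> {other id: distance}.
    Only stored edges are ever looked at; missing edges count as "too far".
    """
    def region_query(p):
        # Range query: keep edges within eps; a point is its own neighbor.
        return [p] + [q for q, d in neighbors.get(p, {}).items()
                      if d <= eps and q != p]

    labels = {}
    cluster = -1
    for p in neighbors:
        if p in labels:
            continue
        seeds = region_query(p)
        if len(seeds) < min_samples:
            labels[p] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1
        labels[p] = cluster
        queue = deque(seeds)
        while queue:
            q = queue.popleft()
            if labels.get(q) == NOISE:
                labels[q] = cluster  # border point reached from a core point
            if q in labels:
                continue
            labels[q] = cluster
            q_neighbors = region_query(q)
            if len(q_neighbors) >= min_samples:
                queue.extend(q_neighbors)  # q is a core point: keep expanding
    return labels

# Hypothetical graph: a, b, c are mutually close; d hangs off c by a long edge.
graph = {
    "a": {"b": 0.1, "c": 0.2},
    "b": {"a": 0.1, "c": 0.15},
    "c": {"a": 0.2, "b": 0.15, "d": 0.9},
    "d": {"c": 0.9},
}
labels = dbscan_from_graph(graph, eps=0.3, min_samples=3)
```

Here `min_samples` counts the point itself, which matches the usual DBSCAN convention; memory use is proportional to the number of stored edges rather than to n_samples².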

Here is an issue on the scikit-learn GitHub where they talk about improving the implementation. A user reports that his version using the ball tree is 50x faster (which doesn't surprise me; I've seen similar speedups with indexes before, and the gap will likely become more pronounced as the data set grows).

Update: the DBSCAN version in scikit-learn has received substantial improvements since this answer was written.

answered Nov 05 '22 23:11

Has QUIT--Anony-Mousse

Yes, since version 0.16.1. Here's a commit for a test:

https://github.com/scikit-learn/scikit-learn/commit/494b8e574337e510bcb6fd0c941e390371ef1879
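For illustration, a small sketch of what passing a sparse matrix looks like, assuming a reasonably recent scikit-learn and SciPy are installed. The toy data and the `eps` / `min_samples` values are made up for the example:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import DBSCAN

# Two tight groups of three points each, plus one isolated point.
dense = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
    [5.0, 5.0], [5.0, 5.1], [5.1, 5.0],
    [10.0, 0.0],
])
X = csr_matrix(dense)  # sparse CSR input instead of a dense array

# eps and min_samples are illustrative values, not recommendations.
db = DBSCAN(eps=0.5, min_samples=2).fit(X)
labels = db.labels_  # -1 marks noise
```

With this data the two groups should come out as separate clusters and the isolated point as noise. Whether the neighbor search stays memory-efficient on your own data depends on the metric and the scikit-learn version, so it is worth checking on a subsample first.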

answered Nov 06 '22 00:11

K.-Michael Aye

You can pass a distance matrix to DBSCAN, so assuming X is your sample matrix, the following should work:

from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import euclidean_distances

# Dense n_samples x n_samples matrix of pairwise distances
D = euclidean_distances(X, X)
db = DBSCAN(metric="precomputed").fit(D)

However, the matrix D will be even larger than X: n_samples² entries. With sparse matrices, k-means is probably the best option.
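To put a rough number on that, here is a back-of-the-envelope estimate. The sizes are assumptions: 8-byte floats, 4-byte indices for a CSR matrix, and a hypothetical data set of 100,000 samples with about 10 nonzeros per row:

```python
def dense_distance_matrix_bytes(n_samples, bytes_per_entry=8):
    # A full pairwise distance matrix stores n_samples * n_samples entries.
    return n_samples * n_samples * bytes_per_entry

def csr_bytes(n_samples, nnz, bytes_per_value=8, bytes_per_index=4):
    # CSR stores one value and one column index per nonzero,
    # plus n_samples + 1 row pointers.
    return nnz * (bytes_per_value + bytes_per_index) + (n_samples + 1) * bytes_per_index

n = 100_000
dense_bytes = dense_distance_matrix_bytes(n)
sparse_bytes = csr_bytes(n, nnz=1_000_000)

print(f"dense D:  {dense_bytes / 1e9:.1f} GB")
print(f"sparse X: {sparse_bytes / 1e6:.1f} MB")
```

Under these assumptions the dense matrix needs about 80 GB while the sparse input occupies roughly 12 MB, which is why the precomputed route stops being practical well before the data itself becomes large.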

(DBSCAN may seem attractive because it doesn't need a pre-determined number of clusters, but it trades that for two parameters that you have to tune. It's mostly applicable in settings where the samples are points in space and you know how close points must be to belong to the same cluster, or when you have a black-box distance metric that scikit-learn doesn't support.)

answered Nov 05 '22 23:11

Fred Foo


Tags: machine-learning, cluster-analysis, scikit-learn, data-mining, dbscan

user2147650
