Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

In scikit-learn, can DBSCAN use sparse matrix?

I got Memory Error when I was running dbscan algorithm of scikit. My data is about 20000*10000, it's a binary matrix.

(Maybe it's not suitable to use DBSCAN with such a matrix. I'm a beginner of machine learning. I just want to find a cluster method which don't need an initial cluster number)

Anyway I found sparse matrix and feature extraction of scikit.

http://scikit-learn.org/dev/modules/feature_extraction.html http://docs.scipy.org/doc/scipy/reference/sparse.html

But I still have no idea how to use it. In DBSCAN's specification, there is no indication about using sparse matrix. Is it not allowed?

If anyone knows how to use sparse matrix in DBSCAN, please tell me. Or you can tell me a more suitable cluster method.

like image 999
user2147650 Avatar asked Apr 19 '13 01:04

user2147650


People also ask

What is sparse matrix in Scikit learn?

A sparse matrix is a matrix that is comprised of mostly zero values. Sparse matrices are distinct from matrices with mostly non-zero values, which are referred to as dense matrices. A matrix is sparse if many of its coefficients are zero.

Is DBSCAN is capable to discover clusters of any shapes?

DBSCAN stands for density-based spatial clustering of applications with noise. It is able to find arbitrary shaped clusters and clusters with noise (i.e. outliers). The main idea behind DBSCAN is that a point belongs to a cluster if it is close to many points from that cluster.

What is DBSCAN Sklearn?

DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density. Read more in the User Guide.

What is DBSCAN for which situation you suggest the DBSCAN clustering?

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a base algorithm for density-based clustering. It can discover clusters of different shapes and sizes from a large amount of data, which is containing noise and outliers.


3 Answers

The scikit implementation of DBSCAN is, unfortunately, very naive. It needs to be rewritten to take indexing (ball trees etc.) into account.

As of now, it will apparently insist of computing a complete distance matrix, which wastes a lot of memory.

May I suggest that you just reimplement DBSCAN yourself. It's fairly easy, there exists good pseudocode e.g. on Wikipedia and in the original publication. It should be just a few lines, and you can then easily take benefit of your data representation. E.g. if you already have a similarity graph in a sparse representation, it's usually fairly trivial to do a "range query" (i.e. use only the edges that satisfy your distance threshold)

Here is a issue in scikit-learn github where they talk about improving the implementation. A user reports his version using the ball-tree is 50x faster (which doesn't surprise me, I've seen similar speedups with indexes before - it will likely become more pronounced when further increasing the data set size).

Update: the DBSCAN version in scikit-learn has received substantial improvements since this answer was written.

like image 85
Has QUIT--Anony-Mousse Avatar answered Nov 05 '22 23:11

Has QUIT--Anony-Mousse


Yes, since version 0.16.1. Here's a commit for a test:

https://github.com/scikit-learn/scikit-learn/commit/494b8e574337e510bcb6fd0c941e390371ef1879

like image 43
K.-Michael Aye Avatar answered Nov 06 '22 00:11

K.-Michael Aye


You can pass a distance matrix to DBSCAN, so assuming X is your sample matrix, the following should work:

from sklearn.metrics.pairwise import euclidean_distances

D = euclidean_distances(X, X)
db = DBSCAN(metric="precomputed").fit(D)

However, the matrix D will be even larger than X: n_samples² entries. With sparse matrices, k-means is probably the best option.

(DBSCAN may seem attractive because it doesn't need a pre-determined number of clusters, but it trades that for two parameters that you have to tune. It's mostly applicable in settings where the samples are points in space and you know how close you want those points to be to be in the same cluster, or when you have a black box distance metric that scikit-learn doesn't support.)

like image 34
Fred Foo Avatar answered Nov 05 '22 23:11

Fred Foo