
scikit-learn DBSCAN memory usage

UPDATED: In the end, the solution I opted to use for clustering my large dataset was the one suggested by Anony-Mousse below. That is, using ELKI's DBSCAN implementation to do my clustering rather than scikit-learn's. It can be run from the command line and, with proper indexing, performs this task within a few hours. Use the GUI and small sample datasets to work out the options you want to use and then go to town. Worth looking into. Anywho, read on for a description of my original problem and some interesting discussion.

I have a dataset with ~2.5 million samples, each with 35 features (floating point values) that I'm trying to cluster. I've been trying to do this with scikit-learn's implementation of DBSCAN, using the Manhattan distance metric and a value of epsilon estimated from some small random samples drawn from the data. So far, so good. (here is the snippet, for reference)

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=40, min_samples=10, metric='cityblock').fit(mydata)

My issue at the moment is that I easily run out of memory. (I'm currently working on a machine with 16 GB of RAM)

My question is: is DBSCAN calculating the full pairwise distance matrix as it runs, and is that what's gobbling up my memory? A matrix of (2.5 million)^2 entries at 8 bytes each is obviously stupidly large, so if that's what's happening I'd understand the crash. Should I not be using the fit() method? And more generally, is there a way around this issue, or am I barking up the wrong tree here?
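
Just to make the scale concrete, here's the back-of-the-envelope arithmetic (rough numbers):

n = 2500000
matrix_bytes = n * n * 8          # one float64 per pairwise distance
print(matrix_bytes / 1e12)        # ~50 TB -- no wonder it doesn't fit in RAM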

Apologies if the answer winds up being obvious. I've been puzzling over this for a few days. Thanks!

Addendum: If anyone could also explain the difference between fit(X) and fit_predict(X) to me more explicitly, I'd appreciate it--I'm afraid I just don't quite get it.

Addendum #2: To be sure, I just tried this on a machine with ~550 GB of RAM and it still blew up, so I feel like DBSCAN is likely trying to make a pairwise distance matrix or something I clearly don't want it to do. I guess now the big question is how to stop that behavior, or find other methods that might suit my needs more. Thanks for bearing with me here.

Addendum #3(!): I forgot to attach the traceback, so here it is:

Traceback (most recent call last):
  File "tDBSCAN.py", line 34, in <module>
    db = DBSCAN(eps=float(sys.argv[2]), min_samples=10, metric='cityblock').fit(mydata)
  File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/base.py", line 329, in fit_predict
    self.fit(X)
  File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/cluster/dbscan_.py", line 186, in fit
    **self.get_params())
  File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/cluster/dbscan_.py", line 69, in dbscan
    D = pairwise_distances(X, metric=metric)
  File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/metrics/pairwise.py", line 651, in pairwise_distances
    return func(X, Y, **kwds)
  File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/metrics/pairwise.py", line 237, in manhattan_distances
    D = np.abs(X[:, np.newaxis, :] - Y[np.newaxis, :, :])
MemoryError
asked May 05 '13 by JamesT



2 Answers

The problem apparently is a non-standard DBSCAN implementation in scikit-learn.

DBSCAN does not need a distance matrix. The algorithm was designed around using a database that can accelerate a regionQuery function, and return the neighbors within the query radius efficiently (a spatial index should support such queries in O(log n)).

The implementation in scikit, however, apparently computes the full O(n²) distance matrix, which comes at a cost both memory-wise and runtime-wise.

So I see two choices:

  1. You may want to try the DBSCAN implementation in ELKI instead, which, when used with an R*-tree index, is usually substantially faster than a naive implementation.

  2. Otherwise, you may want to reimplement DBSCAN, as the implementation in scikit apparently isn't too good. Don't be scared of that: DBSCAN is really simple to implement yourself. The trickiest part of a good DBSCAN implementation is actually the regionQuery function. If you can get this query fast, DBSCAN will be fast. And you can actually reuse this function for other algorithms, too. (See the sketch below.)
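
For what it's worth, here is a minimal sketch of that "implement it yourself" route (an illustration only, not code from scikit-learn or ELKI; region_query here is the naive linear scan that a spatial index would replace):

import numpy as np

UNVISITED, NOISE = -2, -1

def region_query(data, i, eps):
    # Naive O(n) scan with the Manhattan ('cityblock') metric; this is the part
    # to swap for a spatial index (k-d tree, R*-tree) if you want it fast.
    dists = np.abs(data - data[i]).sum(axis=1)
    return np.where(dists <= eps)[0]

def dbscan(data, eps, min_samples):
    n = len(data)
    labels = np.full(n, UNVISITED)
    cluster = -1
    for i in range(n):
        if labels[i] != UNVISITED:            # already assigned or marked as noise
            continue
        neighbors = region_query(data, i, eps)
        if len(neighbors) < min_samples:      # not a core point
            labels[i] = NOISE
            continue
        cluster += 1                          # start a new cluster and expand it
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:            # noise reachable from a core point -> border point
                labels[j] = cluster
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(data, j, eps)
            if len(j_neighbors) >= min_samples:   # j is itself a core point: keep expanding
                seeds.extend(j_neighbors)
    return labels

With the question's data and parameters you would call it as labels = dbscan(mydata, eps=40, min_samples=10), where -1 in the result marks noise points.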

Update: by now, sklearn no longer computes a distance matrix and can, e.g., use a kd-tree index. However, because of "vectorization" it will still precompute the neighbors of every point, so the memory usage of sklearn for large epsilon is O(n²), whereas to my understanding the version in ELKI will only use O(n) memory. So if you run out of memory, choose a smaller epsilon and/or try ELKI.
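
For reference, with a recent scikit-learn this is roughly what the index-backed route looks like (a sketch only, reusing the question's mydata; the eps value is a placeholder, the point being to keep it small):

from sklearn.cluster import DBSCAN

# algorithm='ball_tree' (or 'kd_tree') uses a tree-based neighbor search instead of a
# brute-force distance matrix; memory then grows with the size of the neighborhoods
# actually found, so a tight eps keeps it manageable.
db = DBSCAN(eps=10, min_samples=10, metric='cityblock', algorithm='ball_tree').fit(mydata)
labels = db.labels_   # -1 marks noise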

answered Sep 24 '22 by Has QUIT--Anony-Mousse


You can do this using scikit-learn's DBSCAN with the haversine metric and ball-tree algorithm. You do not need to precompute a distance matrix.

This example clusters over a million GPS latitude-longitude points with DBSCAN/haversine and avoids memory usage problems:

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

df = pd.read_csv('gps.csv')
coords = df.as_matrix(columns=['lat', 'lon'])   # as_matrix() was the pandas API of the day
db = DBSCAN(eps=eps, min_samples=ms, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))
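
One thing the snippet leaves implicit (an addition here, not part of the original answer): with metric='haversine' both the coordinates and eps are in radians, so a ground distance has to be divided by the Earth's radius. For example, for a 1.5 km search radius (the 1.5 km and min_samples values are only illustrative):

kms_per_radian = 6371.0088          # mean Earth radius in kilometers
eps = 1.5 / kms_per_radian          # 1.5 km expressed in radians
ms = 10                             # min_samples for the snippet above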

Note that this specifically uses scikit-learn v0.15, as some earlier/later versions seem to require a full distance matrix to be computed, which blows up your RAM real quick. But if you use Anaconda, you can quickly set this up with:

conda install scikit-learn=0.15 

Or, create a clean virtual environment for this clustering task:

conda create -n clusterenv python=3.4 scikit-learn=0.15 matplotlib pandas jupyter
activate clusterenv
answered Sep 24 '22 by eos