I have objects and a distance function, and I want to cluster them using the DBSCAN method in scikit-learn. My objects don't have a representation in Euclidean space. I know that it is possible to use a precomputed metric, but in my case that is very impractical because of the large size of the distance matrix. Is there any way to overcome this in scikit-learn? Or are there other Python implementations of DBSCAN that can do so?
scikit-learn has support for a large variety of metrics.
Some of them can be accelerated using the k-d tree (very fast) or the ball tree (fast). Others rely on precomputed distance matrices (fast, but needs a lot of memory), on Cython implementations without precomputation (quadratic runtime), or even on Python callbacks (very slow).
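As a rough illustration of these options, here is a minimal sketch that runs DBSCAN on the same data through the `algorithm` parameter and through `metric="precomputed"`. The toy grid data and the values of `eps` and `min_samples` are assumptions chosen only so the result is easy to check by hand.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

# Toy data (assumed for illustration): two 3x3 grids of points, far apart.
grid = np.array([[x, y] for x in range(3) for y in range(3)], dtype=float)
X = np.vstack([grid, grid + 10.0])

results = {}

# Index-accelerated and brute-force neighbor searches on the raw vectors.
for algorithm in ("kd_tree", "ball_tree", "brute"):
    model = DBSCAN(eps=1.5, min_samples=3, metric="euclidean", algorithm=algorithm)
    results[algorithm] = model.fit_predict(X)

# Precomputed variant: the full n x n distance matrix is held in memory.
D = pairwise_distances(X, metric="euclidean")
results["precomputed"] = DBSCAN(eps=1.5, min_samples=3, metric="precomputed").fit_predict(D)

for name, labels in results.items():
    print(name, labels)
```

All four variants should find the same two clusters here. They differ in how the neighbors are found, and therefore in runtime and memory use on larger data.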
The last option, a Python callback, is implemented but extremely slow:
    import numpy
    from sklearn.cluster import DBSCAN

    def mydistance(x, y):
        # squared Euclidean distance, evaluated in Python for every pair
        return numpy.sum((x - y) ** 2)

    labels = DBSCAN(eps=eps, min_samples=minpts, metric=mydistance).fit_predict(X)
which is, unfortunately, much, much slower than
    labels = DBSCAN(eps=eps, min_samples=minpts, metric='euclidean').fit_predict(X)
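If the objects have no vector representation at all, one possible workaround is to cluster row indices and let the callback look up the real objects. This is only a sketch: the `levenshtein` helper, the word list, and `index_metric` are hypothetical stand-ins for your own objects and distance function, and `algorithm="brute"` is my own explicit choice here.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def levenshtein(a, b):
    """Edit distance between two strings (stand-in for a custom distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Hypothetical objects without a Euclidean representation.
words = ["kitten", "sitten", "mitten", "sitting",
         "banana", "bananas", "banan", "xylophone"]

def index_metric(i, j):
    # scikit-learn passes 1-element float rows; map them back to the objects.
    return levenshtein(words[int(i[0])], words[int(j[0])])

# The "data" handed to DBSCAN is just a column of indices into `words`.
X = np.arange(len(words)).reshape(-1, 1)

labels = DBSCAN(eps=2, min_samples=2, metric=index_metric,
                algorithm="brute").fit_predict(X)
print(dict(zip(words, labels)))
```

This still has quadratic runtime and pays the Python callback cost for every pair, so it is only practical for moderately sized data, but it avoids having to precompute and store the distance matrix yourself.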
I found ELKI to perform much better when you need to use your own distance functions. Java can compile them to near-native speed using the HotSpot JIT compiler. Python (currently) cannot do this.