DBSCAN with custom metric

I have the following given:

  • a dataset in the range of thousands

  • a way of computing the similarity between datapoints, although I cannot plot the datapoints themselves in Euclidean space

I know that DBSCAN should support a custom distance metric, but I don't know how to use it.

Say I have a function

def similarity(x, y):
    return similarity ...

and a list of data that can be passed pairwise into that function. How do I specify this when using the DBSCAN implementation of scikit-learn?

Ideally, what I want is a list of the clusters, but I can't figure out how to get started in the first place.

There is a lot of terminology that still confuses me:

http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

What is a feature array, and how do I pass one? How do I fit this implementation to my needs? How will I be able to get my "sublists" from this algorithm?

zython asked Feb 13 '18 13:02



People also ask

Why DBSCAN has no predict method?

Because there is no labeled training data available for clustering. It has to make up new labels for the data, based on what it sees. But you can't do this on a single instance, you can only "bulk predict".

What is leaf size in DBSCAN?

leaf_size : int, optional (default = 30) Leaf size passed to BallTree or cKDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.

What is DBSCAN EPS?

eps: specifies how close points should be to each other to be considered part of a cluster. If the distance between two points is less than or equal to this value (eps), the points are considered neighbors. minPoints (called min_samples in scikit-learn): the minimum number of points needed to form a dense region.
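
To make the effect of these two parameters concrete, here is a minimal sketch on made-up one-dimensional data. The points and the eps and min_samples values are illustrative assumptions, not recommendations for any real dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy 1-D data (assumed for illustration): two tight groups and one isolated point.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2], [20.0]])

# With min_samples=2, every point that has at least one neighbor within eps
# (the point itself counts) is a core point, so both groups become clusters.
tight = DBSCAN(eps=0.15, min_samples=2).fit(X)

# With min_samples=4, no point has enough neighbors within eps,
# so every point is labeled as noise (-1).
strict = DBSCAN(eps=0.15, min_samples=4).fit(X)

print(tight.labels_)
print(strict.labels_)
```

Noise points get the label -1, so raising min_samples or lowering eps generally moves points from clusters into noise.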


1 Answer

A "feature array" is simply an array of the features of a datapoint in your dataset.

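metric is the parameter you're looking for. It can be a string (the name of a built-in metric) or a callable, and your similarity function is a callable. The documentation doesn't describe this well, but a metric has to do just that: take two datapoints as parameters and return a number. Note that DBSCAN treats that number as a distance, so smaller values must mean closer points; a similarity score where larger means more alike needs to be converted first (for example, 1 - similarity if it lies in [0, 1]).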

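from sklearn.cluster import DBSCAN

def similarity(x, y):
    return ...

# fit() returns the fitted estimator, not a transformed dataset
db = DBSCAN(metric=similarity).fit(dataset)
labels = db.labels_  # one cluster label per datapoint; -1 marks noise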
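
If the datapoints cannot be expressed as numeric feature vectors, one option is metric="precomputed": compute the full pairwise distance matrix yourself and pass that to fit instead of the raw data. The sketch below assumes a similarity that lies in [0, 1], so that 1 - similarity can serve as a distance. The toy similarity function (Jaccard overlap of character sets), the example strings, and the eps and min_samples values are all hypothetical stand-ins.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical similarity in [0, 1]: Jaccard overlap of the character sets.
def similarity(x, y):
    a, b = set(x), set(y)
    return len(a & b) / len(a | b)

# Made-up data that has no natural representation as feature vectors.
data = ["abcd", "abce", "abcf", "wxyz", "wxyv", "wxyu", "mnop"]

# Build the symmetric pairwise distance matrix (diagonal stays 0).
n = len(data)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = 1.0 - similarity(data[i], data[j])
        dist[i, j] = dist[j, i] = d

# eps and min_samples are illustrative values, not tuned ones.
db = DBSCAN(eps=0.5, min_samples=2, metric="precomputed").fit(dist)

# Group the original items by cluster label to get the "sublists";
# the key -1 collects the points DBSCAN considers noise.
clusters = {}
for item, label in zip(data, db.labels_):
    clusters.setdefault(int(label), []).append(item)

print(clusters)
```

For a few thousand datapoints the n x n matrix is still manageable in memory, though the double loop calls similarity roughly n²/2 times, so an expensive similarity function will dominate the runtime.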
j4nw answered Sep 27 '22 20:09
