I have the following given:
a dataset in the range of thousands
a way of computing the similarity, but the datapoints themselves I cannot plot them in euclidian space
I know that DBSCAN should support custom distance metric but I dont know how to use it.
say I have a function
def similarity(x,y):
return similarity ...
and I have a list of data that can be passed pairwise into that function, how do I specify this when using the DBSCAN implementation of scikit-learn ?
Ideally what I want to do is to get a list of the clusters but I cant figure out how to get started in the first place.
There is a lot of terminology that still confuses me:
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
How do I pass a feature array and what is it ? How do I fit this implementation to my needs ? How will I be able to get my "sublists" from this algorithm ?
Because there is no labeled training data available for clustering. It has to make up new labels for the data, based on what it sees. But you can't do this on a single instance, you can only "bulk predict".
leaf_size : int, optional (default = 30) Leaf size passed to BallTree or cKDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
eps: specifies how close points should be to each other to be considered a part of a cluster. It means that if the distance between two points is lower or equal to this value (eps), these points are considered neighbors. minPoints: the minimum number of points to form a dense region.
A "feature array" is simply an array of the features of a datapoint in your dataset.
metric
is the parameter you're looking for. It can be a string (the name of a builtin metric), or a callable. Your similarity
function is a callable. This isn't well described in the documentation, but a metric has to do just that, take two datapoints as parameters, and return a number.
def similarity(x, y):
return ...
reduced_dataset = sklearn.cluster.DBSCAN(metric=similarity).fit(dataset)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With