
Scikit DBSCAN eps and min_samples value determination

I have been trying to implement DBSCAN using scikit-learn, but so far I have failed to find values of eps and min_samples that give me a sizeable number of clusters. I computed the average value in the distance matrix and tried eps values on either side of that mean, but I haven't obtained a satisfactory number of clusters:

Input:

from sklearn.cluster import DBSCAN

# X is the matrix of feature vectors (not a distance matrix)
db = DBSCAN(eps=13.0, min_samples=100).fit(X)
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)

Output:

Estimated number of clusters: 1

Input:

db = DBSCAN(eps=27.0, min_samples=100).fit(X)

Output:

Estimated number of clusters: 1

Some other information:

The average distance between any two points in the distance matrix is 16.8354
The min distance is 1.0
The max distance is 258.653

Note that the X passed in the code is not the distance matrix but the matrix of feature vectors. How do I determine these parameters?

Rakesh Sharma asked Dec 25 '22 02:12


1 Answer

  1. Plot a k-distance graph and look for a knee in it, as suggested in the DBSCAN article. (Your min_samples might be too high: you probably won't see a knee in the 100-distance graph.)

  2. Visualize your data. If you can't see clusters visually, there might not be any. DBSCAN cannot be forced to produce an arbitrary number of clusters. If your data set follows a single Gaussian distribution, it is supposed to form a single cluster only.
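The k-distance graph from point 1 could be computed roughly as follows. This is only a sketch, not the asker's code: the helper name k_distances, the synthetic X_demo data, and the choice k=4 are all assumptions made for illustration. It uses scikit-learn's NearestNeighbors to get each point's distance to its k-th nearest neighbour and sorts those distances.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def k_distances(X, k):
    """Return each point's distance to its k-th nearest neighbour, sorted ascending.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    # Request k + 1 neighbours: when querying the training data itself,
    # every point comes back as its own neighbour at distance 0.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    # Column k holds the distance to the k-th neighbour other than the point itself.
    return np.sort(distances[:, k])


# Synthetic stand-in for the real feature matrix (an assumption for this demo):
# two well-separated 2-D Gaussian blobs.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(200, 2)),
    rng.normal(loc=10.0, scale=1.0, size=(200, 2)),
])

kd = k_distances(X_demo, k=4)
print("smallest k-distances:", kd[:3])
print("largest k-distances:", kd[-3:])
```

Plotting the returned array against its index (for example with matplotlib) gives the k-distance graph; the distance at which the curve bends sharply upward is a candidate for eps. Note that scikit-learn's min_samples counts the point itself, so a k near min_samples - 1 is one reasonable pairing, though that choice is a judgment call.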

Has QUIT--Anony-Mousse answered Dec 29 '22 05:12