Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

OPTICS Clustering algorithm. How to get the best epsilon

I am implementing a project which needs to cluster geographical points. OPTICS algorithm seems to be a very nice solution. It needs just 2 parameters as input(MinPts and Epsilon), which are, respectively, the minimum number of points needed to consider them as a cluster, and the distance value used to compare if two points are in can be placed in same cluster.

My problem is that, due to the extreme variety of the points, I can't set a fixed epsilon. Just look at the image below.

the problem

The same points structure but in a different scale would result very different. Suppose to set MinPts=2 and epsilon = 1Km. On the left, the algorithm would create 2 clusters(red and blue), but on the right it would create one single cluster containing all of the points(red), but I would like to obtain 2 clusters even on the right.

So my question is: is there any kind of way to calculate dynamically the epsilon value to get this result?

EDIT 05 June 2012 3.15pm: I thought I was using the OPTICS algorithm implementation from the javaml library, but it seems it is actually a DBSCAN algorithm implementation. So the question now is: does anybody know a java based implementation of OPTICS algorithm?

Thank you very much and excuse my for my poor english.

Marco

like image 554
smellyarmpits Avatar asked Jun 04 '12 16:06

smellyarmpits


People also ask

How do you know which epsilon is best for DBSCAN?

In layman's terms, we find a suitable value for epsilon by calculating the distance to the nearest n points for each point, sorting and plotting the results. Then we look to see where the change is most pronounced (think of the angle between your arm and forearm) and select that as epsilon.

What is epsilon in clustering?

epsilon = clusterDBSCAN. estimateEpsilon( X , MinNumPoints , MaxNumPoints ) returns an estimate of the neighborhood clustering threshold, epsilon , used in the density-based spatial clustering of applications with noise (DBSCAN)algorithm. epsilon is computed from input data X using a k-nearest neighbor (k-NN) search.

How does the epsilon value affect the DBSCAN clustering algorithm?

In other words, it is the distance that DBSCAN uses to determine if two points are similar and belong together. A larger epsilon will produce broader clusters (encompassing more data points) and a smaller epsilon will build smaller clusters.

How do you find epsilon and MinPts in DBSCAN?

epsilon (ε): The value for ε can then be chosen by using a k-distance graph, plotting the distance to the k = MinPts-1 nearest neighbor ordered from the largest to the smallest value. Good values of ε are where this plot shows an “elbow”.


3 Answers

The epsilon value in OPTICS is solely to limit the runtime complexity when using index structures. If you do not have an index for acceleration, you can set it to infinity.

To quote Wikipedia on OPTICS

The parameter \varepsilon is strictly speaking not necessary. It can be set to a maximum value. When a spatial index is available, it does however play a practical role when it comes to complexity.

What you seem to have looks much more like DBSCAN than OPTICS. In OPTICS, you should not need to choose epsilon (it should have been called max-epsilon by the authors!), but your cluster extraction method will take care of that. Are you using the Xi extraction proposed in the OPTICS paper?

minPts is much more important. You should try a value of at least 5 or 10, not 2. With 2, you are essentially performing single-linkage clustering!

The example you gave above should work fine once you increase minPts!

Re: edit: As you can even see in the Wikipedia article, ELKI has a proper OPTICS implementation and it's in Java.

like image 128
Has QUIT--Anony-Mousse Avatar answered Oct 19 '22 02:10

Has QUIT--Anony-Mousse


You'd can try to scale epsilon by the total size of the enclosing rectangle. For example, your left data is about 4km x 6km (using my Mark I eyeball to measure) and the right is about 2km x 2km. So, epsilon on the right should be about 2.5 times smaller.

Of course, this doesn't work reliably. If, on your right hand data, there were an additional single point 4km to the right and 2km down, that would make the enclosing rectangle for the right the same as on the left, and you'd get similar (wrong) results.

like image 29
user949300 Avatar answered Oct 19 '22 02:10

user949300


You can try a minimum spanning tree and then remove the longest edge. The remaining spanning tree and the center of them is the best center for OPTICS and you can count the numbers of points around it.

like image 25
Micromega Avatar answered Oct 19 '22 00:10

Micromega