Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Cluster assignments differ sometimes in two DBSCAN implementations

I have implemented the DBSCAN algorithm in R, and i am matching the cluster assignments with the DBSCAN implementation of the fpc library. Testing is done on synthetic data which is generated as given in the fpc library dbscan example:

n <- 600
x <- cbind(runif(10, 0, 10)+rnorm(n, sd=0.2), runif(10, 0, 10)+rnorm(n, sd=0.3))

Clustering is done with parameters as below:

eps = 0.2
MinPts = 5

I am comparing the cluster assignments of the fpc::dbscan with my implementation of dbscan . Maximum of the runs shows every point was classified identically by both implementations.

But there are some cases where 1 or 2 points and some rare times 5 or 6 points are assigned to different clusters in my implementation than that in the fpc implementation. I have noticed that only border points classification differs. After plotting i have seen that the points whose cluster membership does not match in the implementations are in such a position, such that it can be assigned to any of its surrounding clusters, depending on from which cluster's seed point it was discovered first.

I am showing an image with 150 points (to avoid clutter), where 1 point classification differs. Note that mismatch point cluster number is always greater in my implementation than the fpc implementation.

Plot of clusters.

Top inset is fpc::dbscan, bottom inset is my dbscan implementation

Plot of clusters. Top inset is fpc::dbscan, bottom inset is my dbscan implementation

Note The point which differs in my implementation is marked with an exclamation mark (!) I am also uploading zoomed images of the mismatch section:


My dbscan implementation output

+ are core points

o are border points

- are noise points

! highlights the differing point

my dbscan implementation


fpc::dbscan implementation output

triangles are core points coloured circles are border points black circles are noise points enter image description here


Another example:

My dbscan implementation output

enter image description here


fpc::dbscan implementation output

enter image description here


EDIT

Equal x-y scaled example

As requested by Anony-Mousse

In different cases sometimes it seems that my implementation has classified the mismatch point correctly and sometimes it seems fpc implementation has classified the mismatch correctly. See below:

fpc::dbscan (with the triangle plot ones) seems to have classified the mismatch point correctly

enter image description here

my dbscan implementation (with + plot ones) seems to have classified the mismatch point correctly

enter image description here

Question

  • I am new into cluster analysis therefore i have another question: is these type of difference allowable?

  • In my implementation i am scanning from the first point to the last point as it is supplied, also in fpc::dbscan the points are scanned in the same order. In such case both of the implementation should have discovered the mismatch point (marked by !) from the same cluster center. Also i have generates some cases in which fpc::dbscan marks a point as noise, but my implementation assigns it to some clusters. In this case why is this difference occurring?

Code segments on request.

like image 441
phoxis Avatar asked Jun 02 '12 08:06

phoxis


1 Answers

DBSCAN is known to be order dependant for border points. They will be assigned to the cluster they are first discovered from. If a border point is not dense, but in the vincinity of two dense points from different clusters, it can be assigned to either.

This is why DBSCAN is often described as "order independent, except for border points".

Try shuffling the data (or reversing!), then rerunning your algorithm. The results should change.

As I assume neither your nor the fpc implementation has index support (to speed up range queries and make the algorithm run in O(n log n)), I'd guess that one of the implementations is processing the points in forward order, the other one in backward order. '''Update: indexes should not play much of a role, as they don't change the order across clusters, only within one cluster'''.

Another option for "generating" this difference is to

  • keep the first (non-noise) cluster assignment of each point (IIRC official DBSCAN pseudocode)
  • keep the last cluster assignment of each point (fbc::dbscan seems to do this)

These will also generate different results on objects that are border points to more than once cluster. There also is the possibility to assign these points to both cluters, which will yield a non-strict partitioning of the data set. Usually, the benefits of having a strict partitioning are more important than having a fully deterministic result.

Don't get me wrong: the "overwrite" strategy of fbc::dbscan doesn't substantially change the results. I would probably even implement it that way myself.

Are any non-border points affected?

like image 57
Has QUIT--Anony-Mousse Avatar answered Sep 30 '22 07:09

Has QUIT--Anony-Mousse