I have implemented the DBSCAN algorithm in R, and i am matching the cluster assignments with the DBSCAN implementation of the fpc library. Testing is done on synthetic data which is generated as given in the fpc library dbscan example:
n <- 600
x <- cbind(runif(10, 0, 10)+rnorm(n, sd=0.2), runif(10, 0, 10)+rnorm(n, sd=0.3))
Clustering is done with parameters as below:
eps = 0.2
MinPts = 5
I am comparing the cluster assignments of the fpc::dbscan
with my implementation of dbscan
. Maximum of the runs shows every point was classified identically by both implementations.
But there are some cases where 1 or 2 points and some rare times 5 or 6 points are assigned to different clusters in my implementation than that in the fpc implementation. I have noticed that only border points classification differs. After plotting i have seen that the points whose cluster membership does not match in the implementations are in such a position, such that it can be assigned to any of its surrounding clusters, depending on from which cluster's seed point it was discovered first.
I am showing an image with 150 points (to avoid clutter), where 1 point classification differs. Note that mismatch point cluster number is always greater in my implementation than the fpc implementation.
Top inset is fpc::dbscan, bottom inset is my dbscan implementation
Note The point which differs in my implementation is marked with an exclamation mark (!) I am also uploading zoomed images of the mismatch section:
+
are core points
o
are border points
-
are noise points
!
highlights the differing point
triangles are core points coloured circles are border points black circles are noise points
EDIT
As requested by Anony-Mousse
In different cases sometimes it seems that my implementation has classified the mismatch point correctly and sometimes it seems fpc implementation has classified the mismatch correctly. See below:
fpc::dbscan (with the triangle plot ones) seems to have classified the mismatch point correctly
my dbscan implementation (with + plot ones) seems to have classified the mismatch point correctly
I am new into cluster analysis therefore i have another question: is these type of difference allowable?
In my implementation i am scanning from the first point to the last point as it is supplied, also in fpc::dbscan
the points are scanned in the same order. In such case both of the implementation should have discovered the mismatch point (marked by !
) from the same cluster center. Also i have generates some cases in which fpc::dbscan
marks a point as noise, but my implementation assigns it to some clusters. In this case why is this difference occurring?
Code segments on request.
DBSCAN is known to be order dependant for border points. They will be assigned to the cluster they are first discovered from. If a border point is not dense, but in the vincinity of two dense points from different clusters, it can be assigned to either.
This is why DBSCAN is often described as "order independent, except for border points".
Try shuffling the data (or reversing!), then rerunning your algorithm. The results should change.
As I assume neither your nor the fpc implementation has index support (to speed up range queries and make the algorithm run in O(n log n)
), I'd guess that one of the implementations is processing the points in forward order, the other one in backward order. '''Update: indexes should not play much of a role, as they don't change the order across clusters, only within one cluster'''.
Another option for "generating" this difference is to
fbc::dbscan
seems to do this)These will also generate different results on objects that are border points to more than once cluster. There also is the possibility to assign these points to both cluters, which will yield a non-strict partitioning of the data set. Usually, the benefits of having a strict partitioning are more important than having a fully deterministic result.
Don't get me wrong: the "overwrite" strategy of fbc::dbscan
doesn't substantially change the results. I would probably even implement it that way myself.
Are any non-border points affected?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With