I am working with GPS data (latitude, longitude). For density based clustering I have used DBSCAN in R.
Advantages of DBSCAN in my case:
I can calculate a distance matrix (using Haversine Distance Formula) and use that as input in dbscan
library(fossil)
dist<- earth.dist(df, dist=T) #df is dataset containing lat long values
library(fpc)
dens<-dbscan(dist,MinPts=25,eps=0.43,method="dist")
Now, when I look at the clusters, they are not meaningful. Some clusters have points which are more than 1km apart. I want dense clusters but not that big in size.
Different values of MinPts
and eps are taken care of and I have also used k nearest neighbor distance graph to get an optimum value of eps
for MinPts
=25
What dbscan
is doing is going to every point in my dataset and if point p has MinPts
in its eps
neighborhood it will make a cluster but at the same time it is also joining the clusters which are density reachable (which I guess are creating a problem for me).
It really is a big question, particularly "how to reduce size of a cluster without affecting its information too much", but I will write it down as the following points:
dens$cluster
, but how would I know if a
particular point is core or border?eps
. But that's not the case because density reachable
clusters are combined together.dbscan
but can give me more meaningful clusters?OPTICS
is another alternative but will it solve my issue?
Note:
By meaningful I want to say closer points should be in a cluster. But points which are 1km or more apart should not be in the same cluster.
DBSCAN algorithm DBSCAN stands for density-based spatial clustering of applications with noise. It is able to find arbitrary shaped clusters and clusters with noise (i.e. outliers). The main idea behind DBSCAN is that a point belongs to a cluster if it is close to many points from that cluster.
DBSCAN is a typically used clustering algorithm due to its clustering ability for arbitrarily-shaped clusters and its robustness to outliers. Generally, the complexity of DBSCAN is O(n^2) in the worst case, and it practically becomes more severe in higher dimension.
The DBSCAN algorithm should be used to find associations and structures in data that are hard to find manually but that can be relevant and useful to find patterns and predict trends.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular learning method utilized in model building and machine learning algorithms. This is a clustering method that is used in machine learning to separate clusters of high density from clusters of low density.
DBSCAN doesn't claim the radius is the maximum cluster size.
Have you read the article? It's looking for arbitrarily shaped clusters; eps
is just the core size of a point; roughly the size used for density estimation; any point within this radius of a core point will be part of a cluster.
This makes it essentially the maximum step size to connect dense points. But they may still form a chain of density connected points, of arbitary shape or size.
I don't know what cluster 0 is in your R implementation. I've experimented with the R implementation, but it was waaaay slower than all the others. I don't recommend using R, there are much better tools for cluster analysis available, such as ELKI. Try running DBSCAN with your settings on ELKI, with LatLngDistanceFunction and and sort-tile-recursive loaded R-tree index. You'll be surprised how fast it can be, compared to R.
OPTICS is looking for the same density connected type of clusters. Are you sure this arbitrarily-shaped type of clusters is what you are looking for?
IMHO, you are using the wrong method for your goals (and you aren't really explaining what you are trying to achieve)
If you want a hard limit on the cluster diameter, use complete-linkage hierarchical clustering.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With