Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

DBSCAN using spatial and temporal data

I am looking at data points that have lat, lng, and date/time of event. One of the algorithms I came across when looking at clustering algorithms was DBSCAN. While it works ok at clustering lat and lng, my concern is it will fall apart when incorporating temporal information, since it's not of the same scale or same type of distance.

What are my options for incorporating temporal data into the DBSCAN algorithm?

like image 469
cbake Avatar asked Jun 02 '15 17:06

cbake


People also ask

Is DBSCAN can be used when examining spatial data?

Density-based spatial clustering of applications with noise (DBSCAN) is a well-known data clustering algorithm that is commonly used in data mining and machine learning.

Does DBSCAN work with high dimensional data?

An Efficient Density-based Clustering Algorithm for Higher-Dimensional Data. DBSCAN is a typically used clustering algorithm due to its clustering ability for arbitrarily-shaped clusters and its robustness to outliers.

What are the 2 major components of DBSCAN clustering?

DBSCAN requires two parameters: ε (eps) and the minimum number of points required to form a dense region (minPts).

Can DBSCAN be used for anomaly detection?

The experimental result shows that the anomaly detecting based on enhanced DBScan algorithm can a higher detection rate and a low rate of false positives of DARPA data sets.


2 Answers

Look up Generalized DBSCAN by the same authors.

Sander, Jörg; Ester, Martin; Kriegel, Hans-Peter; Xu, Xiaowei (1998). Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications. Data Mining and Knowledge Discovery (Berlin: Springer-Verlag) 2(2): 169–194. doi:10.1023/A:1009745219419.

For (Generalized) DBSCAN, you need two functions:

  1. findNeighbors - get all "related" objects from your database

  2. corePoint - decide whether this set is enough to start a cluster

then you can repeatedly find neighbors to grow the clusters.

Function 1 is where you want to hook into, for example by using two thresholds: one that is geographic and one that is temporal (i.e. within 100 miles, and within 1 hour).

like image 69
Has QUIT--Anony-Mousse Avatar answered Sep 30 '22 15:09

Has QUIT--Anony-Mousse


tl;dr you are going to have to modify your feature set, i.e. scaling your date/time to match the magnitude of your geo data.

DBSCAN's input is simply a vector, and the algorithm itself doesn't know that one dimension (time) is orders of magnitudes bigger or smaller than another (distance). Thus, when calculating the density of data points, the difference in scaling will screw it up.

Now I suppose you can modify the algorithm itself to treat different dimensions differently. This can be done by changing the definition of "distance" between two points, i.e. supplying your own distance function, instead of using the default Euclidean distance.

IMHO, though, the easier thing to do is to scale one of your dimension to match another. just multiply your time values by a fixed, linear factor so they are on the same order of magnitude as the geo values, and you should be good to go.

more generally, this is part of the features selection process, which is arguably the most important part of solving any machine learning algorithm. choose the right features, and transform them correctly, and you'd be more than halfway to a solution.

like image 25
oxymor0n Avatar answered Sep 30 '22 14:09

oxymor0n