
DBSCAN for clustering of geographic location data

I have a dataframe with latitude and longitude pairs.

Here is what my dataframe looks like:

        order_lat  order_long
    0   19.111841   72.910729
    1   19.111342   72.908387
    2   19.111342   72.908387
    3   19.137815   72.914085
    4   19.119677   72.905081
    5   19.119677   72.905081
    6   19.119677   72.905081
    7   19.120217   72.907121
    8   19.120217   72.907121
    9   19.119677   72.905081
    10  19.119677   72.905081
    11  19.119677   72.905081
    12  19.111860   72.911346
    13  19.111860   72.911346
    14  19.119677   72.905081
    15  19.119677   72.905081
    16  19.119677   72.905081
    17  19.137815   72.914085
    18  19.115380   72.909144
    19  19.115380   72.909144
    20  19.116168   72.909573
    21  19.119677   72.905081
    22  19.137815   72.914085
    23  19.137815   72.914085
    24  19.112955   72.910102
    25  19.112955   72.910102
    26  19.112955   72.910102
    27  19.119677   72.905081
    28  19.119677   72.905081
    29  19.115380   72.909144
    30  19.119677   72.905081
    31  19.119677   72.905081
    32  19.119677   72.905081
    33  19.119677   72.905081
    34  19.119677   72.905081
    35  19.111860   72.911346
    36  19.111841   72.910729
    37  19.131674   72.918510
    38  19.119677   72.905081
    39  19.111860   72.911346
    40  19.111860   72.911346
    41  19.111841   72.910729
    42  19.111841   72.910729
    43  19.111841   72.910729
    44  19.115380   72.909144
    45  19.116625   72.909185
    46  19.115671   72.908985
    47  19.119677   72.905081
    48  19.119677   72.905081
    49  19.119677   72.905081
    50  19.116183   72.909646
    51  19.113827   72.893833
    52  19.119677   72.905081
    53  19.114100   72.894985
    54  19.107491   72.901760
    55  19.119677   72.905081

I want to cluster the points that are near each other (within 200 meters of one another). Here is my distance matrix:

    from scipy.spatial.distance import pdist, squareform

    distance_matrix = squareform(pdist(X, (lambda u,v: haversine(u,v))))

    array([[ 0.        ,  0.2522482 ,  0.2522482 , ...,  1.67313071,
             1.05925366,  1.05420922],
           [ 0.2522482 ,  0.        ,  0.        , ...,  1.44111548,
             0.81742536,  0.98978355],
           [ 0.2522482 ,  0.        ,  0.        , ...,  1.44111548,
             0.81742536,  0.98978355],
           ...,
           [ 1.67313071,  1.44111548,  1.44111548, ...,  0.        ,
             1.02310118,  1.22871515],
           [ 1.05925366,  0.81742536,  0.81742536, ...,  1.02310118,
             0.        ,  1.39923529],
           [ 1.05420922,  0.98978355,  0.98978355, ...,  1.22871515,
             1.39923529,  0.        ]])

Then I apply the DBSCAN clustering algorithm to the distance matrix:

    from sklearn.cluster import DBSCAN

    db = DBSCAN(eps=2, min_samples=5)
    y_db = db.fit_predict(distance_matrix)

I don't know how to choose the eps and min_samples values. It puts points that are far apart (approximately 2 km) into one cluster. Is that because it calculates Euclidean distance while clustering? Please help.

Neil asked Jan 03 '16 17:01




1 Answer

You can cluster spatial latitude-longitude data with scikit-learn's DBSCAN without precomputing a distance matrix.

    db = DBSCAN(eps=2/6371., min_samples=5, algorithm='ball_tree',
                metric='haversine').fit(np.radians(coordinates))

This comes from this tutorial on clustering spatial data with scikit-learn DBSCAN. In particular, notice that the eps value is still 2 km, but it is divided by 6371 (the Earth's mean radius in km) to convert it to radians. Also notice that .fit() takes the coordinates in radians, ordered as (latitude, longitude), because that is what the haversine metric expects.
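As a rough, self-contained sketch of the same idea (not taken from the tutorial), the snippet below clusters a handful of points copied from the question using the 200 m radius the question asks for. The value min_samples=3 is an assumption chosen only to suit this tiny sample, and the names EARTH_RADIUS_KM, EPS_KM and MIN_SAMPLES are illustrative. It also shows a second variant that keeps a precomputed distance matrix in km, which needs metric="precomputed".

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import haversine_distances

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, same constant as in the answer
EPS_KM = 0.2              # 200 m, the radius asked for in the question
MIN_SAMPLES = 3           # assumption: lowered from 5 to suit this tiny sample

# A few (lat, lon) pairs in degrees, copied from the question's dataframe.
coordinates = np.array([
    [19.119677, 72.905081],
    [19.119677, 72.905081],
    [19.119677, 72.905081],
    [19.111841, 72.910729],
    [19.111860, 72.911346],
    [19.111841, 72.910729],
    [19.137815, 72.914085],
])
coords_rad = np.radians(coordinates)

# Variant 1: let DBSCAN compute haversine distances itself (eps in radians).
labels_haversine = DBSCAN(
    eps=EPS_KM / EARTH_RADIUS_KM,
    min_samples=MIN_SAMPLES,
    algorithm="ball_tree",
    metric="haversine",
).fit_predict(coords_rad)

# Variant 2: pass a distance matrix in km and declare it as precomputed,
# so eps is interpreted in the matrix's own unit (km here).
dist_km = haversine_distances(coords_rad) * EARTH_RADIUS_KM
labels_precomputed = DBSCAN(
    eps=EPS_KM, min_samples=MIN_SAMPLES, metric="precomputed"
).fit_predict(dist_km)

print(labels_haversine)
print(labels_precomputed)
```

Both variants should agree, with the isolated last point labelled as noise (-1). If metric="precomputed" is left out, DBSCAN falls back to its default Euclidean metric and treats each row of the matrix as a feature vector, which is probably why the call in the question grouped points that are far apart.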

eos answered Sep 23 '22 21:09