I have a dataframe with latitude and longitude pairs.
Here is what my dataframe looks like.
        order_lat  order_long
    0   19.111841   72.910729
    1   19.111342   72.908387
    2   19.111342   72.908387
    3   19.137815   72.914085
    4   19.119677   72.905081
    5   19.119677   72.905081
    6   19.119677   72.905081
    7   19.120217   72.907121
    8   19.120217   72.907121
    9   19.119677   72.905081
    10  19.119677   72.905081
    11  19.119677   72.905081
    12  19.111860   72.911346
    13  19.111860   72.911346
    14  19.119677   72.905081
    15  19.119677   72.905081
    16  19.119677   72.905081
    17  19.137815   72.914085
    18  19.115380   72.909144
    19  19.115380   72.909144
    20  19.116168   72.909573
    21  19.119677   72.905081
    22  19.137815   72.914085
    23  19.137815   72.914085
    24  19.112955   72.910102
    25  19.112955   72.910102
    26  19.112955   72.910102
    27  19.119677   72.905081
    28  19.119677   72.905081
    29  19.115380   72.909144
    30  19.119677   72.905081
    31  19.119677   72.905081
    32  19.119677   72.905081
    33  19.119677   72.905081
    34  19.119677   72.905081
    35  19.111860   72.911346
    36  19.111841   72.910729
    37  19.131674   72.918510
    38  19.119677   72.905081
    39  19.111860   72.911346
    40  19.111860   72.911346
    41  19.111841   72.910729
    42  19.111841   72.910729
    43  19.111841   72.910729
    44  19.115380   72.909144
    45  19.116625   72.909185
    46  19.115671   72.908985
    47  19.119677   72.905081
    48  19.119677   72.905081
    49  19.119677   72.905081
    50  19.116183   72.909646
    51  19.113827   72.893833
    52  19.119677   72.905081
    53  19.114100   72.894985
    54  19.107491   72.901760
    55  19.119677   72.905081
I want to cluster the points that are nearest to each other (within 200 meters). The following is my distance matrix.
    from scipy.spatial.distance import pdist, squareform

    distance_matrix = squareform(pdist(X, (lambda u,v: haversine(u,v))))

    array([[ 0.        ,  0.2522482 ,  0.2522482 , ...,  1.67313071,  1.05925366,  1.05420922],
           [ 0.2522482 ,  0.        ,  0.        , ...,  1.44111548,  0.81742536,  0.98978355],
           [ 0.2522482 ,  0.        ,  0.        , ...,  1.44111548,  0.81742536,  0.98978355],
           ...,
           [ 1.67313071,  1.44111548,  1.44111548, ...,  0.        ,  1.02310118,  1.22871515],
           [ 1.05925366,  0.81742536,  0.81742536, ...,  1.02310118,  0.        ,  1.39923529],
           [ 1.05420922,  0.98978355,  0.98978355, ...,  1.22871515,  1.39923529,  0.        ]])
Then I apply the DBSCAN clustering algorithm to the distance matrix.
    from sklearn.cluster import DBSCAN

    db = DBSCAN(eps=2, min_samples=5)
    y_db = db.fit_predict(distance_matrix)
I don't know how to choose the eps and min_samples values. The result puts points that are far too distant from each other (approximately 2 km apart) into one cluster. Is it because DBSCAN calculates Euclidean distance while clustering? Please help.
DBSCAN stands for density-based spatial clustering of applications with noise. It is commonly used on spatial data because it can find arbitrarily shaped clusters, including clusters within clusters, and it is robust to noise (i.e. outliers). This makes it useful for uncovering associations and structure in data that are hard to find manually. The complexity of DBSCAN is generally O(n^2) in the worst case, and in practice the cost becomes more severe in higher dimensions.
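As a small illustration of the two properties described above (arbitrarily shaped clusters and noise handling), here is a minimal sketch using scikit-learn. The toy points and the eps=1.1 and min_samples=3 values are made up for this example and are not taken from the question.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: an L-shaped chain (a non-convex cluster),
# a compact blob, and one stray point far from everything else.
l_shape = [[float(x), 0.0] for x in range(6)] + [[5.0, float(y)] for y in range(1, 6)]
blob = [[20.0, 20.0], [20.5, 20.0], [20.0, 20.5], [20.5, 20.5]]
stray = [[10.0, 30.0]]
points = np.array(l_shape + blob + stray)

# eps is the neighborhood radius; min_samples counts the point itself.
labels = DBSCAN(eps=1.1, min_samples=3).fit_predict(points)

print(labels)
```

The L-shaped chain ends up as a single cluster even though it is not convex, the blob forms a second cluster, and the stray point receives the label -1, which is how scikit-learn marks noise.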
You can cluster spatial latitude-longitude data with scikit-learn's DBSCAN without precomputing a distance matrix.
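    import numpy as np
    from sklearn.cluster import DBSCAN

    # coordinates: array of [lat, long] pairs in degrees
    db = DBSCAN(eps=2/6371., min_samples=5, algorithm='ball_tree', metric='haversine').fit(np.radians(coordinates))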
This comes from this tutorial on clustering spatial data with scikit-learn DBSCAN. In particular, notice that the eps value is still 2 km, but it's divided by 6371 (the Earth's mean radius in km) to convert it to radians. Also, notice that .fit() takes the coordinates in radian units for the haversine metric. For the 200-meter radius asked about in the question, the equivalent value would be eps=0.2/6371.
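To make the unit conversion concrete, here is a minimal sketch that clusters a handful of points copied from the question with a 200 m radius. The subset of points and min_samples=2 are choices made only for this illustration. It also runs the precomputed-matrix route with metric='precomputed'. Without that argument, scikit-learn treats each row of a distance matrix as a feature vector and measures Euclidean distance between rows, which is a likely reason the question's clusters looked wrong. The haversine_distances helper is assumed to be available, as it is in recent scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import haversine_distances

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, the same constant used above
EPS_KM = 0.2              # 200 meters, the radius asked for in the question

# A few [lat, long] pairs (degrees) copied from the question's dataframe.
coordinates = np.array([
    [19.111841, 72.910729],
    [19.111860, 72.911346],
    [19.112955, 72.910102],
    [19.115380, 72.909144],
    [19.116168, 72.909573],
    [19.116625, 72.909185],
    [19.137815, 72.914085],
    [19.119677, 72.905081],
])
coords_rad = np.radians(coordinates)

# Route 1: let DBSCAN compute haversine distances itself.
# eps must be an angle in radians, so divide the km radius by the Earth radius.
labels_haversine = DBSCAN(
    eps=EPS_KM / EARTH_RADIUS_KM,
    min_samples=2,  # illustrative value, not a recommendation
    algorithm='ball_tree',
    metric='haversine',
).fit_predict(coords_rad)

# Route 2: precompute a distance matrix in km and say so explicitly.
distance_matrix_km = haversine_distances(coords_rad) * EARTH_RADIUS_KM
labels_precomputed = DBSCAN(
    eps=EPS_KM,
    min_samples=2,
    metric='precomputed',
).fit_predict(distance_matrix_km)

print(labels_haversine)
print(labels_precomputed)
```

Both routes should assign the same labels. Points that chain together within 200 m share a cluster, and points with no neighbor inside that radius are labeled -1 (noise).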