Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Clustering geo location coordinates (lat,long pairs) using KMeans algorithm with Python

Using the following code to cluster geolocation coordinates results in 3 clusters:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.vq import kmeans2, whiten

    coordinates= np.array([
               [lat, long],
               [lat, long],
                ...
               [lat, long]
               ])
    x, y = kmeans2(whiten(coordinates), 3, iter = 20)  
    plt.scatter(coordinates[:,0], coordinates[:,1], c=y);
    plt.show()

Is it right to use Kmeans for location clustering, as it uses Euclidean distance and not Haversine formula as a distance function?

like image 225
rok Avatar asked Jul 15 '14 15:07

rok


People also ask

What does K mean in coordinates?

The k-means algorithm groups N observations (i.e., rows in an array of coordinates) into k clusters. However, k-means is not an ideal algorithm for latitude-longitude spatial data because it minimizes variance, not geodetic distance.


1 Answers

k-means is not a good algorithm to use for spatial clustering, for the reasons you meantioned. Instead, you could do this clustering job using scikit-learn's DBSCAN with the haversine metric and ball-tree algorithm.

This tutorial demonstrates clustering latitude-longitude spatial data with DBSCAN/haversine and avoids all those Euclidean-distance problems:

df = pd.read_csv('gps.csv')
coords = df.as_matrix(columns=['lat', 'lon'])
db = DBSCAN(eps=eps, min_samples=ms, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))

Note that this specifically uses scikit-learn v0.15, as some earlier/later versions seem to require a full distance matrix to be computed. Also notice that the eps value is in radians and that .fit() takes the coordinates in radian units for the haversine metric.

like image 164
eos Avatar answered Sep 20 '22 08:09

eos