
Clustering latitude longitude points in Python with fixed number of clusters

k-means does not work properly for geospatial coordinates, even when changing the distance function to haversine as stated here.

I had a look at DBSCAN, which doesn't let me set a fixed number of clusters.

  1. Is there any algorithm (in Python if possible) that takes the same input values as k-means? or
  2. Can I easily convert latitude and longitude to Euclidean coordinates (x, y, z) as done here and do the calculation on my data?

It does not have to be perfectly accurate, but it would be nice if it were.

kev asked Jul 01 '15 06:07


1 Answer

Using raw latitude and longitude leads to problems when your geo data spans a large area, especially since the distance between lines of longitude shrinks toward the poles. To account for this, it is good practice to first convert longitude and latitude to Cartesian coordinates.
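As a rough illustration of the (x, y, z) conversion the question asks about, here is a minimal sketch that maps each point onto a unit sphere. The function name latlon_to_xyz and the sample points are my own placeholders, and a perfectly spherical earth is assumed.

```python
import math

def latlon_to_xyz(lat_deg, lon_deg):
    """Map latitude/longitude in degrees to a point on the unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    x = math.cos(lat) * math.cos(lon)
    y = math.cos(lat) * math.sin(lon)
    z = math.sin(lat)
    return (x, y, z)

# Hypothetical sample points as (lat, lon) in decimal degrees.
points = [(48.8567, 2.3508), (39.8333, -98.5833)]
xyz = [latlon_to_xyz(lat, lon) for lat, lon in points]
```

The straight-line distance between two such points grows monotonically with their great-circle distance, so points that are close on the globe stay close in (x, y, z). Multiply by an earth radius of roughly 6371 km if you want the coordinates in kilometres.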

If your geo data spans the United States, for example, you could define the origin from which to calculate distances as the center of the contiguous United States. I believe this is located at latitude 39 degrees 50 minutes north and longitude 98 degrees 35 minutes west.
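Degrees and minutes have to be turned into signed decimal degrees before they can be used in code. A small sketch of that arithmetic follows; the helper name dm_to_decimal is hypothetical, and the hemispheres (north, west) are my reading of the location quoted above.

```python
def dm_to_decimal(degrees, minutes, hemisphere):
    """Convert degrees and minutes to signed decimal degrees.

    hemisphere is one of 'N', 'S', 'E', 'W'; south and west are negative.
    """
    value = degrees + minutes / 60.0
    return -value if hemisphere in ("S", "W") else value

# Center of the contiguous United States, assuming 39°50' N, 98°35' W.
origin = (dm_to_decimal(39, 50, "N"), dm_to_decimal(98, 35, "W"))
```

Note that 50 minutes is 50/60 of a degree, so the latitude is about 39.83, not 39.50.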

To convert lat/lon to Cartesian coordinates, calculate the haversine distance from every location in your dataset to the defined origin. Again, I suggest latitude 39 degrees 50 minutes and longitude 98 degrees 35 minutes.

You can use the haversine package in Python to calculate these distances:

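from haversine import haversine, Unit

# Center of the contiguous United States: 39°50' N, 98°35' W as (lat, lon) in decimal degrees
origin = (39.8333, -98.5833)
paris = (48.8567, 2.3508)
# Great-circle distance in miles (recent versions of the package take unit= instead of miles=True)
haversine(origin, paris, unit=Unit.MILES)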
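To get the distance for every location in a dataset without any third-party dependency, the haversine formula can be written out directly. This is only a sketch: the function name haversine_km, the mean earth radius of 6371 km, and the sample locations are my assumptions, not part of the package above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean earth radius; a spherical approximation

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

origin = (39.8333, -98.5833)  # 39°50' N, 98°35' W in decimal degrees

# Hypothetical dataset of (lat, lon) locations.
locations = [(40.7128, -74.0060), (34.0522, -118.2437), (41.8781, -87.6298)]

# One distance per location, measured from the chosen origin.
distances = [haversine_km(origin, loc) for loc in locations]
```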

Now you can run k-means on this data to cluster it, assuming the haversine model of the earth is adequate for your needs. If you are doing data analysis and not planning on launching a satellite, I think this should be okay.
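For completeness, here is a minimal, dependency-free sketch of k-means (Lloyd's algorithm) with a fixed number of clusters. It assumes the points are already numeric tuples in a space where Euclidean distance makes sense; the function name kmeans, its parameters, and the sample data are hypothetical. In practice a library implementation such as scikit-learn's KMeans is the more usual choice.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: return (centroids, labels) for a fixed k."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the input points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return centroids, labels

# Hypothetical data: two well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (10.0, 10.0), (10.1, 10.0), (10.0, 10.1)]
centroids, labels = kmeans(points, k=2)
```

The result depends on the random starting centroids, so fixing the seed keeps runs reproducible; trying several seeds and keeping the tightest clustering is a common refinement.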

invoketheshell answered Sep 21 '22 14:09