k-means does not work properly for geospatial coordinates, even when changing the distance function to haversine as stated here.
I had a look at DBSCAN, which doesn't let me set a fixed number of clusters.
It does not have to be perfectly accurate, but it would be nice if it were.
Using raw latitude and longitude leads to problems when your geo data spans a large area, especially since the distance covered by one degree of longitude shrinks toward the poles. To account for this, it is good practice to first convert lon and lat to Cartesian coordinates.
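One common way to do such a conversion is to place each point on a sphere and use its 3D coordinates. The following is only a minimal sketch, assuming a spherical Earth with a mean radius of 6371 km; the function name `latlon_to_cartesian` is made up for illustration and is not part of any library.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical model)

def latlon_to_cartesian(lat_deg, lon_deg, radius=EARTH_RADIUS_KM):
    """Map latitude/longitude in degrees to (x, y, z) on a sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    x = radius * math.cos(lat) * math.cos(lon)
    y = radius * math.cos(lat) * math.sin(lon)
    z = radius * math.sin(lat)
    return (x, y, z)

# Example: Paris as a 3D point, in kilometres from the Earth's centre
paris_xyz = latlon_to_cartesian(48.8567, 2.3508)
```

Straight-line distances between such points grow monotonically with great-circle distance, so a Euclidean method like k-means no longer suffers from the shrinking longitude spacing.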
If your geo data spans the United States, for example, you could define the origin from which to calculate distances as the center of the contiguous United States. I believe this is located at latitude 39 degrees 50 minutes North and longitude 98 degrees 35 minutes West.
To convert lat/lon to Cartesian coordinates, calculate the haversine distance from every location in your dataset to the defined origin. Again, I suggest latitude 39 degrees 50 minutes North and longitude 98 degrees 35 minutes West.
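You can use the haversine package in Python to calculate these distances:
from haversine import haversine, Unit
# 39 deg 50 min N, 98 deg 35 min W in decimal degrees (west longitude is negative)
origin = (39.8333, -98.5833)
paris = (48.8567, 2.3508)
# recent releases of the package take unit=Unit.MILES instead of miles=True
haversine(origin, paris, unit=Unit.MILES)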
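To see what that distance calculation involves, and to apply it to every location at once, here is a rough standard-library sketch of the haversine formula. It assumes a spherical Earth of radius 6371 km; `haversine_km` and the sample `locations` dictionary are illustrative names and data, not part of the haversine package.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical model)

def haversine_km(p, q, radius=EARTH_RADIUS_KM):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    # Clamp to 1.0 to guard against tiny floating-point overshoot
    return 2 * radius * math.asin(math.sqrt(min(1.0, a)))

# Center of the contiguous United States in decimal degrees
origin = (39.8333, -98.5833)

# Illustrative sample locations (not from the original post)
locations = {
    "paris": (48.8567, 2.3508),
    "new_york": (40.7128, -74.0060),
}

# Distance from every location to the origin
distances = {name: haversine_km(origin, loc) for name, loc in locations.items()}
```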
Now you can use k-means on this data to cluster, assuming the haversine model of the earth is adequate for your needs. If you are doing data analysis and not planning on launching a satellite, I think this should be okay.
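For a fixed number of clusters, the k-means step itself can be sketched as below. This is a from-scratch version of Lloyd's algorithm using only the standard library (Python 3.8+ for `math.dist`), not a library API; the function name `kmeans`, its parameters, and the sample points are all illustrative assumptions. In practice a library implementation such as scikit-learn's `KMeans` would usually be the better choice.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm on equal-length tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # seeded random initial centroids
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                updated.append(centroids[c])  # keep an empty cluster in place
        if updated == centroids:
            break  # converged: centroids stopped moving
        centroids = updated
    return centroids, labels

# Illustrative converted coordinates forming two obvious groups
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (10.0, 10.0), (10.1, 9.9), (9.9, 10.2)]
centroids, labels = kmeans(points, k=2)
```

The points passed in would be whatever converted coordinates you produced for your locations, one tuple per location.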