Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Clustering Lat/Longs in a Database

I'm trying to see if anyone knows how to cluster some Lat/Long results, using a database, to reduce the number of results sent over the wire to the application.

There are a number of resources about how to cluster, either on the client side OR in the server (application) side .. but not in the database side :(

This is a similar question, asked by a fellow S.O. member. The solutions are server side based (ie. C# code behind).

Has anyone had any luck or experience with solving this, but in a database? Are there any database guru's out there who are after a hawt and sexy DB challenge?

please help :)

EDIT 1: Clarification - by clustering, i'm hoping to group x number of points into a single point, for an area. So, if i say cluster everything in a 1 mile / 1 km square, then all the results in that 'square' are GROUP'D into a single result (say ... the middle of the square).

EDIT 2: I'm using MS Sql 2008, but i'm open to hearing if there are other solutions in other DB's.

like image 743
Pure.Krome Avatar asked Dec 01 '08 04:12

Pure.Krome


People also ask

How do you store latitude and longitude in a database?

Latitude & longitude values can be represented & stored in a SQL database using decimal points (Decimal degrees) rather than degrees (or Degrees Minutes Seconds). Depending on the accuracy you may require, you can decide how many decimal points you need to keep latitude & longitude values.

What is lat in database?

Lat/Long is a position format (equivalent to UTM, MGRS, NZTM etc), whereas WGS84 is a datum - a reference system that position coordinates are applied to. You need to know both your coordinates and which datum you are using, although for online use it is almost always WGS84.

What is cluster in SQL with example?

A cluster is a schema object that contains data from one or more tables, all of which have one or more columns in common. Oracle Database stores together all the rows from all the tables that share the same cluster key.

How many bytes can store latitude and longitude?

latitude : 4-Byte Floating Point Number Specifies the geographic latitude of the location. longitude : 4-Byte Floating Point Number Specifies the geographic longitude of the location.


3 Answers

I'd probably use a modified* version of k-means clustering using the cartesian (e.g. WGS-84 ECF) coordinates for your points. It's easy to implement & converges quickly, and adapts to your data no matter what it looks like. Plus, you can pick k to suit your bandwidth requirements, and each cluster will have the same number of associated points (mod k).

I'd make a table of cluster centroids, and add a field to the original data table to indicate what cluster it belonged too. You'd obviously want to update the clustering periodically if your data is at all dynamic. I don't know if you could do that with a stored procedure & trigger, but perhaps.

*The "modification" would be to adjust the length of the computed centroid vectors so they'd be on the surface of the earth. Otherwise you'd end up with a bunch of points with negative altitude (when converted back to LLH).

like image 69
Drew Hall Avatar answered Oct 17 '22 10:10

Drew Hall


If you're clustering on geographic location, and I can't imagine it being anything else :-), you could store the "cluster ID" in the database along with the lat/long co-ordinates.

What I mean by that is to divide the world map into (for example) a 100x100 matrix (10,000 clusters) and each co-ordinate gets assigned to one of those clusters.

Then, you can detect very close coordinates by selecting those in the same square and moderately close ones by selecting those in adjacent squares.

The size of your squares (and therefore the number of them) will be decided by how accurate you need the clustering to be. Obviously, if you only have a 2x2 matrix, you could get some clustering of co-ordinates that are a long way apart.

You will always have the edge cases such as two points close together but in different clusters (one northernmost in one cluster, the other southernmost in another) but you could adjust the cluster size OR post-process the results on the client side.

like image 28
paxdiablo Avatar answered Oct 17 '22 08:10

paxdiablo


I did a similar thing for a geographic application where I wanted to ensure I could cache point sets easily. My geohashing code looks like this:

def compute_chunk(latitude, longitude)
  (floor_lon(longitude) * 0x1000) | floor_lat(latitude)
end

def floor_lon(longitude)
  ((longitude + 180) * 10).to_i
end

def floor_lat(latitude)
  ((latitude + 90) * 10).to_i
end

Everything got really easy from there. I had some code for grabbing all of the chunks from a given point to a given radius that would translate into a single memcache multiget (and some code to backfill that when it was missing).

like image 5
Dustin Avatar answered Oct 17 '22 10:10

Dustin