Which GEO implementation to use for millions of points

Question

I am trying to figure out which GEO implementation to use to find the nearest points based on long/lat to a certain point. I will have millions if not billions of different latitude/longitude points that will need to be compared. I have been looking at many different implementations to do the job I need to be done. I have looked into Postgis (looks like it is very popular and performs well), Neo4J (Graph databases are a new concept to me and I am unsure how they perfrom), AWS dynamodb geohash (Scales very well, but only library is written in Java, I am hoping to write a library in node.js), etc but can't figure out which would perform best. I am purely looking into performance opposed to number of features. All I need to be able is to compare one point to all points and find the closest (read operation), and as well, be able to change a point in the database quickly (write operation). Could anyone suggest based on these requirements a good implementation

Patrick · Accepted Answer

PostGIS has several function for geohashing. If you make your strings long enough the search becomes quicker (fewer collisions per box + its 8 neighbours) but the geohash generation slower on inserting new points.

The question is also how accurate you want to be. At increasing latitude, lat/long "distance" deteriorates because a degree of longitude shrinks from about 110km at the Equator to 0 at the poles, while a degree of latitude is always about 110km. At the mid-latitude of 45 degrees a degree of longitude is nearly 79km, giving an error in distance of a factor of 2 (sqr(110/79)). Spherical distance to give you true distance between lat/long pairs is very expensive to calculate (lots of trigonometry going on) and then you geohashing won't work (unless you convert all points to planar coordinates).

A solution that might work is the following:

CREATE INDEX hash8 ON tablename(substring(hash_column FROM 1 FOR 8)). This gives you an index on a box twice as large as your resolution, which helps finding points and reducing the need to search neighbouring hash boxes.
On INSERT of a point, compute its geohash of length 9 (10m resolution approx.) into hash_column, using PostGIS. You could use a BEFORE INSERT TRIGGER here.

In a function:

Given a point, find the nearest point by looking for all points with a geohash value shortened to 8 chars equal to the given points 8-char geohash (hence the index above).
Compute the distance to each of the encountered points using spherical coordinates, keeping the closest point. But since you are only looking for the nearest point (at least initially), do not search on distance using spherical coordinates, but use the below optimization, which should make the search much much faster.
Compute if the given point is closer to the edge of the box determined by the 8-char geohash than the closest computed point. If so, repeat the procedure with the 7-char geohash on all points in its 8 neighbours. This can be highly optimized by computing distances to individual box sides and corners and evaluating only the relevant neighbour hash boxes; I leave this to you to tinker with.

In any case, this will not be particularly speedy. If you are indeed going towards billions of points you might want to think about clustering which has a rather "natural" solution for geohashing (break up your table on substring(hash_column FROM 1 FOR 2) for instance, giving you four quadrants). Just make sure that you account for cross-boundary searches.

Two optimizations can be made fairly quickly:

First, "normalize" your spherical coordinates (meaning: compensate for the reduced length of a degree of longitude with increasing latitude) so that you can search for nearest points using a "pseudo-cartesian" approach. This only works if points are close together, but since you are working with lots of points this should not be a problem. More specifically, this should work for all points in geohash boxes of length 6 or more.

Assuming the WGS84 ellipsoid (which is used in all GPS devices), the Earth's major axis (a) is 6,378,137 meter, with an ellipticity (e2) of 0.00669438. A second of longitude has a length of

longSec := Pi * a * cos(lat) / sqrt(1 - e2 * sqr(sin(lat))) / 180 / 3600

or

longSec := 30.92208078 * cos(lat) / sqrt(1 - 0.00669438 * sqr(sin(lat)))

For a second of latitude:

latSec := 30.870265 - 155.506 * cos(2 * lat) + 0.0003264 + cos(4 * lat)

The correction factor to make your local coordinate system "square" is by multiplying your longitude values by longSec/latSec.

Secondly, since you are looking for the nearest point, do not search on distance because of the computationally expensive square root. Instead, search on the term within the square root, the squared distance if you will, because this has the same property of selecting for the nearest point.

In pseudo-code:

CREATE FUNCTION nearest_point(pt geometry, ptHash8 char(8)) RETURNS integer AS $$
DECLARE
  corrFactor double precision;
  ptLat    double precision;
  ptLong     double precision;
  currPt     record;
  minDist    double precision;
  diffLat    double precision;
  diffLong   double precision;
  minId      integer;
BEGIN
  minDist := 100000000.; -- a large value, 10km (squared)
  ptLat := ST_Y(pt);
  ptLong := ST_X(pt);
  corrFactor := 30.92208078 * cos(radians(ptLat)) / (sqrt(1 - 0.00669438 * power(sin(radians(ptLat)), 2)) *
                (30.870265 - 155.506 * cos(2 * radians(ptLat)) + 0.0003264 + cos(4 * radians(ptLat))));
  FOR currPt IN SELECT * FROM all_points WHERE hash8 = ptHash8
  LOOP
    diffLat := ST_Y(currPt.pt) - ptLat;
    diffLong := (ST_X(currPt.pt) - ptLong) * corrFactor; -- "square" things out
    IF (diffLat * diffLat) < (minDist * diffLong * diffLong) THEN -- no divisions here to speed thing up a little further
      minDist := (diffLat * diffLat) / (diffLong * diffLong); -- this does not happen so often
      minId := currPt.id;
    END IF;
  END LOOP;
  IF minDist < 100000000. THEN
    RETURN minId;
  ELSE
    RETURN NULL;
  END IF;
END; $$ LANGUAGE PLPGSQL STRICT;

Needless to say, this would be a lot faster in a C language function. Also, do not forget to do boundary checking to see if neighbouring geohash boxes need to be searched.

Incidentally, "spatial purists" would not index on the 8-char geohash and search from there; instead they would start from the 9-char hash and work outwards from there. However, a "miss" in your initial hash box (because there are no other points or you are close to a hash box side) is expensive because you have to start computing distances to neighbouring hash boxes and pull in more data. In practice you should work from a hash box which is about twice the size of the typical nearest point; what that distance is depends on your point set.

Which GEO implementation to use for millions of points

Tags:

postgresql

amazon-dynamodb

postgis

geospatial

neo4j

user2924127

1 Answers

Patrick

Recent Activity

Donate For Us

Which GEO implementation to use for millions of points

Tags:

postgresql

amazon-dynamodb

postgis

geospatial

neo4j

user2924127

1 Answers

Patrick

Related questions

Recent Activity

Donate For Us