
SQL Uniform distribution of points

Tags:

sql

database

I have a long table with geolocated points:

id      lat        lon
------------------------------
1     39.4600    110.3523410
2     39.4601    110.3523410
3     39.4605    110.3523410
4     39.4609    110.3523410

Many of these points overlap when shown on a map, because they are very close together. How could one get a uniform distribution of points? That is, a subset of points in which the distance between any two of them is greater than a given threshold.

For example, the latitude difference between point 1 and point 2 is 0.0001. Can I get a result containing only points separated by more than 0.0003 (or any other amount)?

With a geospatial database this would be easy, but with plain SQL it does not seem an obvious task (to me, at least).

Thanks, Javier

Asked Nov 09 '11 by Javier

2 Answers

The fastest way to do this (roughly) is to assign every point to a grid square and then keep only one point per square. This is much more efficient than the other techniques listed:

SELECT DISTINCT ROUND(lat*250, 0), ROUND(long*250, 0) FROM sometable;

You may want to average the locations in each grid square:

SELECT AVG(lat), AVG(long)
FROM sometable
GROUP BY ROUND(lat*250, 0), ROUND(long*250, 0);

To control the granularity of the grouping, just change the scaling factor up or down from 250.
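To make the connection to the question's threshold concrete: a factor of 250 means each cell is 1/250 = 0.004 degrees wide, so a target spacing of 0.0003 degrees corresponds to a factor of roughly 1/0.0003 ≈ 3333. A sketch, reusing the hypothetical sometable from above (note that two kept points in adjacent cells can still be slightly closer than one cell width, so this is approximate):

```sql
-- Cell width 1/3333 ≈ 0.0003 degrees; one averaged point per cell.
SELECT AVG(lat) AS lat, AVG(long) AS long
FROM sometable
GROUP BY ROUND(lat*3333, 0), ROUND(long*3333, 0);
```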

An alternative (and slower) approach is to do a CROSS JOIN so that every point is paired with every other point, and then use a distance formula to mark pairs that fall below a minimum threshold. If the distance formula feels too complex, a simpler way is to restrict the join to pairs where ABS(a.long - b.long) < 0.1 AND ABS(a.lat - b.lat) < 0.1. That will identify points that are close together.
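A minimal sketch of that pairwise query, assuming the same hypothetical sometable with an id column (the 0.1 thresholds are illustrative):

```sql
-- List every point that has a close neighbor with a smaller id.
-- Such points are candidates for removal, keeping the lowest-id
-- point of each close group.
SELECT DISTINCT a.id
FROM sometable a
JOIN sometable b
  ON b.id < a.id
 AND ABS(a.long - b.long) < 0.1
 AND ABS(a.lat  - b.lat)  < 0.1;
```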

Note that a cross join is an O(n**2) operation, so there may be issues if you try to scale this approach to many points. The solution is to pre-group the points into smaller regions and run the cross joins over just the points within each region.
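That pre-grouping can be sketched by reusing the grid idea from above: bin each coordinate into a cell index and only compare points in the same or an adjacent cell (the 0.1-degree cell size is an assumption matching the threshold):

```sql
-- Restrict the pairwise comparison to nearby grid cells, so the join
-- no longer touches every pair of points. FLOOR() bins a coordinate
-- into an integer cell index; candidates must be within one cell.
SELECT a.id, b.id
FROM sometable a
JOIN sometable b
  ON b.id < a.id
 AND FLOOR(b.lat  / 0.1) BETWEEN FLOOR(a.lat  / 0.1) - 1 AND FLOOR(a.lat  / 0.1) + 1
 AND FLOOR(b.long / 0.1) BETWEEN FLOOR(a.long / 0.1) - 1 AND FLOOR(a.long / 0.1) + 1
WHERE ABS(a.long - b.long) < 0.1
  AND ABS(a.lat  - b.lat)  < 0.1;
```

With an index on the binned coordinates (or on (lat, long)), this keeps each point's candidate set small instead of scanning all n points per row.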

If you can do any work outside of SQL, it may be more appropriate to use a clustering algorithm.

Answered Oct 11 '22 by Raymond Hettinger


If you want to suppress the pseudo-duplicates, you need something like:

SELECT * FROM mytable a
WHERE NOT EXISTS ( SELECT *
    FROM mytable b
    WHERE ABS (a.long - b.long) < 0.01
      AND ABS (a.lat - b.lat) < 0.02
      AND b.id < a.id
    );

UPDATE: (added data, index, and a query without ABS())

DROP TABLE tmp.mytable;

CREATE TABLE tmp.mytable
    ( id INTEGER NOT NULL PRIMARY KEY
    , zlat REAL NOT NULL
    , zlong  REAL NOT NULL
    );
INSERT INTO tmp.mytable (id, zlat, zlong)
    SELECT generate_series(1,10000), 0.0, 0.0
    ;

SET search_path=tmp;
UPDATE tmp.mytable SET zlat = 39.0 + random() ;

UPDATE tmp.mytable SET zlong = 110.0 + random() ;

CREATE INDEX latlong ON tmp.mytable (zlat, zlong);

VACUUM ANALYZE tmp.mytable;
/***/
SET search_path=tmp;


EXPLAIN ANALYZE
SELECT * FROM mytable a
WHERE NOT EXISTS ( SELECT *
    FROM mytable b
    WHERE 1=1
      AND ABS (a.zlong - b.zlong) < 0.01
      AND ABS (a.zlat - b.zlat) < 0.02
      AND b.id < a.id
    );

EXPLAIN ANALYZE
SELECT * FROM mytable a
WHERE NOT EXISTS ( SELECT *
    FROM mytable b
    WHERE 1=1
      AND a.zlong - b.zlong < 0.01  AND b.zlong - a.zlong < 0.01
      AND a.zlat - b.zlat < 0.02  AND b.zlat - a.zlat < 0.02
      AND b.id < a.id
    );

The query plan shows that only the primary-key index is actually used. Basically the same plan is generated when the ABS() is replaced by "(a-b) < 0.0x AND (b-a) < 0.0x", but the ABS() version actually is faster (3967 ms vs 4211 ms below).

---------------------------------------------
 Nested Loop Anti Join  (cost=0.00..1448079.64 rows=9630 width=12) (actual time=0.151..3966.487 rows=1288 loops=1)
   Join Filter: ((abs((a.zlong - b.zlong)) < 0.01::double precision) AND (abs((a.zlat - b.zlat)) < 0.02::double precision))
   ->  Seq Scan on mytable a  (cost=0.00..263.00 rows=10000 width=12) (actual time=0.139..3.463 rows=10000 loops=1)
   ->  Index Scan using mytable_pkey on mytable b  (cost=0.00..58.68 rows=3333 width=12) (actual time=0.005..0.173 rows=1084 loops=10000)
         Index Cond: (b.id < a.id)
 Total runtime: 3966.853 ms
(6 rows)

---------------------------------------------
 Nested Loop Anti Join  (cost=0.00..1663497.55 rows=9959 width=12) (actual time=0.065..4210.616 rows=1288 loops=1)
   Join Filter: (((a.zlong - b.zlong) < 0.01::double precision) AND ((b.zlong - a.zlong) < 0.01::double precision) AND ((a.zlat - b.zlat) < 0.02::double precision) AND ((b.zlat - a.zlat) < 0.02::double precision))
   ->  Seq Scan on mytable a  (cost=0.00..263.00 rows=10000 width=12) (actual time=0.060..2.840 rows=10000 loops=1)
   ->  Index Scan using mytable_pkey on mytable b  (cost=0.00..58.68 rows=3333 width=12) (actual time=0.005..0.173 rows=1084 loops=10000)
         Index Cond: (b.id < a.id)
 Total runtime: 4210.904 ms
(6 rows)

Answered Oct 11 '22 by wildplasser