I'm running Postgres 9.6.1 and PostGIS 2.3.0 r15146 and have two tables.geographies
may have 150,000,000 rows, paths
may have 10,000,000 rows:
CREATE TABLE paths (id uuid NOT NULL, path path NOT NULL, PRIMARY KEY (id))
CREATE TABLE geographies (id uuid NOT NULL, geography geography NOT NULL, PRIMARY KEY (id))
Given an array/set of ids
for table geographies
, what is the "best" way of finding all intersecting paths and geometries?
In other words, if an initial geography
has a corresponding intersecting path
we need to also find all other geographies
that this path
intersects. From there, we need to find all other paths
that these newly found geographies
intersect, and so on until we've found all possible intersections.
The initial geography ids (our input) may be anywhere from 0 to 700. With an average around 40.
Minimum intersections will be 0, max will be about 1000. Average likely around 20, typically less than 100 connected.
I've created a function that does this, but I'm new to GIS in PostGIS, and Postgres in general. I've posted my solution as an answer to this question.
I feel like there should be a more eloquent and faster way of doing this than what I've come up with.
Your function can be radically simplified.
I suggest you convert the column paths.path
to data type geography
(or at least geometry
). path
is a native Postgres type and does not play well with PostGIS functions and spatial indexes. You would have to cast path::geometry
or path::geometry::geography
(resulting in a LINESTRING
internally) to make it work with PostGIS functions like ST_Intersects()
.
My answer is based on these adapted tables:
CREATE TABLE paths (
id uuid PRIMARY KEY
, path geography NOT NULL
);
CREATE TABLE geographies (
id uuid PRIMARY KEY
, geography geography NOT NULL
, fk_id text NOT NULL
);
Everything works with data type geometry
for both columns just as well. geography
is generally more exact but also more expensive. Which to use? Read the PostGIS FAQ here.
CREATE OR REPLACE FUNCTION public.function_name(_fk_ids text[])
RETURNS TABLE(id uuid, type text)
LANGUAGE plpgsql AS
$func$
DECLARE
_row_ct int;
_loop_ct int := 0;
BEGIN
CREATE TEMP TABLE _geo ON COMMIT DROP AS -- dropped at end of transaction
SELECT DISTINCT ON (g.id) g.id, g.geography, _loop_ct AS loop_ct -- dupes possible?
FROM geographies g
WHERE g.fk_id = ANY(_fk_ids);
GET DIAGNOSTICS _row_ct = ROW_COUNT;
IF _row_ct = 0 THEN -- no rows found, return empty result immediately
RETURN; -- exit function
END IF;
CREATE TEMP TABLE _path ON COMMIT DROP AS
SELECT DISTINCT ON (p.id) p.id, p.path, _loop_ct AS loop_ct
FROM _geo g
JOIN paths p ON ST_Intersects(g.geography, p.path); -- no dupes yet
GET DIAGNOSTICS _row_ct = ROW_COUNT;
IF _row_ct = 0 THEN -- no rows found, return _geo immediately
RETURN QUERY SELECT g.id, text 'geo' FROM _geo g;
RETURN;
END IF;
ALTER TABLE _geo ADD CONSTRAINT g_uni UNIQUE (id); -- required for UPSERT
ALTER TABLE _path ADD CONSTRAINT p_uni UNIQUE (id);
LOOP
_loop_ct := _loop_ct + 1;
INSERT INTO _geo(id, geography, loop_ct)
SELECT DISTINCT ON (g.id) g.id, g.geography, _loop_ct
FROM _paths p
JOIN geographies g ON ST_Intersects(g.geography, p.path)
WHERE p.loop_ct = _loop_ct - 1 -- only use last round!
ON CONFLICT ON CONSTRAINT g_uni DO NOTHING; -- eliminate new dupes
EXIT WHEN NOT FOUND;
INSERT INTO _path(id, path, loop_ct)
SELECT DISTINCT ON (p.id) p.id, p.path, _loop_ct
FROM _geo g
JOIN paths p ON ST_Intersects(g.geography, p.path)
WHERE g.loop_ct = _loop_ct - 1
ON CONFLICT ON CONSTRAINT p_uni DO NOTHING;
EXIT WHEN NOT FOUND;
END LOOP;
RETURN QUERY
SELECT g.id, text 'geo' FROM _geo g
UNION ALL
SELECT p.id, text 'path' FROM _path p;
END
$func$;
Call:
SELECT * FROM public.function_name('{foo,bar}');
Much faster than what you have.
You based queries on the whole set, instead of the latest additions to the set only. This gets increasingly slower with every loop without need. I added a loop counter (loop_ct
) to avoid redundant work.
Be sure to have spatial GiST indexes on geographies.geography
and paths.path
:
CREATE INDEX geo_geo_gix ON geographies USING GIST (geography);
CREATE INDEX paths_path_gix ON paths USING GIST (path);
Since Postgres 9.5 index-only scans would be an option for GiST indexes. You might add id
as second index column. The benefit depends on many factors, you'd have to test. However, there is no fitting operator GiST class for the uuid
type. It would work with bigint
after installing the extension btree_gist:
Postgres multi-column index (integer, boolean, and array)
Multicolumn index on 3 fields with heterogenous data types
Have a fitting index on g.fk_id
, too. Again, a multicolumn index on (fk_id, id, geography)
might pay if you can get index-only scans out of it. Default btree index, fk_id
must be first index column. Especially if you run the query often and rarely update the table and table rows are much wider than the index.
You can initialize variables at declaration time. Only needed once after the rewrite.
ON COMMIT DROP
drops the temp tables at the end of the transaction automatically. So I removed dropping tables explicitly. But you get an exception if you call the function in the same transaction twice. In the function I would check for existence of the temp table and use TRUNCATE
in this case. Related:
How to check if a table exists in a given schema
Use GET DIAGNOSTICS
to get the row count instead of running another query for the count.
Count rows affected by DELETE
You need GET DIAGNOSTICS
. CREATE TABLE
does not set FOUND
(as is mentioned in the manual).
It's faster to add an index or PK / UNIQUE constraint after filling the table. And not before we actually need it.
ON CONFLICT ... DO ...
is the simpler and cheaper way for UPSERT since Postgres 9.5.
How to UPSERT (MERGE, INSERT ... ON DUPLICATE UPDATE) in PostgreSQL?
For the simple form of the command you just list index columns or expressions (like ON CONFLICT (id) DO ...
) and let Postgres perform unique index inference to determine an arbiter constraint or index. I later optimized by providing the constraint directly. But for this we need an actual constraint - a unique index is not enough. Fixed accordingly. Details in the manual here.
It may help to ANALYZE
temporary tables manually to help Postgres find the best query plan. (But I don't think you need it in your case.)
Are regular VACUUM ANALYZE still recommended under 9.1?
_geo_ct - _geographyLength > 0
is an awkward and more expensive way of saying _geo_ct > _geographyLength
. But that's gone completely now.
Don't quote the language name. Just LANGUAGE plpgsql
.
Your function parameter is varchar[]
for an array of fk_id
, but you later commented:
It is a
bigint
field that represents a geographic area (it's actually a precomputeds2cell
id at level 15).
I don't know s2cell
id at level 15, but ideally you pass an array of matching data type, or if that's not an option default to text[]
.
Also since you commented:
There are always exactly 13
fk_id
s passed in.
This seems like a perfect use case for a VARIADIC
function parameter. So your function definition would be:
CREATE OR REPLACE FUNCTION public.function_name(_fk_ids VARIADIC text[]) ...
Details:
It's hard to wrap an rCTE around two alternating loops, but possible with some SQL finesse:
WITH RECURSIVE cte AS (
SELECT g.id, g.geography::text, NULL::text AS path, text 'geo' AS type
FROM geographies g
WHERE g.fk_id = ANY($kf_ids) -- your input array here
UNION
SELECT p.id, g.geography::text, p.path::text
, CASE WHEN p.path IS NULL THEN 'geo' ELSE 'path' END AS type
FROM cte c
LEFT JOIN paths p ON c.type = 'geo'
AND ST_Intersects(c.geography::geography, p.path)
LEFT JOIN geographies g ON c.type = 'path'
AND ST_Intersects(g.geography, c.path::geography)
WHERE (p.path IS NOT NULL OR g.geography IS NOT NULL)
)
SELECT id, type FROM cte;
That's all.
You need the same indexes as above. You might wrap it into an SQL function for repeated use.
The cast to text
is necessary because the geography
type is not "hashable" (same for geometry
). (See this open PostGIS issue for details.) Work around it by casting to text
. Rows are unique by virtue of (id, type)
alone, we can ignore the geography
columns for this. Cast back to geography
for the join. Shouldn't cost too much extra.
We need two LEFT JOIN
so not to exclude rows, because at each iteration only one of the two tables may contribute more rows.
The final condition makes sure we are not done, yet:
WHERE (p.path IS NOT NULL OR g.geography IS NOT NULL)
This works because duplicate findings are excluded from the temporary intermediate table. The manual:
For
UNION
(but notUNION ALL
), discard duplicate rows and rows that duplicate any previous result row. Include all remaining rows in the result of the recursive query, and also place them in a temporary intermediate table.
The rCTE is probably faster than the function for small result sets. The temp tables and indexes in the function mean considerably more overhead. For large result sets the function may be faster, though. Only testing with your actual setup can give you a definitive answer.*
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With