Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Poor DB Performance when using ORDER BY

I'm working with a non-profit that is mapping out solar potential in the US. Needless to say, we have a ridiculously large PostgreSQL 9 database. Running a query like the one shown below is speedy until the order by line is uncommented, in which case the same query takes forever to run (185 ms without sorting compared to 25 minutes with). What steps should be taken to ensure this and other queries run in a more manageable and reasonable amount of time?

select  A.s_oid, A.s_id, A.area_acre, A.power_peak, A.nearby_city, A.solar_total 
from global_site A cross join na_utility_line B
where (A.power_peak between 1.0 AND  100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06'  AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
--order by A.area_acre
offset 0 limit 11;
like image 984
Chris Cashwell Avatar asked Feb 24 '23 08:02

Chris Cashwell


1 Answers

The sort is not the problem - in fact the CPU and memory cost of the sort is close to zero since Postgres has Top-N sort where the result set is scanned while keeping up to date a small sort buffer holding only the Top-N rows.

select count(*) from (1 million row table)               -- 0.17 s
select * from (1 million row table) order by x limit 10; -- 0.18 s
select * from (1 million row table) order by x;          -- 1.80 s

So you see the Top-10 sorting only adds 10 ms to a dumb fast count(*) versus a lot longer for a real sort. That's a very neat feature, I use it a lot.

OK now without EXPLAIN ANALYZE it's impossible to be sure, but my feeling is that the real problem is the cross join. Basically you're filtering the rows in both tables using :

where (A.power_peak between 1.0 AND  100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06'  AND A.fips_country = 'US' AND A.fips_level2 = '025')

and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'

OK. I don't know how many rows are selected in both tables (only EXPLAIN ANALYZE would tell), but it's probably significant. Knowing those numbers would help.

Then we got the worst case CROSS JOIN condition ever :

and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))

This means all rows of A are matched against all rows of B (so, this expression is going to be evaluated a large number of times), using a bunch of pretty complex, slow, and cpu-intensive functions.

Of course it's horribly slow !

When you remove the ORDER BY, postgres just comes up (by chance ?) with a bunch of matching rows right at the start, outputs those, and stops since the LIMIT is reached.

Here's a little example :

Tables a and b are identical and contain 1000 rows, and a column of type BOX.

select * from a cross join b where (a.b && b.b)     --- 0.28 s

Here 1000000 box overlap (operator &&) tests are completed in 0.28s. The test data set is generated so that the result set contains only 1000 rows.

create index a_b on a using gist(b);
create index b_b on a using gist(b);
select * from a cross join b where (a.b && b.b)     --- 0.01 s

Here the index is used to optimize the cross join, and speed is ridiculous.

You need to optimize that geometry matching.

  • add columns which will cache :
    • ST_Centroid(A.wkb_geometry)
    • ST_Buffer((B.wkb_geometry), 1000)

There is NO POINT in recomputing those slow functions a million times during your CROSS JOIN, so store the results in a column. Use a trigger to keep them up to date.

  • add columns of type BOX which will cache :

    • Bounding Box of ST_Centroid(A.wkb_geometry)
    • Bounding Box of ST_Buffer((B.wkb_geometry), 1000)
  • add gist indexes on the BOXes

  • add a Box overlap test (using the && operator) which will use the index

  • keep your ST_Within which will act as a final filter on the rows that pass

Maybe you can just index the ST_Centroid and ST_Buffer columns... and use an (indexed) "contains" operator, see here :

http://www.postgresql.org/docs/8.2/static/functions-geometry.html

like image 141
bobflux Avatar answered Mar 08 '23 13:03

bobflux