 

Why does the following Postgres SQL query take so long?

Tags:

sql

postgresql

The raw query is as follows:

SELECT "TIME", "TRADEPRICE"
  FROM "YEAR"
  WHERE "DATE" = '2010-03-01'
  AND "SECURITY" = 'STW.AX'
  AND "TIME" < '10:16:00'
  AND "TYPE" = 'TRADE'
  ORDER BY "TIME" ASC LIMIT 3

I have built three indexes as follows:

Columns "DATE" DESC NULLS LAST
Columns "SECURITY" DESC NULLS LAST
Columns "TIME" DESC NULLS LAST

I do not index TYPE because it only takes one of two possible values.

EXPLAIN ANALYZE produces the following:

"Limit  (cost=50291.28..50291.28 rows=3 width=16) (actual time=1794484.566..1794484.567 rows=3 loops=1)"
"  ->  Sort  (cost=50291.28..50291.29 rows=4 width=16) (actual time=1794484.562..1794484.563 rows=3 loops=1)"
"        Sort Key: "TIME""
"        Sort Method:  top-N heapsort  Memory: 25kB"
"        ->  Bitmap Heap Scan on "YEAR"  (cost=48569.54..50291.24 rows=4 width=16) (actual time=1794411.662..1794484.498 rows=20 loops=1)"
"              Recheck Cond: (("SECURITY" = 'STW.AX'::bpchar) AND ("DATE" = '2010-03-01'::date))"
"              Filter: (("TIME" < '10:16:00'::time without time zone) AND ("TYPE" = 'TRADE'::bpchar))"
"              ->  BitmapAnd  (cost=48569.54..48569.54 rows=430 width=0) (actual time=1794411.249..1794411.249 rows=0 loops=1)"
"                    ->  Bitmap Index Scan on security_desc  (cost=0.00..4722.94 rows=166029 width=0) (actual time=1793917.506..1793917.506 rows=1291933 loops=1)"
"                          Index Cond: ("SECURITY" = 'STW.AX'::bpchar)"
"                    ->  Bitmap Index Scan on date_desc  (cost=0.00..43846.35 rows=2368764 width=0) (actual time=378.698..378.698 rows=2317130 loops=1)"
"                          Index Cond: ("DATE" = '2010-03-01'::date)"
"Total runtime: 1794485.224 ms"

The table has about 1 billion rows and runs on a Core2Quad with 8 GB of RAM on 64-bit Ubuntu. Surely this query should not take half an hour.

asked May 30 '11 by deltanovember

2 Answers

The table has about 1 billion rows and runs on a Core2Quad with 8 GB of RAM on 64-bit Ubuntu. Surely this query should not take half an hour.

It's taking half an hour because of the way you set up your indexes.

Your query has no multi-column index that it can use to head straight to the needed rows. It does the next best thing: a bitmap index scan on two barely selective indexes, followed by a top-3 heapsort of the resulting set.

The two indexes in question, on the security and on the date, yield 1.3M and 2.3M rows respectively. Combining them will be excruciatingly slow because you're randomly looking up over a million rows and filtering each one.

Adding insult to injury, your data structure stores two highly correlated fields (date and time) separately. This confuses the query planner, because Postgres does not collect cross-column correlation statistics. Your queries will thus almost always end up filtering through enormous sets of rows and then ordering the filtered set on a separate criterion.

I'd suggest the following changes:

  1. Alter the table and add a datetime column, of type timestamp with time zone. Combine your date and time columns into it.

  2. Drop the date and time fields accordingly, as well as the indexes on them. Also drop the index on security.

  3. Create an index on (security, datetime). (And don't mess around with nulls first/nulls last unless your ordering criteria contain those clauses too.)

  4. At your option, add a separate index on (datetime) or on (datetime, security), if you ever need to execute queries that do stats on all trades in a day or date range.

  5. VACUUM ANALYZE the whole mess once you're done with the above.
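A sketch of those steps in SQL, assuming the index names from the plan (security_desc, date_desc) plus a time_desc index, and illustrative names for the new column and indexes. Note that `date + time` yields a plain timestamp, which Postgres casts to timestamp with time zone using the session time zone on assignment:

```sql
-- 1. Add a combined timestamp column and populate it from the old columns.
ALTER TABLE "YEAR" ADD COLUMN "DATETIME" timestamp with time zone;
UPDATE "YEAR" SET "DATETIME" = "DATE" + "TIME";

-- 2. Drop the old indexes and columns.
DROP INDEX security_desc;
DROP INDEX date_desc;
DROP INDEX time_desc;
ALTER TABLE "YEAR" DROP COLUMN "DATE", DROP COLUMN "TIME";

-- 3. Multi-column index matching the query's filter and sort order.
CREATE INDEX year_security_datetime ON "YEAR" ("SECURITY", "DATETIME");

-- 4. Optional: for stats over all trades in a day or date range.
CREATE INDEX year_datetime ON "YEAR" ("DATETIME");

-- 5. Reclaim space and refresh planner statistics.
VACUUM ANALYZE "YEAR";
```

On a billion-row table the UPDATE and index builds will take a while; doing them in one maintenance window avoids rewriting the table twice.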

You'll then be able to rewrite your query like so:

SELECT "TIME", "TRADEPRICE"
  FROM "YEAR"
  WHERE '2010-03-01 00:00:00' <= "DATETIME" AND "DATETIME" < '2010-03-01 10:16:00'
  AND "SECURITY"='STW.AX' 
  AND "TYPE" = 'TRADE'
  ORDER BY "DATETIME" ASC LIMIT 3

This will yield the optimal plan: retrieving the top-3 rows from a filtered index scan on (security, datetime), which I'd expect (since you have a billion rows) to take 25ms at most.

answered Oct 13 '22 by Denis de Bernardy


Add a composite index of many of the search terms together, eg ON YEAR (TYPE, SECURITY, DATE, TIME). Then the database can look up in a single index to match all of them, instead of having to search multiple indexes and collate all the results together (Bitmap Index Scan).
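For instance, a single composite index covering all four filter/sort columns might look like this (the index name is illustrative):

```sql
CREATE INDEX year_type_security_date_time
    ON "YEAR" ("TYPE", "SECURITY", "DATE", "TIME");
```

With this index, the equality conditions on TYPE, SECURITY, and DATE pin down a contiguous range of index entries, and the trailing TIME column lets the LIMIT 3 be satisfied without a separate sort.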

Exactly which columns to include (e.g. TYPE or not?) and their order in the index depend on your data characteristics and on what other queries you run (you get to reuse any left prefix of a composite index for free), so experiment a bit. To let the index satisfy the ordering, keep the ORDER BY column as the last index column/direction.

You may also want to ANALYZE to update stats for the query planner as some of the number-of-rows guesses seem to be a bit off.
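Assuming the table name from the question:

```sql
ANALYZE "YEAR";  -- refresh planner statistics for this table

-- then re-run the EXPLAIN ANALYZE from the question
-- to see whether the row estimates line up with the actual counts
```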

answered Oct 13 '22 by bobince