I have the following PostgreSQL table with about 67 million rows, which stores the EOD prices for all the US stocks starting in 1985:
Table "public.eods"
Column | Type | Collation | Nullable | Default
--------+-----------------------+-----------+----------+---------
stk | character varying(16) | | not null |
dt | date | | not null |
o | integer | | not null |
hi | integer | | not null |
lo | integer | | not null |
c | integer | | not null |
v | integer | | |
Indexes:
"eods_pkey" PRIMARY KEY, btree (stk, dt)
"eods_dt_idx" btree (dt)
I would like to query efficiently the table above based on either the stock name or the date. The primary key of the table is stock name and date. I have also defined an index on the date column, hoping to improve performance for queries that retrieve all the records for a specific date.
Unfortunately, I see a big difference in performance for the queries below. While getting all the records for a specific stock takes a decent amount of time to complete (2 seconds), getting all the records for a specific date takes much longer (about 56 seconds). I have tried to analyze these queries using explain analyze
, and I have got the results below:
explain analyze select * from eods where stk='MSFT';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on eods (cost=169.53..17899.61 rows=4770 width=36) (actual time=207.218..2142.215 rows=8364 loops=1)
Recheck Cond: ((stk)::text = 'MSFT'::text)
Heap Blocks: exact=367
-> Bitmap Index Scan on eods_pkey (cost=0.00..168.34 rows=4770 width=0) (actual time=187.844..187.844 rows=8364 loops=1)
Index Cond: ((stk)::text = 'MSFT'::text)
Planning Time: 577.906 ms
Execution Time: 2143.101 ms
(7 rows)
explain analyze select * from eods where dt='2010-02-22';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Index Scan using eods_dt_idx on eods (cost=0.56..25886.45 rows=7556 width=36) (actual time=40.047..56963.769 rows=8143 loops=1)
Index Cond: (dt = '2010-02-22'::date)
Planning Time: 67.876 ms
Execution Time: 56970.499 ms
(4 rows)
I really cannot understand why the second query runs 28 times slower than the first query. They retrieve a similar number of records, they both seem to be using an index. So could somebody please explain to me why this difference in performance, and can I do something to improve the performance of the queries that retrieve all the records for a specific date?
I would guess that this has to do with the data layout. I am guessing that you are loading the data by stk
, so the rows for a given stk
are on a handful of pages that pretty much only contain that stk
.
So, the execution engine is only reading about 25 pages.
On the other hand, no single page contains two records for the same date. When you read by date, you have to read about 7,556 pages. That is, about 300 times the number of pages.
The scaling must also take into account the work for loading and reading the index. This should be about the same for the two queries, so the ratio is less than a factor of 300.
There can be more issues - so it is hard to say where is a problem. Index scan should be usually faster, than bitmap heap scan - if not, then there can be following problems:
REINDEX INDEX indexname
ANALYZE tablename
VACUUM tablename
effective_cache_size
random_page_cost
Investigation what is a issue is little bit alchemy - but it is possible - there are only closed set of very probably issues. Good start is
VACUUM ANALYZE tablename
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With