Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiency problem querying postgresql table

Tags:

sql

postgresql

I have the following PostgreSQL table with about 67 million rows, which stores the EOD prices for all the US stocks starting in 1985:

                       Table "public.eods"
 Column |         Type          | Collation | Nullable | Default 
--------+-----------------------+-----------+----------+---------
 stk    | character varying(16) |           | not null | 
 dt     | date                  |           | not null | 
 o      | integer               |           | not null | 
 hi     | integer               |           | not null | 
 lo     | integer               |           | not null | 
 c      | integer               |           | not null | 
 v      | integer               |           |          | 
Indexes:
    "eods_pkey" PRIMARY KEY, btree (stk, dt)
    "eods_dt_idx" btree (dt)

I would like to query efficiently the table above based on either the stock name or the date. The primary key of the table is stock name and date. I have also defined an index on the date column, hoping to improve performance for queries that retrieve all the records for a specific date.

Unfortunately, I see a big difference in performance for the queries below. While getting all the records for a specific stock takes a decent amount of time to complete (2 seconds), getting all the records for a specific date takes much longer (about 56 seconds). I have tried to analyze these queries using explain analyze, and I have got the results below:

explain analyze select * from eods where stk='MSFT';
                                                          QUERY PLAN                                                          
------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on eods  (cost=169.53..17899.61 rows=4770 width=36) (actual time=207.218..2142.215 rows=8364 loops=1)
   Recheck Cond: ((stk)::text = 'MSFT'::text)
   Heap Blocks: exact=367
   ->  Bitmap Index Scan on eods_pkey  (cost=0.00..168.34 rows=4770 width=0) (actual time=187.844..187.844 rows=8364 loops=1)
         Index Cond: ((stk)::text = 'MSFT'::text)
 Planning Time: 577.906 ms
 Execution Time: 2143.101 ms
(7 rows)
explain analyze select * from eods where dt='2010-02-22';
                                                            QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 Index Scan using eods_dt_idx on eods  (cost=0.56..25886.45 rows=7556 width=36) (actual time=40.047..56963.769 rows=8143 loops=1)
   Index Cond: (dt = '2010-02-22'::date)
 Planning Time: 67.876 ms
 Execution Time: 56970.499 ms
(4 rows)

I really cannot understand why the second query runs 28 times slower than the first query. They retrieve a similar number of records, they both seem to be using an index. So could somebody please explain to me why this difference in performance, and can I do something to improve the performance of the queries that retrieve all the records for a specific date?

like image 241
ZX2XZX2 Avatar asked Mar 03 '23 19:03

ZX2XZX2


2 Answers

I would guess that this has to do with the data layout. I am guessing that you are loading the data by stk, so the rows for a given stk are on a handful of pages that pretty much only contain that stk.

So, the execution engine is only reading about 25 pages.

On the other hand, no single page contains two records for the same date. When you read by date, you have to read about 7,556 pages. That is, about 300 times the number of pages.

The scaling must also take into account the work for loading and reading the index. This should be about the same for the two queries, so the ratio is less than a factor of 300.

like image 50
Gordon Linoff Avatar answered Apr 01 '23 02:04

Gordon Linoff


There can be more issues - so it is hard to say where is a problem. Index scan should be usually faster, than bitmap heap scan - if not, then there can be following problems:

  • unhealthy index - try to run REINDEX INDEX indexname
  • bad statistics - try to run ANALYZE tablename
  • suboptimal state of table - try to run VACUUM tablename
  • too low, or to high setting of effective_cache_size
  • issues with IO - some systems has a problem with high random IO, try to increase random_page_cost

Investigation what is a issue is little bit alchemy - but it is possible - there are only closed set of very probably issues. Good start is

  • VACUUM ANALYZE tablename
  • benchmark your IO if it is possible (like bonie++)
like image 26
Pavel Stehule Avatar answered Apr 01 '23 04:04

Pavel Stehule