PostgreSQL query not using index in production

I'm noticing something strange:

The exact same query in development and production is not using the same query plan. In particular, the development version uses indexes that are omitted in production (in favor of a seq scan).

The only real difference is that the production dataset is significantly larger: the index is 1034 MB in production vs 29 MB in development. Would PostgreSQL abstain from using indexes if they (or the table) are too big?

EDIT: EXPLAIN ANALYZE for both queries:

Development:

Limit  (cost=41638.15..41638.20 rows=20 width=154) (actual time=159.576..159.581 rows=20 loops=1)
  ->  Sort  (cost=41638.15..41675.10 rows=14779 width=154) (actual time=159.575..159.577 rows=20 loops=1)
        Sort Key: (sum(scenario_ad_group_performances.clicks))
        Sort Method: top-N heapsort  Memory: 35kB
        ->  GroupAggregate  (cost=0.00..41244.89 rows=14779 width=154) (actual time=0.040..151.535 rows=14197 loops=1)
              ->  Nested Loop Left Join  (cost=0.00..31843.75 rows=93800 width=154) (actual time=0.022..82.509 rows=50059 loops=1)
                    ->  Merge Left Join  (cost=0.00..4203.46 rows=14779 width=118) (actual time=0.017..27.103 rows=14197 loops=1)
                          Merge Cond: (scenario_ad_groups.id = scenario_ad_group_vendor_instances.ad_group_id)
                          ->  Index Scan using scenario_ad_groups_pkey on scenario_ad_groups  (cost=0.00..2227.06 rows=14779 width=114) (actual time=0.009..12.085 rows=14197 loops=1)
                                Filter: (scenario_id = 22)
                          ->  Index Scan using index_scenario_ad_group_vendor_instances_on_ad_group_id on scenario_ad_group_vendor_instances  (cost=0.00..1737.02 rows=27447 width=8) (actual time=0.007..7.021 rows=16528 loops=1)
                                Filter: (ad_vendor_id = ANY ('{1,2,3}'::integer[]))
                    ->  Index Scan using index_ad_group_performances_on_vendor_instance_id_and_date on scenario_ad_group_performances  (cost=0.00..1.73 rows=11 width=44) (actual time=0.002..0.003 rows=3 loops=14197)
                          Index Cond: ((vendor_instance_id = scenario_ad_group_vendor_instances.id) AND (date >= '2012-02-01'::date) AND (date <= '2012-02-28'::date))
Total runtime: 159.710 ms

Production:

Limit  (cost=822401.35..822401.40 rows=20 width=179) (actual time=21279.547..21279.591 rows=20 loops=1)
  ->  Sort  (cost=822401.35..822488.42 rows=34828 width=179) (actual time=21279.543..21279.560 rows=20 loops=1)
        Sort Key: (sum(scenario_ad_group_performances.clicks))
        Sort Method: top-N heapsort  Memory: 33kB
        ->  GroupAggregate  (cost=775502.60..821474.59 rows=34828 width=179) (actual time=19126.783..21226.772 rows=34495 loops=1)
              ->  Sort  (cost=775502.60..776739.48 rows=494751 width=179) (actual time=19125.902..19884.164 rows=675253 loops=1)
                    Sort Key: scenario_ad_groups.id
                    Sort Method: external merge  Disk: 94200kB
                    ->  Hash Right Join  (cost=25743.86..596796.70 rows=494751 width=179) (actual time=1155.491..16720.460 rows=675253 loops=1)
                          Hash Cond: (scenario_ad_group_performances.vendor_instance_id = scenario_ad_group_vendor_instances.id)
                          ->  Seq Scan on scenario_ad_group_performances  (cost=0.00..476354.29 rows=4158678 width=44) (actual time=0.043..8949.640 rows=4307019 loops=1)
                                Filter: ((date >= '2012-02-01'::date) AND (date <= '2012-02-28'::date))
                          ->  Hash  (cost=24047.72..24047.72 rows=51371 width=143) (actual time=1123.896..1123.896 rows=34495 loops=1)
                                Buckets: 1024  Batches: 16  Memory Usage: 392kB
                                ->  Hash Right Join  (cost=6625.90..24047.72 rows=51371 width=143) (actual time=92.257..1070.786 rows=34495 loops=1)
                                      Hash Cond: (scenario_ad_group_vendor_instances.ad_group_id = scenario_ad_groups.id)
                                      ->  Seq Scan on scenario_ad_group_vendor_instances  (cost=0.00..11336.31 rows=317174 width=8) (actual time=0.020..451.496 rows=431770 loops=1)
                                            Filter: (ad_vendor_id = ANY ('{1,2,3}'::integer[]))
                                      ->  Hash  (cost=5475.55..5475.55 rows=34828 width=139) (actual time=88.311..88.311 rows=34495 loops=1)
                                            Buckets: 1024  Batches: 8  Memory Usage: 726kB
                                            ->  Bitmap Heap Scan on scenario_ad_groups  (cost=798.20..5475.55 rows=34828 width=139) (actual time=4.451..44.065 rows=34495 loops=1)
                                                  Recheck Cond: (scenario_id = 276)
                                                  ->  Bitmap Index Scan on index_scenario_ad_groups_on_scenario_id  (cost=0.00..789.49 rows=34828 width=0) (actual time=4.232..4.232 rows=37006 loops=1)
                                                        Index Cond: (scenario_id = 276)
Total runtime: 21306.697 ms
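
(The query itself isn't shown; reconstructed from the plans above, it presumably looks roughly like the following. The select list, join conditions and sort direction are guesses.)

SELECT scenario_ad_groups.id,
       sum(scenario_ad_group_performances.clicks) AS clicks
FROM scenario_ad_groups
LEFT JOIN scenario_ad_group_vendor_instances
  ON scenario_ad_group_vendor_instances.ad_group_id = scenario_ad_groups.id
 AND scenario_ad_group_vendor_instances.ad_vendor_id IN (1, 2, 3)
LEFT JOIN scenario_ad_group_performances
  ON scenario_ad_group_performances.vendor_instance_id = scenario_ad_group_vendor_instances.id
 AND scenario_ad_group_performances.date BETWEEN '2012-02-01' AND '2012-02-28'
WHERE scenario_ad_groups.scenario_id = 22   -- 276 in the production plan
GROUP BY scenario_ad_groups.id
ORDER BY sum(scenario_ad_group_performances.clicks)
LIMIT 20;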
asked Feb 28 '12 by vonconrad

People also ask

Why is Postgres query not using index?

Even when given an index it could use, Postgres will not always choose to use it. The reason is that indexes have a cost to create and maintain (on writes) and to use (on reads).

Does Postgres automatically use index?

PostgreSQL automatically creates a unique index when a unique constraint or primary key is defined for a table. The index covers the columns that make up the primary key or unique constraint (a multicolumn index, if appropriate), and is the mechanism that enforces the constraint.
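
For illustration, a small sketch (the table name is hypothetical) showing the indexes PostgreSQL creates on its own:

-- A primary key or unique constraint implicitly creates a backing index:
CREATE TABLE example_orders (
    id    bigint PRIMARY KEY,      -- creates example_orders_pkey automatically
    sku   text,
    buyer text,
    UNIQUE (sku, buyer)            -- creates a multicolumn unique index
);

-- List the indexes PostgreSQL created without an explicit CREATE INDEX:
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'example_orders';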

Why does Postgres use seq scan instead of index?

If the SELECT returns more than approximately 5-10% of all rows in the table, a sequential scan is much faster than an index scan. This is because an index scan requires several IO operations for each row (look up the row in the index, then retrieve the row from the heap).
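
A quick way to see this behavior (table, columns and values are made up for illustration; assume indexes exist on author_id and created_at):

EXPLAIN SELECT * FROM posts WHERE author_id = 42;             -- matches few rows: likely an index scan
EXPLAIN SELECT * FROM posts WHERE created_at > '2010-01-01';  -- matches most of the table: likely a seq scan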

Does index always improve query performance?

Indexes are meant to speed up the performance of a database, so use indexing whenever it significantly improves the performance of your queries. The larger your database becomes, the more likely you are to see benefits from indexing.


3 Answers

It seems to me that your dev data is much "simpler" than the production data. As an example:

Development:

->  Index Scan using index_scenario_ad_group_vendor_instances_on_ad_group_id on scenario_ad_group_vendor_instances  
(cost=0.00..1737.02 rows=27447 width=8) 
(actual time=0.007..7.021 rows=16528 loops=1)
Filter: (ad_vendor_id = ANY ('{1,2,3}'::integer[]))

Production:

->  Seq Scan on scenario_ad_group_vendor_instances  
(cost=0.00..11336.31 rows=317174 width=8) 
(actual time=0.020..451.496 rows=431770 loops=1)
Filter: (ad_vendor_id = ANY ('{1,2,3}'::integer[]))

This means that in dev 27447 matching rows were estimated upfront and 16528 rows were actually found. That's the same ballpark and OK.

In production 317174 matching rows were estimated upfront and 431770 rows were found. Also OK.

But comparing dev to prod, the numbers differ by a factor of 10. As other answers already indicate, doing 10 times as many random seeks (due to index access) may indeed be slower than a plain table scan.

Hence the interesting question is: how many rows does this table contain, both in dev and in prod? Is number_returned_rows / number_total_rows comparable between dev and prod?
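
One way to check this in both environments (a sketch; adjust the filter to match the query's actual predicate):

SELECT sum(CASE WHEN ad_vendor_id IN (1, 2, 3) THEN 1 ELSE 0 END) AS matching_rows,
       count(*)                                                   AS total_rows
FROM scenario_ad_group_vendor_instances;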

Edit: Don't forget that I picked one index access as an example. A quick glance shows that the other index accesses have the same symptoms.

answered Sep 30 '22 by A.H.


Disclaimer

I have used PostgreSQL very little. I'm answering based on my knowledge of SQL Server index usage and execution plans. I ask the PostgreSQL gods for mercy if I get something wrong.

Query Optimizers are Dynamic

You said your query plan has changed between your development and production environments. This is to be expected. Query optimizers are designed to generate the optimal execution plan based on current data conditions. Under different conditions, the optimizer may decide it is more efficient to use a table scan rather than an index scan.

When would it be more efficient to use a table scan vs an index scan?

SELECT A, B
FROM someTable
WHERE A = 'SOME VALUE'

Let's say you have a non-clustered index on column A. In this case you are filtering on column A, which could potentially take advantage of the index. This would be efficient if the index is selective enough; selectivity is essentially how many distinct values the indexed column contains relative to the total number of rows. The database keeps statistics on this selectivity and uses them when calculating costs for execution plans.

If you have a million rows in a table, but only 10 possible values for A, then your query would likely return about 100K rows. Because the index is non-clustered, and you are returning a column not included in the index (B), a lookup will need to be performed for each row returned. These lookups are random-access lookups, which are much more expensive than the sequential reads used by a table scan. At a certain point it becomes more efficient for the database to just perform a table scan rather than an index scan.
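
In PostgreSQL, the per-column statistics the planner works from can be inspected via the pg_stats view (a sketch using the generic names from the example above; unquoted identifiers fold to lowercase):

SELECT attname, n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 'sometable'
  AND attname = 'a';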

This is just one scenario; there are many others. It's hard to know without knowing more about what your data looks like, what your indexes look like, and how you are trying to access the data.

To answer the original question:

Would PostgreSQL abstain from using indexes if they (or the table) are too big? No. It is more likely that, given the way you are accessing the data, it is less efficient for PostgreSQL to use the index than to use a table scan.

The PostgreSQL FAQ touches on this very subject (see: Why are my queries slow? Why don't they use my indexes?): https://wiki.postgresql.org/wiki/FAQ#Why_are_my_queries_slow.3F_Why_don.27t_they_use_my_indexes.3F

answered Sep 30 '22 by J Cooper


Try

SET enable_seqscan TO 'off'

before EXPLAIN ANALYZE
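
For example, as a session-local experiment (this only discourages sequential scans so you can compare plans; it is a diagnostic, not a production fix, and the query below is just a stand-in for the real one):

SET enable_seqscan TO 'off';
EXPLAIN ANALYZE
SELECT count(*)
FROM scenario_ad_group_performances
WHERE date BETWEEN '2012-02-01' AND '2012-02-28';  -- substitute the real query here
RESET enable_seqscan;                              -- restore the default afterwards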

answered Sep 30 '22 by Ramil Amerzyanov