PostgreSQL query not using index in production

I'm noticing something strange:

The exact same query in development and production is not using the same query plan. In particular, the development version uses indexes that are omitted in production (in favor of a seq scan).

The only real difference is that the production dataset is significantly larger: the index is 1034 MB in production vs 29 MB in development. Would PostgreSQL abstain from using indexes if they (or the table) are too big?

EDIT: EXPLAIN ANALYZE for both queries:

Development:

Limit  (cost=41638.15..41638.20 rows=20 width=154) (actual time=159.576..159.581 rows=20 loops=1)
  ->  Sort  (cost=41638.15..41675.10 rows=14779 width=154) (actual time=159.575..159.577 rows=20 loops=1)
        Sort Key: (sum(scenario_ad_group_performances.clicks))
        Sort Method: top-N heapsort  Memory: 35kB
        ->  GroupAggregate  (cost=0.00..41244.89 rows=14779 width=154) (actual time=0.040..151.535 rows=14197 loops=1)
              ->  Nested Loop Left Join  (cost=0.00..31843.75 rows=93800 width=154) (actual time=0.022..82.509 rows=50059 loops=1)
                    ->  Merge Left Join  (cost=0.00..4203.46 rows=14779 width=118) (actual time=0.017..27.103 rows=14197 loops=1)
                          Merge Cond: (scenario_ad_groups.id = scenario_ad_group_vendor_instances.ad_group_id)
                          ->  Index Scan using scenario_ad_groups_pkey on scenario_ad_groups  (cost=0.00..2227.06 rows=14779 width=114) (actual time=0.009..12.085 rows=14197 loops=1)
                                Filter: (scenario_id = 22)
                          ->  Index Scan using index_scenario_ad_group_vendor_instances_on_ad_group_id on scenario_ad_group_vendor_instances  (cost=0.00..1737.02 rows=27447 width=8) (actual time=0.007..7.021 rows=16528 loops=1)
                                Filter: (ad_vendor_id = ANY ('{1,2,3}'::integer[]))
                    ->  Index Scan using index_ad_group_performances_on_vendor_instance_id_and_date on scenario_ad_group_performances  (cost=0.00..1.73 rows=11 width=44) (actual time=0.002..0.003 rows=3 loops=14197)
                          Index Cond: ((vendor_instance_id = scenario_ad_group_vendor_instances.id) AND (date >= '2012-02-01'::date) AND (date <= '2012-02-28'::date))
Total runtime: 159.710 ms

Production:

Limit  (cost=822401.35..822401.40 rows=20 width=179) (actual time=21279.547..21279.591 rows=20 loops=1)
  ->  Sort  (cost=822401.35..822488.42 rows=34828 width=179) (actual time=21279.543..21279.560 rows=20 loops=1)
        Sort Key: (sum(scenario_ad_group_performances.clicks))
        Sort Method: top-N heapsort  Memory: 33kB
        ->  GroupAggregate  (cost=775502.60..821474.59 rows=34828 width=179) (actual time=19126.783..21226.772 rows=34495 loops=1)
              ->  Sort  (cost=775502.60..776739.48 rows=494751 width=179) (actual time=19125.902..19884.164 rows=675253 loops=1)
                    Sort Key: scenario_ad_groups.id
                    Sort Method: external merge  Disk: 94200kB
                    ->  Hash Right Join  (cost=25743.86..596796.70 rows=494751 width=179) (actual time=1155.491..16720.460 rows=675253 loops=1)
                          Hash Cond: (scenario_ad_group_performances.vendor_instance_id = scenario_ad_group_vendor_instances.id)
                          ->  Seq Scan on scenario_ad_group_performances  (cost=0.00..476354.29 rows=4158678 width=44) (actual time=0.043..8949.640 rows=4307019 loops=1)
                                Filter: ((date >= '2012-02-01'::date) AND (date <= '2012-02-28'::date))
                          ->  Hash  (cost=24047.72..24047.72 rows=51371 width=143) (actual time=1123.896..1123.896 rows=34495 loops=1)
                                Buckets: 1024  Batches: 16  Memory Usage: 392kB
                                ->  Hash Right Join  (cost=6625.90..24047.72 rows=51371 width=143) (actual time=92.257..1070.786 rows=34495 loops=1)
                                      Hash Cond: (scenario_ad_group_vendor_instances.ad_group_id = scenario_ad_groups.id)
                                      ->  Seq Scan on scenario_ad_group_vendor_instances  (cost=0.00..11336.31 rows=317174 width=8) (actual time=0.020..451.496 rows=431770 loops=1)
                                            Filter: (ad_vendor_id = ANY ('{1,2,3}'::integer[]))
                                      ->  Hash  (cost=5475.55..5475.55 rows=34828 width=139) (actual time=88.311..88.311 rows=34495 loops=1)
                                            Buckets: 1024  Batches: 8  Memory Usage: 726kB
                                            ->  Bitmap Heap Scan on scenario_ad_groups  (cost=798.20..5475.55 rows=34828 width=139) (actual time=4.451..44.065 rows=34495 loops=1)
                                                  Recheck Cond: (scenario_id = 276)
                                                  ->  Bitmap Index Scan on index_scenario_ad_groups_on_scenario_id  (cost=0.00..789.49 rows=34828 width=0) (actual time=4.232..4.232 rows=37006 loops=1)
                                                        Index Cond: (scenario_id = 276)
Total runtime: 21306.697 ms
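
(The query itself isn't shown; reconstructed from the plans above, it presumably looks roughly like the following. The select list, join conditions and sort direction are guesses.)

SELECT scenario_ad_groups.id,
       sum(scenario_ad_group_performances.clicks) AS clicks
FROM scenario_ad_groups
LEFT JOIN scenario_ad_group_vendor_instances
  ON scenario_ad_group_vendor_instances.ad_group_id = scenario_ad_groups.id
 AND scenario_ad_group_vendor_instances.ad_vendor_id IN (1, 2, 3)
LEFT JOIN scenario_ad_group_performances
  ON scenario_ad_group_performances.vendor_instance_id = scenario_ad_group_vendor_instances.id
 AND scenario_ad_group_performances.date BETWEEN '2012-02-01' AND '2012-02-28'
WHERE scenario_ad_groups.scenario_id = 22   -- 276 in the production plan
GROUP BY scenario_ad_groups.id
ORDER BY sum(scenario_ad_group_performances.clicks)
LIMIT 20;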
asked Feb 28 '12 by vonconrad

People also ask

Why is Postgres query not using index?

Even when given an index it could use, Postgres will not always choose to use it. The reason is that indexes have a cost to create and maintain (on writes) and to use (on reads).

Does Postgres automatically use index?

PostgreSQL automatically creates a unique index when a unique constraint or primary key is defined for a table. The index covers the columns that make up the primary key or unique constraint (a multicolumn index, if appropriate), and is the mechanism that enforces the constraint.
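
For illustration, a small sketch (the table name is hypothetical) showing the indexes PostgreSQL creates on its own:

-- A primary key or unique constraint implicitly creates a backing index:
CREATE TABLE example_orders (
    id    bigint PRIMARY KEY,      -- creates example_orders_pkey automatically
    sku   text,
    buyer text,
    UNIQUE (sku, buyer)            -- creates a multicolumn unique index
);

-- List the indexes PostgreSQL created without an explicit CREATE INDEX:
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'example_orders';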

Why does Postgres use seq scan instead of index?

If the SELECT returns more than approximately 5-10% of all rows in the table, a sequential scan is much faster than an index scan. This is because an index scan requires several IO operations for each row (look up the row in the index, then retrieve the row from the heap).
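
A quick way to see this behavior (table, columns and values are made up for illustration; assume indexes exist on author_id and created_at):

EXPLAIN SELECT * FROM posts WHERE author_id = 42;             -- matches few rows: likely an index scan
EXPLAIN SELECT * FROM posts WHERE created_at > '2010-01-01';  -- matches most of the table: likely a seq scan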

Does index always improve query performance?

Indexes are meant to speed up the performance of a database, so use indexing whenever it significantly improves the performance of your queries. The larger your database becomes, the more likely you are to see benefits from indexing.


3 Answers

It seems to me that your dev data is much "simpler" than the production data. As an example:

Development:

->  Index Scan using index_scenario_ad_group_vendor_instances_on_ad_group_id on scenario_ad_group_vendor_instances  
(cost=0.00..1737.02 rows=27447 width=8) 
(actual time=0.007..7.021 rows=16528 loops=1)
Filter: (ad_vendor_id = ANY ('{1,2,3}'::integer[]))

Production:

->  Seq Scan on scenario_ad_group_vendor_instances  
(cost=0.00..11336.31 rows=317174 width=8) 
(actual time=0.020..451.496 rows=431770 loops=1)
Filter: (ad_vendor_id = ANY ('{1,2,3}'::integer[]))

This means that in dev 27447 matching rows were estimated upfront and 16528 rows were actually found. That's the same ballpark and OK.

In production 317174 matching rows were estimated upfront and 431770 rows were found. Also OK.

But comparing dev to prod, the numbers differ by a factor of 10. As other answers already indicate, doing 10 times as many random seeks (due to index access) may indeed be slower than a plain table scan.

Hence the interesting question is: how many rows does this table contain, both in dev and in prod? Is number_returned_rows / number_total_rows comparable between dev and prod?
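
One way to check this in both environments (a sketch; adjust the filter to match the query's actual predicate):

SELECT sum(CASE WHEN ad_vendor_id IN (1, 2, 3) THEN 1 ELSE 0 END) AS matching_rows,
       count(*)                                                   AS total_rows
FROM scenario_ad_group_vendor_instances;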

Edit: Don't forget that I picked one index access as an example. A quick glance shows that the other index accesses have the same symptoms.

answered Sep 30 '22 by A.H.


Disclaimer

I have used PostgreSQL very little. I'm answering based on my knowledge of SQL Server index usage and execution plans. I ask the PostgreSQL gods for mercy if I get something wrong.

Query Optimizers are Dynamic

You said your query plan has changed between your development and production environments. This is to be expected. Query optimizers are designed to generate the optimal execution plan based on current data conditions. Under different conditions, the optimizer may decide it is more efficient to use a table scan rather than an index scan.

When would it be more efficient to use a table scan vs an index scan?

SELECT A, B
FROM someTable
WHERE A = 'SOME VALUE'

Let's say you have a non-clustered index on column A. In this case you are filtering on column A, which could potentially take advantage of the index. This would be efficient if the index is selective enough; selectivity is essentially how many distinct values the indexed column contains relative to the total number of rows. The database keeps statistics on this selectivity and uses them when calculating costs for execution plans.

If you have a million rows in a table, but only 10 possible values for A, then your query would likely return about 100K rows. Because the index is non-clustered, and you are returning a column not included in the index (B), a lookup will need to be performed for each row returned. These lookups are random-access lookups, which are much more expensive than the sequential reads used by a table scan. At a certain point it becomes more efficient for the database to just perform a table scan rather than an index scan.
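
In PostgreSQL, the per-column statistics the planner works from can be inspected via the pg_stats view (a sketch using the generic names from the example above; unquoted identifiers fold to lowercase):

SELECT attname, n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 'sometable'
  AND attname = 'a';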

This is just one scenario; there are many others. It's hard to know without knowing more about what your data looks like, what your indexes look like, and how you are trying to access the data.

To answer the original question:

Would PostgreSQL abstain from using indexes if they (or the table) are too big? No. It is more likely that, given the way you are accessing the data, it is less efficient for PostgreSQL to use the index than to use a table scan.

The PostgreSQL FAQ touches on this very subject (see: Why are my queries slow? Why don't they use my indexes?): https://wiki.postgresql.org/wiki/FAQ#Why_are_my_queries_slow.3F_Why_don.27t_they_use_my_indexes.3F

answered Sep 30 '22 by J Cooper


Try

SET enable_seqscan TO 'off'

before EXPLAIN ANALYZE
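
For example, as a session-local experiment (this only discourages sequential scans so you can compare plans; it is a diagnostic, not a production fix, and the query below is just a stand-in for the real one):

SET enable_seqscan TO 'off';
EXPLAIN ANALYZE
SELECT count(*)
FROM scenario_ad_group_performances
WHERE date BETWEEN '2012-02-01' AND '2012-02-28';  -- substitute the real query here
RESET enable_seqscan;                              -- restore the default afterwards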

answered Sep 30 '22 by Ramil Amerzyanov