Inconsistent statistics on expression with partial index

Question

[PostgreSQL 9.6.1 on x86_64-pc-linux-gnu, compiled by gcc (Debian 6.2.0-10) 6.2.0 20161027, 64-bit]

I have a table with timestamp ranges:

create table testing.test as 
select tsrange(d, null) ts from 
generate_series(timestamp '2000-01-01', timestamp '2018-01-01', interval '1 minute') s(d);

I need to run the following query:

select * 
from testing.test 
where lower(ts)> '2017-06-17 20:00:00'::timestamp and upper_inf(ts)

Explain analyze result for table without indexes:

Seq Scan on test  (cost=0.00..72482.26 rows=1052013 width=14) (actual time=2165.477..2239.781 rows=283920 loops=1)
  Filter: (upper_inf(ts) AND (lower(ts) > '2017-06-17 20:00:00'::timestamp without time zone))
  Rows Removed by Filter: 9184081
Planning time: 0.046 ms
Execution time: 2250.221 ms

Next, I add the following partial index:

create index lower_rt_inf ON testing.test using btree(lower(ts)) where upper_inf(ts);    
analyze testing.test;

Explain analyze result for table with partial index:

Index Scan using lower_rt_inf on test  (cost=0.04..10939.03 rows=1051995 width=14) (actual time=0.037..52.083 rows=283920 loops=1)
  Index Cond: (lower(ts) > '2017-06-17 20:00:00'::timestamp without time zone)
Planning time: 0.156 ms
Execution time: 62.900 ms

And:

SELECT null_frac, n_distinct, correlation FROM pg_catalog.pg_stats WHERE tablename = 'lower_rt_inf'

null_frac |n_distinct |correlation |
----------|-----------|------------|
0         |-1         |1           |

Then I create an index similar to the previous one, but without the partial condition:

create index lower_rt_full ON testing.test using btree(lower(ts));
analyze testing.test;

Now the same partial index is still used, but the estimated cost and row count are different:

Index Scan using lower_rt_inf on test  (cost=0.04..1053.87 rows=101256 width=14) (actual time=0.029..58.613 rows=283920 loops=1)
  Index Cond: (lower(ts) > '2017-06-17 20:00:00'::timestamp without time zone)
Planning time: 0.280 ms
Execution time: 71.794 ms

And a bit more:

select * from testing.test where lower(ts)> '2017-06-17 20:00:00'::timestamp;

Index Scan using lower_rt_full on test  (cost=0.04..3159.52 rows=303767 width=14) (actual time=0.036..64.208 rows=283920 loops=1)
  Index Cond: (lower(ts) > '2017-06-17 20:00:00'::timestamp without time zone)
Planning time: 0.099 ms
Execution time: 78.759 ms

How can I effectively use partial indexes for expressions?

Laurenz Albe · Accepted Answer

What happens here is that the statistics on index lower_rt_full are used to estimate the row count, but statistics on lower_rt_inf, which is a partial index, aren't.
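To see that both indexes do carry their own statistics after ANALYZE (even though only those of the non-partial one feed the row estimate), a query along these lines can be used. It is a minimal sketch that extends the pg_stats query from the question; the schema and index names are the ones used there.

```sql
-- Statistics gathered for expression indexes are listed in pg_stats
-- under the index name, as in the query from the question.
SELECT tablename, attname, null_frac, n_distinct, correlation
FROM pg_catalog.pg_stats
WHERE schemaname = 'testing'
  AND tablename IN ('lower_rt_inf', 'lower_rt_full');
```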

Since functions are a black box for PostgreSQL, it has no idea about the distribution of lower(ts) and uses a bad estimate.

After lower_rt_full has been created and the table analyzed, PostgreSQL has a good idea about this distribution and can estimate much better. Even if the index isn't used to execute the query, it is used for query planning.
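As a rough check, the row estimates in the plans from the question can be approximately reproduced by hand. This is only a sketch: it assumes that the planner falls back to a default selectivity of about 1/3 for each condition it has no statistics for, which is not stated in the answer itself.

```sql
-- Assumption: default selectivity of ~0.3333333 per condition without statistics.
-- 9468001 = 283920 rows returned + 9184081 rows removed by the filter.
-- 303767 is the planner's estimate for lower(ts) > ... once lower_rt_full exists.
SELECT round(9468001 * 0.3333333 * 0.3333333) AS est_without_stats,
       round(303767 * 0.3333333)              AS est_with_lower_rt_full;
```

Under that assumption, the first value lands close to the rows=1052013 of the first plans (the planner works from its own estimate of the table size, so a small difference is expected), and the second corresponds to the rows=101256 seen after lower_rt_full was created. In the second case only upper_inf(ts) is still estimated blindly.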

Since upper_inf is also a function (black box), you would get an even better estimate if you had an index ON test (upper_inf(ts), lower(ts)).
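A minimal sketch of such an index follows. The index name upper_inf_lower_full is hypothetical; the column list is the one suggested above.

```sql
-- Non-partial index on both expressions, so that ANALYZE collects
-- whole-table statistics for upper_inf(ts) as well as lower(ts).
-- The index name is made up for illustration.
CREATE INDEX upper_inf_lower_full
    ON testing.test USING btree (upper_inf(ts), lower(ts));

ANALYZE testing.test;
```

As with lower_rt_full, the point of this index is the statistics it provides for planning, whether or not it is chosen to execute the query.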

For an explanation of why partial indexes are not considered when estimating the number of result rows, see this comment in examine_variable in backend/utils/adt/selfuncs.c, the function that tries to find statistical data about an expression:

 * Has it got stats?  We only consider stats for
 * non-partial indexes, since partial indexes probably
 * don't reflect whole-relation statistics; the above
 * check for uniqueness is the only info we take from
 * a partial index.

Tags:

postgresql

postgresql-9.6

Dzmitry