
Why this query is not using index only scan in postgresql

Tags: sql, postgresql

I have a table with 16 columns, in which there are a primary key and a column to store values. I want to select all the values in a certain range. The value column (easyid) has been indexed.

create table tb1 (
    id int primary key,
    easyid int,
    .....
);
create index i_easyid on tb1 (easyid);

Other info: PostgreSQL 9.4, autovacuum disabled. The SQL is:

select "easyid" from "tb1" where "easyid" between 12183318 and 82283318

Theoretically PostgreSQL should use an index-only scan on i_easyid, but it only does so when the range "easyid" between A and B is small. When the range is large, i.e. B-A is a pretty big number, PostgreSQL uses a bitmap index scan on i_easyid followed by a bitmap heap scan on tb1.

I was wrong to say that whether an index-only scan is used depends on the range size. I tried the same query with different parameters; sometimes it is an index-only scan and sometimes it is not.

The table tb1 is very large, about 17 GB; i_easyid is 600 MB.

Here is the EXPLAIN ANALYZE output. I don't understand why fetching about 5000 rows can take more than 10 seconds.

sample_pg=# explain analyze select easyid from tb1 where "easyid" between 152183318 and 152283318;
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tb1  (cost=97.70..17227.71 rows=4416 width=4) (actual time=1.155..14346.311 rows=5004 loops=1)
   Recheck Cond: ((easyid >= 152183318) AND (easyid <= 152283318))
   Heap Blocks: exact=4995
   ->  Bitmap Index Scan on i_easyid  (cost=0.00..96.60 rows=4416 width=0) (actual time=0.586..0.586 rows=5004 loops=1)
         Index Cond: ((easyid >= 152183318) AND (easyid <= 152283318))
 Planning time: 0.080 ms
 Execution time: 14348.037 ms
(7 rows)

Here is an example of index only scan:

sample_pg=# explain analyze verbose select easyid from tb1 where "easyid" between 32280318 and 32283318;
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using i_easyid on public.tb1  (cost=0.44..281.82 rows=69 width=4) (actual time=14.585..160.624 rows=33 loops=1)
   Output: easyid
   Index Cond: ((tb1.easyid >= 32280318) AND (tb1.easyid <= 32283318))
   Heap Fetches: 33
 Planning time: 0.085 ms
 Execution time: 160.654 ms
(6 rows)
asked Apr 06 '15 by worldterminator
People also ask

Why is Postgres query not using index?

There are two main reasons that Postgres will not use an index: either it can't use the index, or it doesn't think using the index will be faster. Working out which of these applies in your case is a great starting point.

What is index-only scan?

An index-only scan, after finding a candidate index entry, checks the visibility map bit for the corresponding heap page. If it's set, the row is known visible and so the data can be returned with no further work.


2 Answers

I'm not 100% sure, but I suspect that PostgreSQL believes it will be faster to read the table than the index, because of random_page_cost. The index-order read is potentially higher cost because of the need to fetch essentially random heap pages.

The page references found in the index need sorting into physical order before the heap is read, but the cost calculations evidently suggest that the total cost of (bitmap index scan + sorted heap reads) is lower than that of (random index-order heap fetches).

This is partially testable by changing the value of random_page_cost, which would be worth investigating if you're using very fast disks or an SSD anyway.
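A minimal way to run that test, session-local so nothing persists (assuming the same table and query as in the question):

```sql
-- Lower random_page_cost for this session only. The default is 4.0;
-- values close to 1.0 are common for SSD-backed storage.
SET random_page_cost = 1.1;

-- Re-run the plan and see whether it switches to an index-only scan.
EXPLAIN ANALYZE
SELECT easyid FROM tb1 WHERE easyid BETWEEN 152183318 AND 152283318;

-- Restore the configured default for this session when done.
RESET random_page_cost;
```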

answered Oct 28 '22 by David Aldridge

autovacuum is not running

PostgreSQL index-only scans require some information about which rows are "visible" to current transactions - i.e. not deleted, not old versions of updated rows, and not uncommitted inserts or new versions of updates.

This information is kept in the "visibility map".

The visibility map is maintained by VACUUM, usually in the background by autovacuum workers.

If autovacuum is not keeping up with write activity well, or if autovacuum has been disabled, then index-only scans probably won't be used because PostgreSQL will see that the visibility map does not have data for enough of the table.

Turn autovacuum back on, then manually VACUUM the table to bring it up to date immediately.
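Concretely, that could look like this (a sketch; ALTER SYSTEM is available from PostgreSQL 9.4, and tb1 is the table from the question):

```sql
-- Re-enable autovacuum cluster-wide, then reload the configuration.
ALTER SYSTEM SET autovacuum = on;
SELECT pg_reload_conf();

-- Vacuum and analyze the table once by hand so the visibility map
-- and the planner statistics are current right away.
VACUUM (ANALYZE, VERBOSE) tb1;
```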

BTW, in addition to visibility map information, VACUUM (including autovacuum) can also set hint bits that make SELECTs of recently inserted/updated data faster.

Autovacuum also maintains table statistics that are vital for effective query planning. Turning it off will result in the planner using increasingly stale information.

It is also absolutely vital for preventing transaction ID wraparound, an emergency condition that can force the whole database into emergency shutdown until a time-consuming VACUUM is performed.

Do not turn autovacuum off.

As for why it's sometimes using an index-only scan and sometimes not, a few possibilities:

  • The current random_page_cost setting makes it think that random I/O will be slower than it really is, so it tries harder to avoid it;

  • The table statistics, especially the limit values, are outdated. So it doesn't realise that there's a good chance the value being looked for will be discovered quickly in an index-only scan;

  • The visibility map is outdated, so it thinks an index-only scan will find too many values that will require heap fetches to check, making it slower than other methods especially if it thinks the proportion of values likely to be found is high.
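The third point can be checked directly: pg_class records how many of a table's heap pages are known all-visible (relallvisible, maintained since 9.2). If that number is small relative to relpages, the planner will price index-only scans as requiring many heap fetches. A sketch against the question's table:

```sql
-- Fraction of heap pages marked all-visible in the visibility map.
SELECT relpages,
       relallvisible,
       round(100.0 * relallvisible / greatest(relpages, 1), 1)
           AS pct_all_visible
FROM pg_class
WHERE relname = 'tb1';
```

A low pct_all_visible after re-enabling autovacuum usually just means VACUUM hasn't caught up with the table yet.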

Most of these issues are fixed by leaving autovacuum alone. In fact, on frequently appended tables you should configure autovacuum to run much more often than the default, so that it refreshes the limit statistics more frequently. (That helps work around a planner issue with tables where the most frequently queried data is also the most recently inserted: with an incrementing ID or timestamp, the most-desired values are never in the table's histograms and limit stats.)

Go turn autovacuum back on - then turn it up.
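"Turning it up" can be done per table with storage parameters; the values below are illustrative, not a one-size-fits-all recommendation:

```sql
-- Trigger autovacuum/autoanalyze after ~1% of rows change instead of
-- the defaults (20% for vacuum, 10% for analyze), so the statistics
-- and the visibility map stay fresh on a busy, growing table.
ALTER TABLE tb1 SET (
    autovacuum_vacuum_scale_factor  = 0.01,
    autovacuum_analyze_scale_factor = 0.01
);
```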

answered Oct 28 '22 by Craig Ringer