Sorry, there's a lot of context before the actual question; we've thoroughly researched this and I wanted to give you the full picture.
Some context: Postgres index-only scans rely on the visibility map (VM). If a page is not marked as all-visible in the visibility map, Postgres fetches that page from the heap to check that the data is visible to the current transaction, even when doing an index-only scan. Unfortunately, this can greatly slow down index-only scans. The index might return results for 10k rows while the relevant part of the index spans only 50 pages (very fast in terms of IO). However, if the VM bits aren't set, Postgres makes an extra 10k heap fetches (200x more IO).
Details: https://wiki.postgresql.org/wiki/Index-only_scans#The_Visibility_Map_.28and_other_relation_forks.29
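To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. The function name and the all_visible_fraction parameter are made up for illustration, and it assumes the worst case where every row on a not-all-visible page costs one separate heap page read. In practice rows share pages and pages are often cached, so the real penalty can be smaller.

```python
def index_only_scan_page_reads(index_pages, matching_rows, all_visible_fraction):
    """Estimate page reads for an index-only scan.

    Assumption (worst case): each matching row whose heap page is NOT
    marked all-visible costs one extra heap page read.
    Returns (total_page_reads, heap_fetches).
    """
    if not 0.0 <= all_visible_fraction <= 1.0:
        raise ValueError("all_visible_fraction must be between 0 and 1")
    heap_fetches = round(matching_rows * (1.0 - all_visible_fraction))
    return index_pages + heap_fetches, heap_fetches


# Numbers from the question: 50 index pages, 10k matching rows.
best_total, best_fetches = index_only_scan_page_reads(50, 10_000, 1.0)
worst_total, worst_fetches = index_only_scan_page_reads(50, 10_000, 0.0)

print("all pages all-visible:", best_total, "page reads")
print("no pages all-visible: ", worst_total, "page reads")
print("extra heap IO relative to index IO:", worst_fetches / 50)
```

With every page marked all-visible the scan touches only the 50 index pages; with none marked, the heap fetches dominate the cost.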
Try it yourself: run EXPLAIN ANALYZE on an index-only query before and after a VACUUM. You should see the number of heap fetches go down after the VACUUM (assuming some pages were not marked all-visible in the VM beforehand).
Already tried: We've already tuned autovacuum, and we're vacuuming regularly. This helps a lot, but we'd like to get even faster.
Question (finally): Is it possible to skip heap fetches when doing index only scans? I'm aware we wouldn't have perfect MVCC when reading, but we're okay with that. The data in the index is close enough, and it's definitely not worth the overhead of thousands of heap fetches to make sure we're not looking at slightly stale data. To borrow a term from NoSQL, we'd be fine with "eventual consistency" reads.
Thanks!
To solve this performance problem, PostgreSQL supports index-only scans, which can answer queries from an index alone without any heap access. The basic idea is to return values directly out of each index entry instead of consulting the associated heap entry.
The visibility map stores two bits per heap page. The first bit, if set, indicates that the page is all-visible, or in other words that the page does not contain any tuples that need to be vacuumed. This information can also be used by index-only scans to answer queries using only the index tuple.
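As a rough illustration of how the all-visible bit gates heap access, here is a toy Python model. It is not PostgreSQL's implementation; IndexEntry, count_heap_fetches, and the page layout (10 heap pages of 100 rows each) are hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    key: int
    heap_page: int  # heap page holding the row this entry points to


def count_heap_fetches(index_entries, all_visible_pages):
    """Count index entries that would need a heap visit.

    An entry can be answered from the index alone only if its heap
    page is in the set of pages marked all-visible.
    """
    fetches = 0
    for entry in index_entries:
        if entry.heap_page not in all_visible_pages:
            fetches += 1
    return fetches


# Toy table: 1000 rows, 100 rows per heap page (pages 0..9).
entries = [IndexEntry(key=k, heap_page=k // 100) for k in range(1000)]

# Right after a vacuum, assume every page is marked all-visible.
after_vacuum = set(range(10))
# A later write to page 3 clears that page's all-visible bit.
after_write = after_vacuum - {3}

fetches_after_vacuum = count_heap_fetches(entries, after_vacuum)
fetches_after_write = count_heap_fetches(entries, after_write)
print(fetches_after_vacuum, fetches_after_write)
```

In this model a single modified page is enough to force a heap visit for every index entry pointing into it, until the next vacuum sets the bit again.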
An index-only plan is a query evaluation plan where we only need to access the indexes for the data records, and not the data records themselves, in order to answer the query. Index-only plans are usually much faster than regular plans because they do not require reading the data records.
This is because an index scan requires several IO operations for each row (look up the row in the index, then retrieve the row from the heap).
There is no way to do what you want in PostgreSQL as it stands. It'd be interesting to do but a fair bit of work, very unlikely to be accepted into core, extremely hard to do with an extension, and likely to have worse side-effects than you probably expect.
You'd basically have to add a DIRTY READ isolation level to PostgreSQL, except that it'd be even weaker than that because it might also return deleted data, the old versions of updated rows, and multiple values from unique indexes. This latter issue would potentially upset the query planner quite a bit, since it's allowed to assume that results from unique indexes will be unique.
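To show what those anomalies would look like, here is a small Python toy model. It is only a sketch: HeapTuple, the visible flag (standing in for the real MVCC check), and the sample rows are all invented for illustration, and it assumes the update created a new index entry rather than being a HOT update.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeapTuple:
    tid: int
    user_id: int
    visible: bool  # stand-in for the MVCC visibility check


# Heap contents before vacuum has cleaned up dead tuples.
heap = {
    0: HeapTuple(0, 1, False),  # old version of user 1, superseded by an UPDATE
    1: HeapTuple(1, 1, True),   # current version of user 1
    2: HeapTuple(2, 2, False),  # user 2 was deleted, tuple not yet vacuumed
    3: HeapTuple(3, 3, True),   # user 3, untouched
}

# "Unique" index on user_id: (key, tid) pairs, including dead versions.
unique_index = [(1, 0), (1, 1), (2, 2), (3, 3)]


def scan_with_visibility_check(index, heap_tuples):
    """Normal behaviour: only return keys whose tuple is visible."""
    return [key for key, tid in index if heap_tuples[tid].visible]


def scan_without_heap(index):
    """Hypothetical 'skip the heap' read: trust every index entry."""
    return [key for key, _tid in index]


checked = scan_with_visibility_check(unique_index, heap)
unchecked = scan_without_heap(unique_index)
print(checked)
print(unchecked)
```

The unchecked scan returns user 1 twice and brings back the deleted user 2, which is exactly the kind of result a planner relying on uniqueness is not prepared for.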
I see the chances of such a change being accepted into PostgreSQL core as very close to zero. The possible use cases are very limited.
The only way I could see of justifying adding such a feature would be to make recovery from clog loss/corruption and accidental deletion easier by supporting raw reads. That'd make most sense for seqscans of the heap rather than index-only scans.
It would probably make more sense to solve this another way, such as a non-durable caching layer (Redis etc) on top of the DB for data you don't need perfectly fresh.
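A minimal sketch of that idea in Python, assuming an in-process dictionary in place of Redis and a fixed time-to-live as the staleness bound. TTLCache, load_from_db, and the 30-second TTL are hypothetical names and values for illustration, and the fake clock only exists so the example is deterministic.

```python
import time


class TTLCache:
    """Non-durable read-through cache: entries may be up to ttl_seconds stale."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader      # called on a miss, e.g. runs the real query
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}         # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]          # fresh enough: no database access at all
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value


# Stand-in for the real database query; records each call it receives.
calls = []


def load_from_db(key):
    calls.append(key)
    return f"row-{key}"


fake_now = [0.0]
cache = TTLCache(load_from_db, ttl_seconds=30, clock=lambda: fake_now[0])

first = cache.get(42)    # miss: hits the "database"
second = cache.get(42)   # hit: served from the cache
fake_now[0] = 31.0       # advance the fake clock past the TTL
third = cache.get(42)    # expired: reloaded from the "database"
print(calls)
```

The trade-off is explicit here: reads within the TTL never touch the database, and the TTL is the upper bound on how stale a cached answer can be.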