Postgres Index-only-scan: can we ignore the visibility map or avoid heap fetches?

Tags:

postgresql

Sorry, lots of context before the actual question: we've thoroughly researched this and I wanted to give you the full picture.

Some context: Postgres index-only scans rely on the visibility map (VM). If a heap page is not marked all-visible in the visibility map, Postgres fetches that page to check that the row is visible to the current transaction, even during an index-only scan. Unfortunately, this can greatly slow down index-only scans. The index might return 10k rows while the index itself spans only 50 pages (very fast in terms of IO). However, if the VM bits aren't set, Postgres makes an extra 10k heap fetches (about 200x more page reads).

Details: https://wiki.postgresql.org/wiki/Index-only_scans#The_Visibility_Map_.28and_other_relation_forks.29

Try it yourself: run EXPLAIN ANALYZE on an index-only query before and after a VACUUM. You can see the number of heap fetches go down after the VACUUM (assuming some pages were not marked all-visible in the VM beforehand).
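If you don't have a database handy, the effect can be sketched with a toy model in Python. This is only an illustration of the idea described above, not how Postgres is implemented: `ToyTable`, `index_only_scan`, `vacuum`, and `modify` are made-up names, and the model assumes the worst case where each of the 10k matching index entries points at a different heap page.

```python
# Toy model of an index-only scan. NOT PostgreSQL's real implementation;
# every name here is hypothetical and exists only for illustration.

class ToyTable:
    def __init__(self, n_pages):
        # One visibility-map bit per heap page: True means "all-visible".
        self.all_visible = [False] * n_pages

    def vacuum(self):
        # Assumption: no concurrent transactions, so every page can be
        # marked all-visible.
        self.all_visible = [True] * len(self.all_visible)

    def modify(self, page):
        # A write to a page clears its all-visible bit.
        self.all_visible[page] = False


def index_only_scan(table, matching_tuple_pages):
    """Count heap fetches for index entries pointing at the given heap pages."""
    heap_fetches = 0
    for page in matching_tuple_pages:
        if not table.all_visible[page]:
            # VM bit not set: the heap page must be read to check visibility.
            heap_fetches += 1
    return heap_fetches


table = ToyTable(n_pages=10_000)
matches = list(range(10_000))  # 10k index entries, one per heap page

before = index_only_scan(table, matches)
table.vacuum()
after = index_only_scan(table, matches)
table.modify(42)  # a single write after the vacuum
after_write = index_only_scan(table, matches)

print(before, after, after_write)  # 10000 0 1
```

The last number is the reason regular vacuuming helps but never fully removes heap fetches on a table that keeps receiving writes.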

Already tried: We've already tuned autovacuum, and we're vacuuming regularly. This helps a lot, but we'd like to get even faster.

Question (finally): Is it possible to skip heap fetches when doing index only scans? I'm aware we wouldn't have perfect MVCC when reading, but we're okay with that. The data in the index is close enough, and it's definitely not worth the overhead of thousands of heap fetches to make sure we're not looking at slightly stale data. To borrow a term from NoSQL, we'd be fine with "eventual consistency" reads.

Thanks!

asked Jun 16 '15 21:06

scosman

1 Answer

There is no way to do what you want in PostgreSQL as it stands. It'd be interesting to do but a fair bit of work, very unlikely to be accepted into core, extremely hard to do with an extension, and likely to have worse side-effects than you probably expect.

You'd basically have to add a DIRTY READ isolation level to PostgreSQL, except that it'd be even weaker than that because it might also return deleted data, the old versions of updated rows, and multiple values from unique indexes. This latter issue would potentially upset the query planner quite a bit, since it's allowed to assume that results from unique indexes will be unique.
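To make the unique-index problem concrete, here is a small Python sketch. It is a toy model under simplifying assumptions, not Postgres internals: the `live` flag stands in for the real MVCC visibility rules, the names are invented for the example, and it assumes the index still holds an entry for every dead row version that has not been vacuumed yet.

```python
# Toy model only: illustrative names, not PostgreSQL internals.
from dataclasses import dataclass


@dataclass
class TupleVersion:
    key: str    # value stored in the unique index
    live: bool  # stand-in for the real MVCC visibility check


heap = [
    TupleVersion("alice@example.com", live=False),  # old version left by an UPDATE
    TupleVersion("alice@example.com", live=True),   # current version
    TupleVersion("bob@example.com", live=False),    # deleted row, not yet vacuumed
]

# Assumption: one index entry per heap tuple version until dead ones are cleaned up.
index = [(t.key, pos) for pos, t in enumerate(heap)]


def scan(check_visibility):
    """Return the keys an index scan would emit, with or without the heap check."""
    return [key for key, pos in index if not check_visibility or heap[pos].live]


with_check = scan(check_visibility=True)
without_check = scan(check_visibility=False)

print(with_check)     # ['alice@example.com']
print(without_check)  # ['alice@example.com', 'alice@example.com', 'bob@example.com']
```

Without the visibility check the "unique" key shows up twice and a deleted row reappears, which is the kind of result the planner is allowed to assume can never happen.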

I see the chances of such a change being accepted into PostgreSQL core as very close to zero. The possible use cases are very limited.

The only way I could see of justifying adding such a feature would be to make recovery from clog loss/corruption and accidental deletion easier by supporting raw reads. That'd make most sense for seqscans of the heap rather than index-only scans.

It would probably make more sense to solve this another way, such as a non-durable caching layer (Redis etc) on top of the DB for data you don't need perfectly fresh.
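As a rough sketch of that idea, here is a minimal in-process cache with a time-to-live, in Python. It is only a stand-in for a real cache such as Redis: `TTLCache` and `load_from_db` are hypothetical names, the loader just fakes a database read, and the 30 second TTL is an arbitrary choice for the example.

```python
import time


class TTLCache:
    """Non-durable read-through cache: values may be up to `ttl_seconds` stale."""

    def __init__(self, ttl_seconds, loader, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.loader = loader      # called on a miss, e.g. to run the real query
        self.clock = clock        # injectable so the demo below needs no sleeping
        self._entries = {}        # key -> (expires_at, value)

    def get(self, key):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]       # fresh enough: skip the database entirely
        value = self.loader(key)
        self._entries[key] = (now + self.ttl, value)
        return value


# Demo with a fake loader and a fake clock (both hypothetical).
calls = []

def load_from_db(key):
    calls.append(key)             # record each simulated database hit
    return f"row-for-{key}"

fake_now = [0.0]
cache = TTLCache(ttl_seconds=30, loader=load_from_db, clock=lambda: fake_now[0])

first = cache.get("user:1")       # miss: goes to the "database"
second = cache.get("user:1")      # hit: served from the cache
fake_now[0] = 31.0
third = cache.get("user:1")       # expired: reloaded

print(len(calls))  # 2
```

This gives the bounded staleness you were asking for, while the database keeps its normal MVCC guarantees for every query that actually reaches it.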

answered Sep 22 '22 06:09

Craig Ringer
