Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Improving OFFSET performance in PostgreSQL

I have a table I'm doing an ORDER BY on before a LIMIT and OFFSET in order to paginate.

Adding an index on the ORDER BY column makes a massive difference to performance (when used in combination with a small LIMIT). On a 500,000 row table, I saw a 10,000x improvement adding the index, as long as there was a small LIMIT.

However, the index has no impact for high OFFSETs (i.e. later pages in my pagination). This is understandable: a b-tree index makes it easy to iterate in order from the beginning but not to find the nth item.

It seems that what would help is a counted b-tree index, but I'm not aware of support for these in PostgreSQL. Is there another solution? It seems that optimizing for large OFFSETs (especially in pagination use-cases) isn't that unusual.

Unfortunately, the PostgreSQL manual simply says "The rows skipped by an OFFSET clause still have to be computed inside the server; therefore a large OFFSET might be inefficient."

like image 361
James Tauber Avatar asked Jul 07 '11 23:07

James Tauber


People also ask

Why offset pagination is bad?

As we can see, OFFSET pagination has some drawbacks: For a high database volume, the end pages are harder to retrieve than the beginning pages, as the number of rows to load and skip is high. For a growing database, it becomes less and less efficient to reach the beginning rows over time.

What does offset do in PostgreSQL?

OFFSET says to skip that many rows before beginning to return rows. OFFSET 0 is the same as omitting the OFFSET clause, as is OFFSET with a NULL argument. If both OFFSET and LIMIT appear, then OFFSET rows are skipped before starting to count the LIMIT rows that are returned.


2 Answers

You might want a computed index.

Let's create a table:

create table sales(day date, amount real); 

And fill it with some random stuff:

insert into sales      select current_date + s.a as day, random()*100 as amount     from generate_series(1,20); 

Index it by day, nothing special here:

create index sales_by_day on sales(day); 

Create a row position function. There are other approaches, this one is the simplest:

create or replace function sales_pos (date) returns bigint     as 'select count(day) from sales where day <= $1;'     language sql immutable; 

Check if it works (don't call it like this on large datasets though):

select sales_pos(day), day, amount from sales;       sales_pos |    day     |  amount       -----------+------------+----------              1 | 2011-07-08 |  41.6135              2 | 2011-07-09 |  19.0663              3 | 2011-07-10 |  12.3715     .................. 

Now the tricky part: add another index computed on the sales_pos function values:

create index sales_by_pos on sales using btree(sales_pos(day)); 

Here is how you use it. 5 is your "offset", 10 is the "limit":

select * from sales where sales_pos(day) >= 5 and sales_pos(day) < 5+10;          day     | amount       ------------+---------      2011-07-12 | 94.3042      2011-07-13 | 12.9532      2011-07-14 | 74.7261     ............... 

It is fast, because when you call it like this, Postgres uses precalculated values from the index:

explain select * from sales    where sales_pos(day) >= 5 and sales_pos(day) < 5+10;                                      QUERY PLAN                                     --------------------------------------------------------------------------      Index Scan using sales_by_pos on sales  (cost=0.50..8.77 rows=1 width=8)        Index Cond: ((sales_pos(day) >= 5) AND (sales_pos(day) < 15)) 

Hope it helps.

like image 86
Mike Ivanov Avatar answered Sep 21 '22 17:09

Mike Ivanov


I don't know anything about "counted b-tree indexes", but one thing we've done in our application to help with this is break our queries into two, possibly using a sub-query. My apologies for wasting your time if you're already doing this.

SELECT * FROM massive_table WHERE id IN (     SELECT id     FROM massive_table     WHERE ...     LIMIT 50     OFFSET 500000 ); 

The advantage here is that, while it still has to calculate the proper ordering of everything, it doesn't order the entire row--only the id column.

like image 39
Flimzy Avatar answered Sep 23 '22 17:09

Flimzy