Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimizing window function in PostgreSQL to use index

I have a table in a PostgreSQL 9.2 DB, created and filled as follows:

CREATE TABLE foo( id integer, date date );

INSERT INTO foo
SELECT (id % 10) + 1, now() - (id % 50) * interval '1 day'
FROM generate_series(1, 100000) AS id;

Now, I need to find all pairs (id, date) such that the date is a maximum one among all pairs with the same id. The query is kind of well known and it's common to use the window function called ROW_NUMBER()

SELECT id, date
FROM (
    SELECT id, date, ROW_NUMBER() OVER (PARTITION BY id ORDER BY date DESC) rn
    FROM foo
) sbt
WHERE sbt.rn = 1;

Now, I ask for the plan of that query and figured out that the WindowAgg node requires a table to be sorted first.

Subquery Scan on sbt  (cost=11116.32..14366.32 rows=500 width=8) (actual time=71.650..127.809 rows=10 loops=1)
  Filter: (sbt.rn = 1)
  Rows Removed by Filter: 99990
    ->  WindowAgg  (cost=11116.32..13116.32 rows=100000 width=8) (actual time=71.644..122.476 rows=100000 loops=1)
          ->  Sort  (cost=11116.32..11366.32 rows=100000 width=8) (actual time=71.637..92.081 rows=100000 loops=1)
                Sort Key: foo.id, foo.date
                Sort Method: external merge  Disk: 1752kB
                ->  Seq Scan on foo  (cost=0.00..1443.00 rows=100000 width=8) (actual time=0.006..6.138 rows=100000 loops=1)

As expected, the sorting took the majority of the query execution time and it would definitely helpful to use indexes.

So I create the CREATE INDEX ON foo(id, date) and expect that now it'll be use indexes. But it doesn't. I got the same plan with the external merge and even turning off the sequential scan wasn't useful. I just ended up with Bitmap Index Scan

->  Sort  (cost=12745.58..12995.58 rows=100000 width=8) (actual time=69.247..90.003 rows=100000 loops=1)
      Sort Key: foo.id, foo.date
      Sort Method: external merge  Disk: 1752kB
      ->  Bitmap Heap Scan on foo  (cost=1629.26..3072.26 rows=100000 width=8) (actual time=5.359..12.639 rows=100000 loops=1)
            ->  Bitmap Index Scan on foo_id_date_idx  (cost=0.00..1604.26 rows=100000 width=0) (actual time=5.299..5.299 rows=100000 loops=1)

QUESTION
Can WindowAgg ever use indexes for sorting? I think that it can't but don't understand why... GroupAggreagtes can and it gives good performance improving.

like image 874
St.Antario Avatar asked Oct 14 '15 06:10

St.Antario


2 Answers

To match the index you created:

CREATE INDEX ON foo(id, date)

you would have to make that:

ROW_NUMBER() OVER (PARTITION BY id ORDER BY date DESC NULLS LAST) 

which is the perfect reverse order of ASC.

That aside, you could just run:

SELECT DISTINCT ON (id)
       id, date
FROM   foo
ORDER  BY id, date DESC NULLS LAST;

But that's probably not what you wanted to ask. Either way, I would make the index:

CREATE INDEX ON foo(id, date DESC NULLS LAST)

so that max(date) is the first index entry per id. Related:

  • PostgreSQL sort by datetime asc, null first?
  • Why do NULL values come first when ordering DESC in a PostgreSQL query?
  • Select first row in each GROUP BY group?
  • Optimize groupwise maximum query
like image 138
Erwin Brandstetter Avatar answered Oct 19 '22 22:10

Erwin Brandstetter


You can rewrite the logic "the date is a maximum one among all pairs with the same id" to a direct translation:

SELECT id, date
FROM (
    SELECT id, date, MAX(date) OVER (PARTITION BY id) as maxDate
    FROM foo
) sbt
WHERE date = maxDate;

It's not exactly the same as a ROW_NUMBER, but a RANK, it might return multiple rows with the same date

like image 32
dnoeth Avatar answered Oct 19 '22 21:10

dnoeth