In theory, in the absence of indexes, an ORDER BY with LIMIT has to scan all the data, sort it, and only then apply the LIMIT, since the top 10 rows (for example) can only be known after sorting.
But Postgres is a little smarter here, and the plans below tell the story.
Order by with limit
learning=# explain (analyze,buffers) select * from temp order by userid limit 10;
                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=81745.51..81745.54 rows=10 width=41) (actual time=2064.275..2064.278 rows=10 loops=1)
   Buffers: shared hit=13 read=18644
   ->  Sort  (cost=81745.51..86735.41 rows=1995958 width=41) (actual time=2064.273..2064.274 rows=10 loops=1)
         Sort Key: userid
         Sort Method: top-N heapsort  Memory: 25kB
         Buffers: shared hit=13 read=18644
         ->  Seq Scan on temp  (cost=0.00..38613.58 rows=1995958 width=41) (actual time=35.053..1652.660 rows=1995958 loops=1)
               Buffers: shared hit=10 read=18644
 Planning time: 0.167 ms
 Execution time: 2064.335 ms
(10 rows)
Order by without limit
learning=# explain (analyze,buffers) select * from temp order by userid;
                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Sort  (cost=308877.61..313867.51 rows=1995958 width=41) (actual time=2685.680..3293.698 rows=1995958 loops=1)
   Sort Key: userid
   Sort Method: external merge  Disk: 99504kB
   Buffers: shared hit=42 read=18612, temp read=12440 written=12440
   ->  Seq Scan on temp  (cost=0.00..38613.58 rows=1995958 width=41) (actual time=0.069..286.556 rows=1995958 loops=1)
         Buffers: shared hit=42 read=18612
 Planning time: 0.066 ms
 Execution time: 3540.545 ms
(8 rows)
My assumption from this is that Postgres uses the well-known heapsort algorithm and stops once the first N (LIMIT) rows are obtained.
From this visualization I am unable to understand how this works. Can anybody shed some light on it? Is my assumption above correct?
I haven't read the source in depth, but as far as I can tell it maintains a heap with a bounded size.
It consumes the input values in sequence. After it fills the heap up to the target number of tuples, it starts examining each new value to see whether it's larger than all current values, smaller than all current values, or fits somewhere within the heap.
If it's larger than all current values (assuming ASC sort) it gets discarded, since we have enough lower values already.
If it's smaller than some or all of the current values, it's inserted at the appropriate point in the heap, everything below gets shifted down by one, and the last entry is bumped off the heap.
See src/backend/utils/sort/tuplesort.c, around line 1618 (in git master), case TSS_BOUNDED: in puttuple_common.
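To make that concrete, here is a minimal standalone sketch of the same idea in C. This is not PostgreSQL's actual implementation (tuplesort.c works on tuples with pluggable comparators and a lot of other machinery); the names put_value and N and the toy random input are mine, purely for illustration. It keeps the N smallest integers seen so far in a bounded max-heap, so the heap root is always the current worst survivor and each incoming value needs at most one comparison before being discarded or inserted:

/* Bounded top-N heap sketch: keep the N smallest values of a stream
 * in a max-heap of size N. Illustration only, not PostgreSQL code. */
#include <stdio.h>

#define N 10

static int heap[N];
static int heap_size = 0;

static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Re-establish the max-heap property downward from index i. */
static void sift_down(int i)
{
    for (;;)
    {
        int largest = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < heap_size && heap[l] > heap[largest]) largest = l;
        if (r < heap_size && heap[r] > heap[largest]) largest = r;
        if (largest == i) break;
        swap(&heap[i], &heap[largest]);
        i = largest;
    }
}

/* Re-establish the max-heap property upward from index i. */
static void sift_up(int i)
{
    while (i > 0 && heap[(i - 1) / 2] < heap[i])
    {
        swap(&heap[(i - 1) / 2], &heap[i]);
        i = (i - 1) / 2;
    }
}

/* Offer one streamed value to the bounded heap. */
static void put_value(int v)
{
    if (heap_size < N)          /* still filling: always insert */
    {
        heap[heap_size++] = v;
        sift_up(heap_size - 1);
    }
    else if (v < heap[0])       /* beats the worst survivor: replace root */
    {
        heap[0] = v;
        sift_down(0);
    }
    /* else: v >= all N kept values, discard it (the common case) */
}

int main(void)
{
    /* Stream two million pseudo-random values; memory stays O(N). */
    unsigned int x = 123456789;
    for (int i = 0; i < 2000000; i++)
    {
        x = x * 1103515245u + 12345u;   /* toy LCG, illustration only */
        put_value((int)(x % 1000000u));
    }
    for (int i = 0; i < heap_size; i++)
        printf("%d\n", heap[i]);        /* the 10 smallest, in heap order */
    return 0;
}

The key point is the discard branch: once the heap is full, most rows cost a single comparison and no extra memory, which is why the LIMIT 10 plan above needs only 25kB while the unlimited sort spills 99504kB to disk.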
PostgreSQL uses a "top-N heapsort" for this. You can see it in effect when doing EXPLAIN ANALYZE on a query like this. Without an index you cannot avoid a full table scan, but the top-N heapsort avoids allocating memory for ALL the rows, since you only care about the first 10.
For example, I had a 6-million-row table sitting around and asked for the top 10 rows by an unindexed column. With LIMIT 10 applied it told me:
Sort Method: top-N heapsort Memory: 25kB
Without the LIMIT:
Sort Method: quicksort Memory: (some value)kB
So if Postgres had to keep all 6 million rows in memory for sorting, it would need that many MB of work memory. If work_mem (see SHOW work_mem) were below that, it would write a lot of temporary files to disk instead:
Sort Method: external merge Disk: 99504kB
In order to get the in-memory quicksort instead, you have to increase the work_mem parameter.
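For example (the values below are illustrative; the default work_mem and a sensible setting depend on your PostgreSQL version, workload, and available RAM):

learning=# SHOW work_mem;
 work_mem
----------
 4MB
(1 row)

learning=# SET work_mem = '200MB';
SET

Re-running the unlimited ORDER BY query after that should report Sort Method: quicksort Memory: ...kB instead of the external merge. Note that SET only changes the current session; put the value in postgresql.conf (or use ALTER SYSTEM) to make it permanent. Also keep in mind that work_mem is allocated per sort operation, so setting it very high globally can exhaust memory under concurrent load.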