Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

OFFSET vs. ROW_NUMBER()

Tags:

postgresql

As we know, Postgresql's OFFSET requires that it scan through all the rows up until the point it gets to where you requested, which makes it kind of useless for pagination through huge result sets, getting slower and slower as the OFFSET goes up.

PG 8.4 now supports window functions. Instead of:

SELECT * FROM table ORDER BY somecol LIMIT 10 OFFSET 500 

You can say:

SELECT * FROM (SELECT *, ROW_NUMBER() OVER (ORDER BY somecol ASC) AS rownum FROM table) AS foo WHERE rownum > 500 AND rownum <= 510 

Does the latter approach help us at all ? Or do we have to keep using identifying columns and temp tables for large pagination ?

like image 373
zzzeek Avatar asked Jun 26 '10 21:06

zzzeek


People also ask

Is ROW_NUMBER faster than group by?

According to the optimizer, in my case the ROW_NUMBER is about 60% more efficient according to the subtree cost. And according to statistics IO, about 20% less CPU time. However, in real elapsed time, the ROW_NUMBER solution takes about 80% more real time. So the GROUP BY wins in my case.

What does ROW_NUMBER () mean in SQL?

ROW_NUMBER function is a SQL ranking function that assigns a sequential rank number to each new record in a partition. When the SQL Server ROW NUMBER function detects two identical values in the same partition, it assigns different rank numbers to both.

What does ROW_NUMBER () over do?

ROW_NUMBER() Function The Row_Number function is used to provide consecutive numbering of the rows in the result by the order selected in the OVER clause for each partition specified in the OVER clause. It will assign the value 1 for the first row and increase the number of the subsequent rows.

What does ROW_NUMBER () over partition by do?

The row rank is calculated using the function ROW_NUMBER(). Let's first use this function and view the row ranks. The ROW_NUMBER() function uses the OVER and PARTITION BY clause and sorts results in ascending or descending order. It starts ranking rows from 1 per the sorting order.


2 Answers

I've constructed a test which compares OFFSET, cursors, and ROW_NUMBER(). My impression of ROW_NUMBER(), that it would be consistent in speed regardless of where you are in the result set, is correct. However, that speed is dramatically slower than either OFFSET or CURSOR, which, as was also my impression, are pretty much the same in speed, both degrading in speed the further out to the end of the result you go.

Results:

offset(100,100): 0.016359 scroll(100,100): 0.018393 rownum(100,100): 15.535614  offset(100,480000): 1.761800 scroll(100,480000): 1.781913 rownum(100,480000): 15.158601  offset(100,999900): 3.670898 scroll(100,999900): 3.664517 rownum(100,999900): 14.581068 

The test script uses sqlalchemy to set up tables and 1000000 rows of test data. It then uses a psycopg2 cursor to execute each SELECT statement and fetch results with the three different methods.

from sqlalchemy import *  metadata = MetaData() engine = create_engine('postgresql://scott:tiger@localhost/test', echo=True)  t1 = Table('t1', metadata,     Column('id', Integer, primary_key=True),     Column('d1', String(50)),     Column('d2', String(50)),     Column('d3', String(50)),     Column('d4', String(50)),     Column('d5', String(50)) )  if not engine.has_table('t1'):     conn = engine.connect()     t1.create(conn)      # 1000000 rows     for i in range(100):         conn.execute(t1.insert(), [             dict(                 ('d%d' % col, "data data data %d %d" % (col, (i * 10000) + j))                 for col in range(1, 6)             ) for j in xrange(1, 10001)         ])  import time  def timeit(fn, count, *args):     now = time.time()     for i in xrange(count):         fn(*args)     total = time.time() - now     print "%s(%s): %f" % (fn.__name__, ",".join(repr(x) for x in args), total)  # this is a raw psycopg2 connection. conn = engine.raw_connection()  def offset(limit, offset):     cursor = conn.cursor()     cursor.execute("select * from t1 order by id limit %d offset %d" % (limit, offset))     cursor.fetchall()     cursor.close()  def rownum(limit, offset):     cursor = conn.cursor()     cursor.execute("select * from (select *, "                     "row_number() over (order by id asc) as rownum from t1) as foo "                     "where rownum>=%d and rownum<%d" % (offset, limit + offset))     cursor.fetchall()     cursor.close()  def scroll(limit, offset):     cursor = conn.cursor('foo')     cursor.execute("select * from t1 order by id")     cursor.scroll(offset)     cursor.fetchmany(limit)     cursor.close()  print   timeit(offset, 10, 100, 100) timeit(scroll, 10, 100, 100) timeit(rownum, 10, 100, 100)  print   timeit(offset, 10, 100, 480000) timeit(scroll, 10, 100, 480000) timeit(rownum, 10, 100, 480000)  print   timeit(offset, 10, 100, 999900) timeit(scroll, 10, 100, 999900) timeit(rownum, 10, 100, 999900) 
like image 66
zzzeek Avatar answered Sep 22 '22 05:09

zzzeek


Use a CURSOR for a large resultset, will be much faster. For small result sets the LIMIT OFFSET construction works fine, but it has it's limits.

ROW_NUMBER is a nice thing, but not for pagination. You end up with bad performance because of sequential scan's.

like image 41
Frank Heikens Avatar answered Sep 20 '22 05:09

Frank Heikens