 

postgresql: OFFSET + LIMIT becomes very slow

I have a table tmp_drop_ids with one column, id, and 3.3 million entries. I want to iterate over the table, processing each batch of 200 entries. I have this code:

LIMIT = 200
# drop_count is the total number of rows in tmp_drop_ids (~3.3 million)
for offset in xrange(0, drop_count + LIMIT, LIMIT):
    print "Making tmp table with ids %s to %s/%s" % (offset, offset + LIMIT, drop_count)
    query = """DROP TABLE IF EXISTS tmp_cur_drop_ids;
    CREATE TABLE tmp_cur_drop_ids AS
    SELECT id FROM tmp_drop_ids ORDER BY id OFFSET %s LIMIT %s;""" % (offset, LIMIT)
    cursor.execute(query)

This runs fine at first (~0.15 s to generate the tmp table), but it slows down intermittently; for example, around the 300k mark it started taking 11-12 seconds to generate the table, and again around 400k. The timing is basically unpredictable.

I will use those ids in other queries so I figured the best place to have them was in a tmp table. Is there any better way to iterate through results like this?

Claudiu asked Nov 01 '11


2 Answers

Use a cursor instead. Using OFFSET and LIMIT is pretty expensive, because PostgreSQL has to execute the query, then process and skip the first OFFSET rows on every call. OFFSET means "skip rows", and skipping more and more rows on each iteration gets progressively slower.

See the PostgreSQL cursor documentation.

A cursor allows iteration over the result of a single query:

BEGIN;
DECLARE C CURSOR FOR SELECT * FROM big_table;
FETCH 300 FROM C; -- get the first 300 rows
FETCH 300 FROM C; -- get the next 300 rows
...
COMMIT;

You can probably use a server-side cursor without an explicit DECLARE statement, since psycopg supports them directly (see its documentation section about server-side cursors).
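A minimal sketch of that approach with psycopg2; the connection string and the process() batch handler are assumptions, not part of the original question:

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # assumed connection string
cur = conn.cursor(name="drop_ids_cur")  # giving it a name makes it server-side
cur.execute("SELECT id FROM tmp_drop_ids ORDER BY id")

while True:
    rows = cur.fetchmany(200)  # next 200 rows; no OFFSET re-scan on the server
    if not rows:
        break
    process([r[0] for r in rows])  # hypothetical batch handler

conn.commit()

Because the cursor is server-side, each fetchmany() continues from where the previous one stopped, so every batch costs the same regardless of how deep into the table you are.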

Pavel Stehule answered Oct 16 '22


If your ids are indexed, you can combine LIMIT with a WHERE id > ... condition instead of OFFSET, for example in Python:

limit = 200
max_processed_id = -1
cursor.execute("CREATE TABLE tmp_cur_drop_ids (id int)")
while True:
    cursor.execute("TRUNCATE tmp_cur_drop_ids")
    # grab the next batch of ids strictly above the last one processed
    cursor.execute("INSERT INTO tmp_cur_drop_ids (id)"
                   " SELECT id FROM tmp_drop_ids"
                   " WHERE id > %s ORDER BY id LIMIT %s",
                   (max_processed_id, limit))
    cursor.execute("SELECT max(id) FROM tmp_cur_drop_ids")
    max_processed_id = cursor.fetchone()[0]
    if max_processed_id is None:
        break  # tmp_drop_ids is exhausted
    process_tmp_cur_drop_ids()
cursor.execute("DROP TABLE tmp_cur_drop_ids")

This way Postgres can use the index on id for each query, so every batch is equally fast no matter how far into the table you are.
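A quick way to confirm the index is actually being used (a sketch; the exact EXPLAIN output varies by Postgres version):

# Sketch: verify the keyset query plans as an index scan rather than a sort.
cursor.execute("EXPLAIN SELECT id FROM tmp_drop_ids"
               " WHERE id > %s ORDER BY id LIMIT %s", (0, 200))
for (line,) in cursor.fetchall():
    print line  # expect an "Index Scan using ... on tmp_drop_ids" node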

Tometzky answered Oct 16 '22