Flask-SqlAlchemy query with order_by and pagination is very slow

I'm using Python 2.7.3, PostgreSQL 9.3, Flask 0.10.0, SQLAlchemy 1.0.12, Flask-SQLAlchemy 2.1 and psycopg2 2.5.4.

I have a table with over a million records (about 1.16 million, per the EXPLAIN output below) in PostgreSQL, and I have to paginate over it and provide search on it:

from flask_sqlalchemy import SQLAlchemy

db = SQLAlchemy()  # bound to the Flask app elsewhere

class MyTable(db.Model):
    """ My Table """
    __tablename__ = 'my_table'

    id = db.Column(db.Integer, primary_key=True)
    code = db.Column(db.String(100), index=True, unique=True)
    name = db.Column(db.String(512), index=True)

So I did the following in the code where I request the data:

from sqlalchemy import asc

records = MyTable.query.filter(*filter_list).\
    order_by(asc('code')).paginate(page, per_page, False)

The point is that it takes about 13 seconds just to select the 10 records of the last page when per_page=10 and page=115897 (which gives OFFSET 1158960), even when there's no filtering at all.

From what I found in the Flask-SQLAlchemy source, .paginate() does it like this:

.order_by(asc('code')).limit(per_page).offset((page - 1) * per_page)
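
In plain SQLAlchemy terms, the paginate() call above is therefore roughly equivalent to this explicit query (a sketch, reusing page, per_page and filter_list from above):

records = (MyTable.query
           .filter(*filter_list)
           .order_by(asc('code'))
           .limit(per_page)                # 10
           .offset((page - 1) * per_page)  # 1158960 for the last page
           .all())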

The resulting SQL query looks like this:

SELECT my_table.id, my_table.code, my_table.name 
FROM my_table ORDER BY my_table.code ASC 
LIMIT 10 OFFSET 1158960

When I ran it in a server console, I realized that the problem is the ORDER BY clause: PostgreSQL has to sort the whole table first, and only then apply LIMIT and OFFSET. And that's slow as hell.

EXPLAIN (ANALYZE):

"Limit  (cost=470520.26..470520.26 rows=1 width=178) (actual time=12460.060..12460.061 rows=8 loops=1)"
"  ->  Sort  (cost=467626.96..470520.26 rows=1157320 width=178) (actual time=11343.220..12424.686 rows=1158968 loops=1)"
"        Sort Key: code"
"        Sort Method: external merge  Disk: 218312kB"
"        ->  Seq Scan on my_table  (cost=0.00..42518.20 rows=1157320 width=178) (actual time=0.026..378.637 rows=1158968 loops=1)"
"Total runtime: 12475.160 ms"

If you just remove the ORDER BY from that SQL query, it completes in 270 ms!

"Limit  (cost=42518.20..42518.20 rows=1 width=178) (actual time=269.940..269.942 rows=8 loops=1)"
"  ->  Seq Scan on my_table  (cost=0.00..42518.20 rows=1157320 width=178) (actual time=0.030..246.200 rows=1158968 loops=1)"
"Total runtime: 269.992 ms"

Is there anything I can do about it?

asked Apr 08 '16 by Tosh

1 Answer

OK, I found a way to solve it.

When I run exactly the same query with SET enable_seqscan=off; it forces PostgreSQL to use an Index Scan instead of a Seq Scan, and it gets much faster!

SET enable_seqscan=off;
SELECT my_table.id, my_table.code, my_table.name 
FROM my_table ORDER BY my_table.code ASC 
LIMIT 10 OFFSET 1158960

EXPLAIN (ANALYZE):

"Limit  (cost=1814764.86..1814777.39 rows=8 width=131) (actual time=616.543..616.545 rows=8 loops=1)"
"  ->  Index Scan using ix_my_table_code on my_table  (cost=0.43..1814777.39 rows=1158968 width=131) (actual time=0.065..590.898 rows=1158968 loops=1)"
"Total runtime: 616.568 ms"

So now the question is: how do I configure PostgreSQL so that it uses an Index Scan without forcing it? I guess the answer is "planner cost constants". Any advice about them?

UPDATE 13.04.2016:

I finally figured out what was going on and found a solution. In my case, everything was solved by setting the planner cost constants like this in postgresql.conf:

seq_page_cost = 1.0
random_page_cost = 1.0
cpu_tuple_cost = 0.01
cpu_index_tuple_cost = 0.0001
cpu_operator_cost = 0.0025
effective_cache_size = 1024MB
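
If you want to experiment before touching postgresql.conf, these are ordinary user-settable parameters, so you can try them per session and compare the EXPLAIN (ANALYZE) plans first. A sketch, using the same db session as above:

from sqlalchemy import text

# Try the planner cost constants for this session only and re-check
# the query plan before committing the values to postgresql.conf.
for stmt in ("SET seq_page_cost = 1.0",
             "SET random_page_cost = 1.0",
             "SET cpu_index_tuple_cost = 0.0001",
             "SET effective_cache_size = '1024MB'"):
    db.session.execute(text(stmt))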

Also, a common recommendation is to set effective_cache_size to about 3/4 of the server's total RAM. Anyway, with these settings the planner always uses an Index Scan on huge tables, so the timing is now 200-300 ms.

Problem Solved.

answered by Tosh