I use Python=2.7.3, PostgreSQL=9.3, Flask=0.10.0, SQLAlchemy=1.0.12, Flask-SQLAlchemy=2.1 and psycopg2=2.5.4.
I have a large table in PostgreSQL (about 1.2 million rows, as the plans below show), and I have to paginate over it and provide search on it:
class MyTable(db.Model):
    """ My Table """
    id = db.Column(db.Integer, primary_key=True)
    code = db.Column(db.String(100), index=True, unique=True)
    name = db.Column(db.String(512), index=True)
    __tablename__ = 'my_table'
So I did the following in the code where I request the data (filter_list is a list of SQLAlchemy filter criteria built from the search fields):
from sqlalchemy import asc

records = MyTable.query.filter(*filter_list).\
    order_by(asc('code')).paginate(page, per_page, False)
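For reference, a hypothetical way to build that filter_list from optional search parameters (search_code and search_name are made-up names here, not from my actual code) could look like this:

# Hypothetical sketch: collect search criteria into filter_list.
filter_list = []
if search_code:
    filter_list.append(MyTable.code.like(u'%{0}%'.format(search_code)))
if search_name:
    filter_list.append(MyTable.name.like(u'%{0}%'.format(search_name)))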
The point is that it takes about 13 seconds just to select the last 10 records for the last page when per_page=10 and page=115897 (i.e. OFFSET 1158960), even when there is no filtering at all.
From what I found in the Flask-SQLAlchemy source, .paginate does it like this:
.order_by(asc('code')).limit(per_page).offset((page - 1) * per_page)
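Spelled out in full, the items for a page come from roughly this query (a sketch; besides it, paginate also issues a separate count query for the total unless the first page already holds fewer than per_page rows):

# Sketch of the item query that paginate builds internally.
items = MyTable.query.filter(*filter_list).\
    order_by(asc('code')).\
    limit(per_page).offset((page - 1) * per_page).all()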
The resulting SQL query looks like this:
SELECT my_table.id, my_table.code, my_table.name
FROM my_table ORDER BY my_table.code ASC
LIMIT 10 OFFSET 1158960
When I ran it in a server console, I realized that the problem is the ORDER BY clause: PostgreSQL has to sort the whole table first and only then apply LIMIT and OFFSET, which is painfully slow.
EXPLAIN (ANALYZE):
"Limit (cost=470520.26..470520.26 rows=1 width=178) (actual time=12460.060..12460.061 rows=8 loops=1)"
" -> Sort (cost=467626.96..470520.26 rows=1157320 width=178) (actual time=11343.220..12424.686 rows=1158968 loops=1)"
" Sort Key: code"
" Sort Method: external merge Disk: 218312kB"
" -> Seq Scan on my_table (cost=0.00..42518.20 rows=1157320 width=178) (actual time=0.026..378.637 rows=1158968 loops=1)"
"Total runtime: 12475.160 ms"
If you just remove ORDER BY from that SQL query, it completes in 270 ms!
"Limit (cost=42518.20..42518.20 rows=1 width=178) (actual time=269.940..269.942 rows=8 loops=1)"
" -> Seq Scan on my_table (cost=0.00..42518.20 rows=1157320 width=178) (actual time=0.030..246.200 rows=1158968 loops=1)"
"Total runtime: 269.992 ms"
Is there anything I can do about it?
OK, I found a way to solve this.
When I run exactly the same query with SET enable_seqscan=off;, PostgreSQL is forced to use an Index Scan instead of a Seq Scan, and it gets much faster!
SET enable_seqscan=off;
SELECT my_table.id, my_table.code, my_table.name
FROM my_table ORDER BY my_table.code ASC
LIMIT 10 OFFSET 1158960
EXPLAIN (ANALYZE):
"Limit (cost=1814764.86..1814777.39 rows=8 width=131) (actual time=616.543..616.545 rows=8 loops=1)"
" -> Index Scan using ix_my_table_code on my_table (cost=0.43..1814777.39 rows=1158968 width=131) (actual time=0.065..590.898 rows=1158968 loops=1)"
"Total runtime: 616.568 ms"
So now the question is: how do I configure PostgreSQL to make it use an Index Scan without forcing it? I guess the answer is "Planner Cost Constants". Any advice about them?
UPDATE 13.04.2016:
I finally figured out the situation and found a solution. In my case, everything was solved by setting the Planner Cost Constants like this in postgresql.conf:
seq_page_cost = 1.0
random_page_cost = 1.0
cpu_tuple_cost = 0.01
cpu_index_tuple_cost = 0.0001
cpu_operator_cost = 0.0025
effective_cache_size = 1024MB
Also, there is a lot of advice to set effective_cache_size to about 3/4 of the server's total RAM. Anyway, with these settings the planner always uses an Index Scan on huge tables, so the timing is now 200-300 ms.
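To double-check that the new values are actually active after editing postgresql.conf and reloading, a quick sketch that reads them back through the same session:

# Sketch: read back the planner settings to confirm the reload took effect.
for name in ('seq_page_cost', 'random_page_cost', 'cpu_tuple_cost',
             'cpu_index_tuple_cost', 'cpu_operator_cost',
             'effective_cache_size'):
    value = db.session.execute("SHOW {0}".format(name)).scalar()
    print("{0} = {1}".format(name, value))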
Problem Solved.