We have a performance issue with a specific SQL query and we're trying to figure out how we could improve it. Its execution time is about 20–100 seconds!
Here is the query and its EXPLAIN ANALYZE output:
SELECT "jobs".* FROM "jobs"
WHERE "jobs"."status" IN (1, 2, 3, 4)
ORDER BY "jobs"."due_date" ASC
LIMIT 5;
Limit (cost=0.42..1844.98 rows=5 width=2642) (actual time=16927.150..18151.643 rows=1 loops=1)
-> Index Scan using index_jobs_on_due_date on jobs (cost=0.42..1278647.41 rows=3466 width=2642) (actual time=16927.148..18151.641 rows=1 loops=1)
Filter: (status = ANY ('{1,2,3,4}'::integer[]))
Rows Removed by Filter: 595627
Planning time: 0.205 ms
Execution time: 18151.684 ms
We are using PostgreSQL 9.6.11 on AWS RDS.
The table has ~500K rows. The fields related to the query are:
We have the following indexes:
CREATE INDEX index_jobs_on_due_date ON public.jobs USING btree (due_date)
CREATE INDEX index_jobs_on_due_date_and_status ON public.jobs USING btree (due_date, status)
CREATE INDEX index_jobs_on_status ON public.jobs USING btree (status)
CREATE UNIQUE INDEX jobs_pkey ON public.jobs USING btree (id)
Thank you in advance, - Jack
For this query:
SELECT j.*
FROM "jobs" j
WHERE j."status" IN (1, 2, 3, 4)
ORDER BY j."due_date" ASC
LIMIT 5;
The "obvious" index is on (status). But that may not help. The goal is to get rid of the sorting, so you can rewrite the query to use an index on jobs(status, due_date):
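Note that your existing index_jobs_on_due_date_and_status has the columns in the opposite order (due_date, status), which does not serve this access pattern. A minimal sketch of the suggested index (the name is just an example):

```sql
-- Equality column first, sort column second, so each per-status
-- subquery can read its top 5 rows straight off the index.
-- Index name is illustrative; pick one that fits your conventions.
CREATE INDEX index_jobs_on_status_and_due_date
    ON public.jobs USING btree (status, due_date);
```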
select j.*
from ((select j.*
from jobs j
where j.status = 1
order by j.due_date asc
limit 5
) union all
(select j.*
from jobs j
where j.status = 2
order by j.due_date asc
limit 5
) union all
(select j.*
from jobs j
where j.status = 3
order by j.due_date asc
limit 5
) union all
(select j.*
from jobs j
where j.status = 4
order by j.due_date asc
limit 5
)
) j
order by due_date
limit 5;
The subqueries should each use the composite index. The final sort is then over (at most) 20 rows, which should be fast.
EDIT:
Here is a related idea, with the same index:
SELECT j.*
FROM (SELECT j.*,
ROW_NUMBER() OVER (PARTITION BY j.status ORDER BY j.due_date ASC) as seqnum
FROM "jobs" j
) j
WHERE j.status in (1, 2, 3, 4) AND seqnum <= 5
ORDER BY j.due_date ASC
LIMIT 5;
This can use the index for the ROW_NUMBER() calculation, although it might still require a full scan of the table. But the final sort is limited to (at most) 20 rows, so it is cheap.
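A possible variation: filtering on status inside the subquery, before ROW_NUMBER() is computed, returns the same rows (the partitions are per status, so the row numbers within each kept status are unchanged) and may let the planner avoid numbering rows it will discard anyway:

```sql
SELECT j.*
FROM (SELECT j.*,
             ROW_NUMBER() OVER (PARTITION BY j.status ORDER BY j.due_date ASC) AS seqnum
      FROM "jobs" j
      WHERE j.status IN (1, 2, 3, 4)   -- filter before numbering
     ) j
WHERE seqnum <= 5
ORDER BY j.due_date ASC
LIMIT 5;
```

Whether this actually changes the plan depends on your data distribution, so check it with EXPLAIN ANALYZE.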