Query with ORDER BY is 13 times as slow when I add LIMIT 1

I have this query (in postgresql):

SELECT "table_1".* FROM "table_1"
INNER JOIN "join_table"
  ON "table_1"."id" = "join_table"."table_1_id"
WHERE "join_table"."table_2_id" = 650727
ORDER BY table_1.created_at DESC
LIMIT 1

It returns 1 result, but takes ~250-300 ms to execute.

There are btree indexes on table_1.created_at, as well as join_table.table_1_id and join_table.table_2_id
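The index definitions were presumably created along these lines (the names are illustrative, not the actual DDL from the question):

```sql
-- Assumed btree index definitions (names are illustrative)
CREATE INDEX index_table_1_on_created_at   ON table_1 (created_at);
CREATE INDEX index_join_table_on_table_1_id ON join_table (table_1_id);
CREATE INDEX index_join_table_on_table_2_id ON join_table (table_2_id);
```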

When I ONLY remove the LIMIT 1 from the query, the execution time drops down to ~13ms. This specific query currently only returns one result (without the LIMIT), but there are others with a different value in the WHERE that may return more (which is why the LIMIT is necessary).

Why is adding a LIMIT to a query that is already only returning a single result bloating the execution time so much?

Here is the explain plan with the LIMIT 1 (these are always hard for me to fully understand...): http://explain.depesz.com/s/rOy

And here is the explain plan without LIMIT 1: http://explain.depesz.com/s/q3d7

Additionally, if I keep the LIMIT 1, but change the order by to ASC, the query goes down to 13 ms also. And if I change the LIMIT to LIMIT 20 (But keep the ORDER BY DESC) it only takes 22ms... wtf!?

So it has something to do with the combination of ORDER BY DESC, and LIMIT 1 (Exactly 1)

asked May 19 '15 by nzifnab

1 Answer

Ok, this is a pretty classic case.

Whenever you use LIMIT (or an equivalent such as FETCH FIRST ... ROWS ONLY), the optimizer tries to make fetching only the first row(s) as fast as possible. That means the optimizer prefers execution plans where the first cost value is low, not the second one shown in the execution plan. Remember: the two cost values shown by PostgreSQL (e.g., cost=48.150..6,416.240) are the startup cost (48.150) and the total execution cost (6,416.240).
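You can see both numbers by running EXPLAIN on the query; the cost figures below are placeholders, not the actual values from your plan:

```sql
EXPLAIN
SELECT "table_1".* FROM "table_1"
INNER JOIN "join_table"
  ON "table_1"."id" = "join_table"."table_1_id"
WHERE "join_table"."table_2_id" = 650727
ORDER BY table_1.created_at DESC
LIMIT 1;
-- Limit  (cost=0.86..117.25 rows=1 ...)
--              ^startup     ^total cost
```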

The "problem" here is that you have an index which supports your ORDER BY clause. So, PostgreSQL thinks that it can just walk through this index (in reverse order, due to the DESC modifier in your query) and check for each row whether it satisfies the WHERE clause against the other table or not. The trouble is that the optimizer has no way of knowing whether the matching row will be one of the very first rows or rather one near the end (according to the ORDER BY). The optimizer makes an arbitrary guess and assumes the matching row will be closer to the beginning than the end. This optimistic estimate is then used to calculate the cost value, which turns out to be too optimistic, so that PostgreSQL finally settles on a bad execution plan.

When you change the ORDER BY ... DESC to ORDER BY ... ASC, the optimizer makes the same arbitrary but optimistic estimate, which happens to be closer to reality in that case; hence you get a better execution time.

However, from an optimization perspective, the root cause is that the optimizer estimates that 2,491 rows will match the WHERE clause tango = 650727 (the column name as it appears in the anonymized plan). If the optimizer correctly estimated that this hits only a few rows, the problem would likely not occur.

The WHERE clause is sufficiently trivial that a good estimate should be no problem. So, the main question is: how about your statistics on that table?
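One way to check, assuming the anonymized plan's tango column corresponds to join_table.table_2_id, is to look at what the planner currently believes about that column:

```sql
-- Inspect the planner's statistics for the filtered column
SELECT n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'join_table'
  AND attname   = 'table_2_id';
```

If 650727 does not show up in most_common_vals, the planner falls back to a generic selectivity estimate, which can be badly off for skewed data.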

There are several ways to cope with this problem:

  • Update your statistics (ANALYZE) and see if that alone helps.
  • Increase the number of most common values stored for that column (ALTER TABLE ... SET STATISTICS). This also increases the sample size used to gather the statistics which means ANALYZE takes longer but yields more accurate results.
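The two steps above, sketched against the tables from the question (the target of 1000 is just an example; the default is typically 100):

```sql
-- Refresh the statistics and see if that alone helps
ANALYZE join_table;

-- Store more most-common values (and use a larger sample) for the filtered column
ALTER TABLE join_table ALTER COLUMN table_2_id SET STATISTICS 1000;
ANALYZE join_table;  -- re-gather statistics with the larger sample
```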

In theory, this should be enough to fix that problem. However, other options are:

  • If you don't need the index on created_at for other reasons (like other queries), get rid of it.
  • Re-write the query so that the bad execution plan is no longer an option. In particular, it would be great if you could write the query so that the ORDER BY clause uses the same table as the WHERE clause: if you are lucky, you might have a column in join_table that has the same order as table_1.created_at, so that it does not matter which one you order by. However, be careful, as this is easy to get wrong (e.g., sequential numbers filled from sequences might have outliers).
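For instance, if join_table had an id column assigned in the same order as table_1.created_at (a hypothetical assumption; verify that it actually holds for your data before relying on it), the query could be rewritten as:

```sql
-- Hypothetical rewrite: order by a join_table column instead of table_1.created_at,
-- so the WHERE and ORDER BY use the same table
SELECT "table_1".*
FROM "table_1"
INNER JOIN "join_table"
  ON "table_1"."id" = "join_table"."table_1_id"
WHERE "join_table"."table_2_id" = 650727
ORDER BY "join_table"."id" DESC  -- assumes join_table.id follows created_at order
LIMIT 1;
```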
answered Oct 23 '22 by Markus Winand