Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Postgres consistently favoring nested loop join over merge join

I have a complex query:

SELECT DISTINCT ON (delivery.id)
delivery.id, dl_processing.pid
FROM mailer.mailer_message_recipient_rel AS delivery
    JOIN mailer.mailer_message AS message ON delivery.message_id = message.id
    JOIN mailer.mailer_message_recipient_rel_log AS dl_processing ON dl_processing.rel_id = delivery.id AND dl_processing.status = 1000
    -- LEFT JOIN mailer.mailer_recipient AS r ON delivery.email = r.email
    JOIN mailer.mailer_mailing AS mailing ON message.mailing_id = mailing.id
WHERE
    NOT EXISTS (SELECT dl_finished.id FROM mailer.mailer_message_recipient_rel_log AS dl_finished WHERE dl_finished.rel_id = delivery.id AND dl_finished.status <> 1000) AND
    dl_processing.date <= NOW() - (36000 * INTERVAL '1 second') AND
    NOT EXISTS (SELECT ml.id FROM mailer.mailer_message_log AS ml WHERE ml.message_id = message.id) AND
    -- (r.times_bounced < 5 OR r.times_bounced IS NULL) AND
    NOT EXISTS (SELECT ur.id FROM mailer.mailer_unsubscribed_recipient AS ur WHERE ur.email = delivery.email AND ur.list_id = mailing.list_id)
ORDER BY delivery.id, dl_processing.id DESC
LIMIT 1000;

It is running very slowly, and the reason seems to be that Postgres is consistently avoiding using merge joins in its query plan despite me having all the indices that I would need for this. It looks really depressing:

http://explain.depesz.com/s/tVY

Query plan

http://i.stack.imgur.com/Myw4R.png

Why would this happen? How do I troubleshoot such an issue?


UPD: with @wildplasser's help I have reworked the query to fix performance (while changing its semantics somewhat):

SELECT delivery.id, dl_processing.pid
FROM mailer.mailer_message_recipient_rel AS delivery
    JOIN mailer.mailer_message AS message ON delivery.message_id = message.id
    JOIN mailer.mailer_message_recipient_rel_log AS dl_processing ON dl_processing.rel_id = delivery.id AND dl_processing.status in (1000, 2, 5) AND dl_processing.date <= NOW() - (36000 * INTERVAL '1 second')
    LEFT JOIN mailer.mailer_recipient AS r ON delivery.email = r.email
WHERE
    (r.times_bounced < 5 OR r.times_bounced IS NULL) AND
    NOT EXISTS (SELECT dl_other.id FROM mailer.mailer_message_recipient_rel_log AS dl_other WHERE dl_other.rel_id = delivery.id AND dl_other.id > dl_processing.id) AND
    NOT EXISTS (SELECT ml.id FROM mailer.mailer_message_log AS ml WHERE ml.message_id = message.id) AND
    NOT EXISTS (SELECT ur.id FROM mailer.mailer_unsubscribed_recipient AS ur JOIN mailer.mailer_mailing AS mailing ON message.mailing_id = mailing.id WHERE ur.email = delivery.email AND ur.list_id = mailing.list_id)
ORDER BY delivery.id
LIMIT 1000

It now runs well, but the query plan still sports these horrible nested loop joins <_<:

http://explain.depesz.com/s/MTo3

I would still like to know why that is.

like image 999
Alexei Averchenko Avatar asked Jun 04 '14 09:06

Alexei Averchenko


1 Answers

The reason is that Postgres is actually doing the right thing, and I suck at math. Suppose table A has N rows, and table B has M rows, and they are being joined via a column that they both have a B-tree index for. Then the following is true:

  • Nested loop join's time complexity is not O(MN), like I naively thought, but O(M log N) or O(N log M), depending on which table is scanned linearly. If both are scanned by an index, we get O(M log M log N) or O(N log M log N), respectively. But since this is only required if a specific order of the rows is needed for yet another join or due to the ORDER clause, as we'll see it's not a bad deal at all.
  • Merge join's time complexity is O(M log M + N log N), which means that it loses to the nested loop join, provided that the asymptotic proportionality coefficients are the same, and AFAIK they should both be equal to 1 in most implementations. Since both tables must be iterated by the same index in the same direction, if different order is required, an additional sort is required, which easily makes the complexity worse than in the case of the nested loop sort.

So basically despite being associated with the merge sort, which we all love, merge join almost always sucks.

The reason why my first query was so slow was because it had to perform sort before applying limit, and it was also bad in many other ways. After applying @wildplasser's suggestions, I managed to reduce the number of (still expensive) nested loops and also allow for limit to be taken without a sort, thus ensuring that Postgres most likely won't need to run the outer scan to its completition, which is where I derive the bulk of performance gains from.

like image 100
Alexei Averchenko Avatar answered Oct 03 '22 09:10

Alexei Averchenko