Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Order of columns in INNER JOIN condition affects the performance badly

I have two tables which links to each other like this:

Table answered_questions with the following columns and indexes:

  • id: primary key
  • taken_test_id: integer (foreign key)
  • question_id: integer (foreign key, links to another table called questions)
  • indexes: (taken_test_id, question_id)

Table taken_tests

  • id: primary key
  • user_id: (foreign key, links to table Users)
  • indexes: user_id column

First query (with EXPLAIN ANALYZE output):

EXPLAIN ANALYZE 
SELECT 
  "answered_questions".* 
FROM 
  "answered_questions" 
  INNER JOIN "taken_tests" ON "answered_questions"."taken_test_id" = "taken_tests"."id" 
WHERE 
  "taken_tests"."user_id" = 1;

Output:

Nested Loop  (cost=0.99..116504.61 rows=1472 width=61) (actual time=0.025..2.208 rows=653 loops=1)
   ->  Index Scan using index_taken_tests_on_user_id on taken_tests  (cost=0.43..274.18 rows=91 width=4) (actual time=0.014..0.483 rows=371 loops=1)
         Index Cond: (user_id = 1)
   ->  Index Scan using index_answered_questions_on_taken_test_id_and_question_id on answered_questions  (cost=0.56..1273.61 rows=365 width=61) (actual time=0.00
2..0.003 rows=2 loops=371)
         Index Cond: (taken_test_id = taken_tests.id)
 Planning time: 0.276 ms
 Execution time: 2.365 ms
(7 rows)

Another query (this is generated automatically by Rails when using joins method in ActiveRecord)

EXPLAIN ANALYZE 
SELECT 
  "answered_questions".* 
FROM 
  "answered_questions" 
  INNER JOIN "taken_tests" ON "taken_tests"."id" = "answered_questions"."taken_test_id" 
WHERE 
  "taken_tests"."user_id" = 1;

And here is the output

Nested Loop  (cost=0.99..116504.61 rows=1472 width=61) (actual time=23.611..1257.807 rows=653 loops=1)
   ->  Index Scan using index_taken_tests_on_user_id on taken_tests  (cost=0.43..274.18 rows=91 width=4) (actual time=10.451..71.474 rows=371 loops=1)
         Index Cond: (user_id = 1)
   ->  Index Scan using index_answered_questions_on_taken_test_id_and_question_id on answered_questions  (cost=0.56..1273.61 rows=365 width=61) (actual time=2.07
1..3.195 rows=2 loops=371)
         Index Cond: (taken_test_id = taken_tests.id)
 Planning time: 0.302 ms
 Execution time: 1258.035 ms
(7 rows)

The only difference is the order of columns in the INNER JOIN condition. In the first query, it is ON "answered_questions"."taken_test_id" = "taken_tests"."id" while in the second query, it is ON "taken_tests"."id" = "answered_questions"."taken_test_id". But the query time is hugely different.

Do you have any idea why this happens? I read some articles and it says that the order of columns in JOIN condition should not affect the execution time (ex: Best practices for the order of joined columns in a sql join?)

I am using Postgres 9.6. There are more than 40 million rows in answered_questions table and more than 3 million rows in taken_tests table

Update 1:

When I ran the EXPLAIN with (analyze true, verbose true, buffers true), I got a much better result for the second query (quite similar to the first query)

EXPLAIN (ANALYZE TRUE, VERBOSE TRUE, BUFFERS TRUE) 
SELECT
  "answered_questions".* 
FROM
  "answered_questions"
  INNER JOIN "taken_tests" ON "taken_tests"."id" = "answered_questions"."taken_test_id" 
WHERE
  "taken_tests"."user_id" = 1;

Output

Nested Loop  (cost=0.99..116504.61 rows=1472 width=61) (actual time=0.030..2.192 rows=653 loops=1)
   Output: answered_questions.id, answered_questions.question_id, answered_questions.answer_text, answered_questions.created_at, answered_questions.updated_at, a
nswered_questions.taken_test_id, answered_questions.correct, answered_questions.answer
   Buffers: shared hit=1986
   ->  Index Scan using index_taken_tests_on_user_id on public.taken_tests  (cost=0.43..274.18 rows=91 width=4) (actual time=0.014..0.441 rows=371 loops=1)
         Output: taken_tests.id
         Index Cond: (taken_tests.user_id = 1)
         Buffers: shared hit=269
   ->  Index Scan using index_answered_questions_on_taken_test_id_and_question_id on public.answered_questions  (cost=0.56..1273.61 rows=365 width=61) (actual ti
me=0.002..0.003 rows=2 loops=371)
         Output: answered_questions.id, answered_questions.question_id, answered_questions.answer_text, answered_questions.created_at, answered_questions.updated
_at, answered_questions.taken_test_id, answered_questions.correct, answered_questions.answer
         Index Cond: (answered_questions.taken_test_id = taken_tests.id)
         Buffers: shared hit=1717
 Planning time: 0.238 ms
 Execution time: 2.335 ms
like image 671
The Lazy Log Avatar asked Dec 13 '18 02:12

The Lazy Log


People also ask

Does order of inner join matter performance?

Basically, join order DOES matter because if we can join two tables that will reduce the number of rows needed to be processed by subsequent steps, then our performance will improve.

Which table should come first in Inner join?

While it makes no difference (performance wise) to MySQL which order you put the tables in with INNER JOIN (MySQL treats them as equal and will optimize them the same way), it's convention to put the table that you are applying the WHERE clause to first.

Does the order of the on condition of a left join matter?

The order doesn't matter for INNER joins but the order matters for (LEFT, RIGHT or FULL) OUTER joins. Outer joins are not commutative.


1 Answers

As you can see from the initial EXPLAIN ANALYZE statement results -- the queries are resulting in the equivalent query plan and are executed exactly the same.

The difference comes from the very same unit's execution time:

-> Index Scan using index_taken_tests_on_user_id on taken_tests (cost=0.43..274.18 rows=91 width=4) (actual time=0.014..0.483rows=371 loops=1)

and

-> Index Scan using index_taken_tests_on_user_id on taken_tests (cost=0.43..274.18 rows=91 width=4) (actual time=10.451..71.474rows=371 loops=1)

As the commenters already pointed out (see documentation links in the wuestion comments), the query plan for an inner join is expected to be the same regardless of the table order. It is ordered based on the query planner decisions. This means that you should really look at other performance-optimisation parts of the query execution. One of those would be memory used for caching (SHARED BUFFER). It looks like the query results would depend a lot on whether this data has already been loaded into memory. Just as you have noticed -- the query execution time grows after you have waited some time. This clearly indicates the cache expiry issue more than the plan problem. Increasing the size of the shared buffers may help resolve it, but the initial execution of the query will always take longer -- this is just your disk access speed.

For more hints on memory configuration of Pg database see here: https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

Note: VACUUM or ANALYZE commands will be unlikely to help here. Both queries are using the same plan already. Keep in mind, though, that due to PostgreSQL transaction isolation mechanism (MVCC) it may have to read the underlying table rows to validate that they are still visible to the current transaction after getting the results from the index. This could be improved by updating the visibility map (see https://www.postgresql.org/docs/10/storage-vm.html), which is done during vacuuming.

like image 94
Megadest Avatar answered Nov 13 '22 08:11

Megadest