Order of columns in INNER JOIN condition affects the performance badly

Tags: sql, postgresql, ruby-on-rails

I have two tables that link to each other like this:

Table answered_questions with the following columns and indexes:

id: primary key
taken_test_id: integer (foreign key)
question_id: integer (foreign key, links to another table called questions)
indexes: (taken_test_id, question_id)

Table taken_tests

id: primary key
user_id: (foreign key, links to table Users)
indexes: user_id column
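
For reference, the schema described above could be sketched as the following DDL. The column types are assumptions (the post names only the columns and indexes), the remaining columns of answered_questions are omitted, and the index names are taken from the EXPLAIN output in this post.

```sql
-- Sketch only: column types are assumed, not stated in the question.
CREATE TABLE taken_tests (
    id      serial PRIMARY KEY,
    user_id integer  -- foreign key to the users table (not shown here)
);
CREATE INDEX index_taken_tests_on_user_id ON taken_tests (user_id);

CREATE TABLE answered_questions (
    id            serial PRIMARY KEY,
    taken_test_id integer REFERENCES taken_tests (id),
    question_id   integer  -- foreign key to the questions table (not shown here)
);
-- Composite index used by the inner side of the nested loop.
CREATE INDEX index_answered_questions_on_taken_test_id_and_question_id
    ON answered_questions (taken_test_id, question_id);
```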

First query (with EXPLAIN ANALYZE output):

EXPLAIN ANALYZE 
SELECT 
  "answered_questions".* 
FROM 
  "answered_questions" 
  INNER JOIN "taken_tests" ON "answered_questions"."taken_test_id" = "taken_tests"."id" 
WHERE 
  "taken_tests"."user_id" = 1;

Output:

Nested Loop  (cost=0.99..116504.61 rows=1472 width=61) (actual time=0.025..2.208 rows=653 loops=1)
   ->  Index Scan using index_taken_tests_on_user_id on taken_tests  (cost=0.43..274.18 rows=91 width=4) (actual time=0.014..0.483 rows=371 loops=1)
         Index Cond: (user_id = 1)
   ->  Index Scan using index_answered_questions_on_taken_test_id_and_question_id on answered_questions  (cost=0.56..1273.61 rows=365 width=61) (actual time=0.002..0.003 rows=2 loops=371)
         Index Cond: (taken_test_id = taken_tests.id)
 Planning time: 0.276 ms
 Execution time: 2.365 ms
(7 rows)

Second query (generated automatically by Rails when using the joins method in ActiveRecord):

EXPLAIN ANALYZE 
SELECT 
  "answered_questions".* 
FROM 
  "answered_questions" 
  INNER JOIN "taken_tests" ON "taken_tests"."id" = "answered_questions"."taken_test_id" 
WHERE 
  "taken_tests"."user_id" = 1;

Output:

Nested Loop  (cost=0.99..116504.61 rows=1472 width=61) (actual time=23.611..1257.807 rows=653 loops=1)
   ->  Index Scan using index_taken_tests_on_user_id on taken_tests  (cost=0.43..274.18 rows=91 width=4) (actual time=10.451..71.474 rows=371 loops=1)
         Index Cond: (user_id = 1)
   ->  Index Scan using index_answered_questions_on_taken_test_id_and_question_id on answered_questions  (cost=0.56..1273.61 rows=365 width=61) (actual time=2.071..3.195 rows=2 loops=371)
         Index Cond: (taken_test_id = taken_tests.id)
 Planning time: 0.302 ms
 Execution time: 1258.035 ms
(7 rows)

The only difference is the order of columns in the INNER JOIN condition. In the first query, it is ON "answered_questions"."taken_test_id" = "taken_tests"."id" while in the second query, it is ON "taken_tests"."id" = "answered_questions"."taken_test_id". But the query time is hugely different.

Do you have any idea why this happens? I read some articles, and they say that the order of columns in a JOIN condition should not affect the execution time (e.g. "Best practices for the order of joined columns in a sql join?").

I am using Postgres 9.6. There are more than 40 million rows in the answered_questions table and more than 3 million rows in the taken_tests table.

Update 1:

When I ran EXPLAIN with (ANALYZE TRUE, VERBOSE TRUE, BUFFERS TRUE), I got a much better result for the second query (quite similar to the first query):

EXPLAIN (ANALYZE TRUE, VERBOSE TRUE, BUFFERS TRUE) 
SELECT
  "answered_questions".* 
FROM
  "answered_questions"
  INNER JOIN "taken_tests" ON "taken_tests"."id" = "answered_questions"."taken_test_id" 
WHERE
  "taken_tests"."user_id" = 1;

Output:

Nested Loop  (cost=0.99..116504.61 rows=1472 width=61) (actual time=0.030..2.192 rows=653 loops=1)
   Output: answered_questions.id, answered_questions.question_id, answered_questions.answer_text, answered_questions.created_at, answered_questions.updated_at, answered_questions.taken_test_id, answered_questions.correct, answered_questions.answer
   Buffers: shared hit=1986
   ->  Index Scan using index_taken_tests_on_user_id on public.taken_tests  (cost=0.43..274.18 rows=91 width=4) (actual time=0.014..0.441 rows=371 loops=1)
         Output: taken_tests.id
         Index Cond: (taken_tests.user_id = 1)
         Buffers: shared hit=269
   ->  Index Scan using index_answered_questions_on_taken_test_id_and_question_id on public.answered_questions  (cost=0.56..1273.61 rows=365 width=61) (actual time=0.002..0.003 rows=2 loops=371)
         Output: answered_questions.id, answered_questions.question_id, answered_questions.answer_text, answered_questions.created_at, answered_questions.updated_at, answered_questions.taken_test_id, answered_questions.correct, answered_questions.answer
         Index Cond: (answered_questions.taken_test_id = taken_tests.id)
         Buffers: shared hit=1717
 Planning time: 0.238 ms
 Execution time: 2.335 ms

asked Dec 13 '18 02:12

The Lazy Log

1 Answer

As you can see from the initial EXPLAIN ANALYZE results, both queries produce the same query plan and are executed in exactly the same way.

The difference is in the execution time of the very same plan node:

-> Index Scan using index_taken_tests_on_user_id on taken_tests (cost=0.43..274.18 rows=91 width=4) (actual time=0.014..0.483 rows=371 loops=1)

and

-> Index Scan using index_taken_tests_on_user_id on taken_tests (cost=0.43..274.18 rows=91 width=4) (actual time=10.451..71.474 rows=371 loops=1)

As the commenters already pointed out (see the documentation links in the question comments), the query plan for an inner join is expected to be the same regardless of the order in which the tables are written. The join order is chosen by the query planner.

This means you should look at other factors that affect query execution. One of them is the memory used for caching (shared_buffers). The timings suggest that the result depends heavily on whether the data has already been loaded into memory. As you have noticed, the query execution time grows after you have waited for some time. That points to the data having dropped out of the cache rather than to a problem with the plan. Increasing the size of the shared buffers may help, but the first execution of the query will always take longer, because it is limited by your disk access speed.
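
One way to check the caching explanation is to compare buffer hits with disk reads. The sketch below uses standard PostgreSQL commands; pg_prewarm is an optional contrib extension, and whether the index actually fits into your shared_buffers is an assumption you would need to verify on your own server.

```sql
-- Current size of PostgreSQL's own buffer cache.
SHOW shared_buffers;

-- BUFFERS reports "shared hit" (page found in shared buffers) and "read"
-- (page fetched from the OS cache or disk). A slow run should show many
-- reads, while a fast run should show mostly hits.
EXPLAIN (ANALYZE, BUFFERS)
SELECT "answered_questions".*
FROM "answered_questions"
  INNER JOIN "taken_tests" ON "taken_tests"."id" = "answered_questions"."taken_test_id"
WHERE "taken_tests"."user_id" = 1;

-- Optional: load the index into shared buffers ahead of time.
-- Requires the pg_prewarm contrib module to be installed on the server.
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('index_answered_questions_on_taken_test_id_and_question_id');
```

The output in Update 1 of the question shows "shared hit=1986" and no reads, which is consistent with a run where everything was already cached.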

For more hints on configuring memory for a PostgreSQL database, see: https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

Note: VACUUM and ANALYZE are unlikely to help here, since both queries already use the same plan. Keep in mind, though, that because of PostgreSQL's MVCC-based transaction isolation, it may have to read the underlying table rows after an index lookup to check that they are still visible to the current transaction. Keeping the visibility map up to date (see https://www.postgresql.org/docs/10/storage-vm.html), which happens during vacuuming, can reduce those reads, mainly for index-only scans.
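
To see how much of each table the visibility map currently covers, something like the following could be used. The relpages and relallvisible columns in pg_class are estimates maintained by VACUUM and ANALYZE, so treat the percentage as approximate.

```sql
-- Share of pages marked all-visible, per table (approximate).
SELECT relname,
       relpages,
       relallvisible,
       round(100.0 * relallvisible / nullif(relpages, 0), 1) AS pct_all_visible
FROM pg_class
WHERE relname IN ('answered_questions', 'taken_tests');

-- Vacuuming updates the visibility map and refreshes planner statistics.
VACUUM (VERBOSE, ANALYZE) answered_questions;
```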

answered Nov 13 '22 08:11

Megadest
