Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Postgres: why LEFT JOIN affects to query plan?

I have PostgreSQL 9.5.9 and two tables: table1 and table2

 Column   |              Type              |                 Modifiers                 
------------+--------------------------------+-------------------------------------------
 id         | integer                        | not null
 status     | integer                        | not null
 table2_id  | integer                        | 
 start_date | timestamp(0) without time zone | default NULL::timestamp without time zone
Indexes:
    "table1_pkey" PRIMARY KEY, btree (id)
    "table1_start_date" btree (start_date)
    "table1_table2" btree (table2_id)
Foreign-key constraints:
    "fk_t1_t2" FOREIGN KEY (table2_id) REFERENCES table2(id)


 Column |          Type           |            Modifiers            
--------+-------------------------+---------------------------------
 id     | integer                 | not null
 name   | character varying(2000) | default NULL::character varying
Indexes:
    "table2_pkey" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "table1" CONSTRAINT "fk_t1_t2" FOREIGN KEY (table2_id) REFERENCES table2(id)

table2 contains only 3 rows; table1 contains about 400000 rows and only half of them have some value in table_2_id column.

The query is fast enough when i select some values from table1 ordered by start_date column because table1_start_date index is effectively used:

SELECT t1.*
FROM table1 AS t1 
ORDER BY t1.start_date DESC
LIMIT 25 OFFSET 150000;

EXPLAIN ANALYZE result

   Limit  (cost=7797.40..7798.70 rows=25 width=20) (actual time=40.994..41.006 rows=25 loops=1)
   ->  Index Scan Backward using table1_start_date on table1 t1  (cost=0.42..20439.74 rows=393216 width=20) (actual time=0.078..36.251 rows=150025
 loops=1)
 Planning time: 0.097 ms
 Execution time: 41.033 ms

But the query become very slow when i add LEFT JOIN to fetch values from table2 too:

SELECT t1.*, t2.*
FROM table1 AS t1
LEFT JOIN table2 AS t2 ON t2.id = t1.table2_id
ORDER BY t1.start_date DESC
LIMIT 25 OFFSET 150000;

EXPLAIN ANALYZE result:

 Limit  (cost=33690.80..33696.42 rows=25 width=540) (actual time=191.282..191.320 rows=25 loops=1)
   ->  Nested Loop Left Join  (cost=0.57..88317.50 rows=393216 width=540) (actual time=0.028..184.537 rows=150025 loops=1)
         ->  Index Scan Backward using table1_start_date on table1 t1  (cost=0.42..20439.74 rows=393216 width=20) (actual time=0.018..35.196 rows=
150025 loops=1)
         ->  Index Scan using table2_pkey on table2 t2  (cost=0.14..0.16 rows=1 width=520) (actual time=0.000..0.001 rows=1 loops=150025)
               Index Cond: (id = t1.table2_id)
 Planning time: 0.210 ms
 Execution time: 191.357 ms

Why query time increased from 32ms to 191ms? As i understand, LEFT JOIN can not affect to result. So, we can select 25 rows from table1 (LIMIT 25) first and then join rows from table2 Execution time of the query shouldn't significantly increase. There are no some tricky conditions which can break the use of index, etc.

I don't understand completely the EXPLAIN ANALYZE for second query, but it seems like postgres analyzer decided to "perform join and then filter" instead of "filter and then join". In this way the query is too slow. What is the problem?

like image 614
ox160d05d Avatar asked Nov 22 '17 10:11

ox160d05d


People also ask

Why is left join more efficient than inner join?

If you dont include the items of the left joined table, in the select statement, the left join will be faster than the same query with inner join. If you do include the left joined table in the select statement, the inner join with the same query was equal or faster than the left join.

Are left JOINs slower than inner JOINs?

The LEFT JOIN query is slower than the INNER JOIN query because it's doing more work.

Which join is faster in PostgreSQL?

Nested loop joins are particularly efficient if the outer relation is small, because then the inner loop won't be executed too often.

Does PostgreSQL choose the wrong join strategy?

If PostgreSQL chooses the wrong strategy, query performance can suffer a lot. This article explains the join strategies, how you can support them with indexes, what can go wrong with them and how you can tune your joins for better performance.

Why does PostgreSQL take so long to join?

In both cases, a bad row count estimate is the cause of the problem. So while the join may be where we spend most of the execution time, the cause is a misestimate that happened earlier on. Find out what the best join strategy is (perhaps PostgreSQL is doing the right thing anyway).

How do Postgres queries work?

SQL queries are mostly declarative: you describe what data you would like to retrieve, Postgres figures out a plan for how to get it for you, then executes that plan. This planning process is similar to how you might plan a trip: what sights do you want to see? When are they open?

How do you find the join key in PostgreSQL?

First, PostgreSQL scans the inner relation sequentially and builds a hash table, where the hash key consists of all join keys that use the = operator. Then it scans the outer relation sequentially and probes the hash for each row found to find matching join keys.


Video Answer


1 Answers

It just doesn't know that limit should apply to table1 instead of result of join, so it fetches minimum required rows, that is 150025 and then does 150025 loops on table2. If you do subselect with limit on table1 and join table2 to that subselect you should get what you want.

SELECT t1.*, t2.*
FROM (SELECT *
        FROM table1
       ORDER BY start_date DESC
       LIMIT 25 OFFSET 150000) AS t1
LEFT JOIN table2 AS t2 ON t2.id = t1.table2_id;
like image 63
Łukasz Kamiński Avatar answered Nov 11 '22 03:11

Łukasz Kamiński