I have PostgreSQL 9.5.9 and two tables: table1 and table2.

table1:
Column | Type | Modifiers
------------+--------------------------------+-------------------------------------------
id | integer | not null
status | integer | not null
table2_id | integer |
start_date | timestamp(0) without time zone | default NULL::timestamp without time zone
Indexes:
"table1_pkey" PRIMARY KEY, btree (id)
"table1_start_date" btree (start_date)
"table1_table2" btree (table2_id)
Foreign-key constraints:
"fk_t1_t2" FOREIGN KEY (table2_id) REFERENCES table2(id)
table2:
Column | Type | Modifiers
--------+-------------------------+---------------------------------
id | integer | not null
name | character varying(2000) | default NULL::character varying
Indexes:
"table2_pkey" PRIMARY KEY, btree (id)
Referenced by:
TABLE "table1" CONSTRAINT "fk_t1_t2" FOREIGN KEY (table2_id) REFERENCES table2(id)
table2 contains only 3 rows; table1 contains about 400,000 rows, and only half of them have a value in the table2_id column.
The query is fast enough when I select some values from table1 ordered by the start_date column, because the table1_start_date index is used effectively:
SELECT t1.*
FROM table1 AS t1
ORDER BY t1.start_date DESC
LIMIT 25 OFFSET 150000;
EXPLAIN ANALYZE result:
Limit  (cost=7797.40..7798.70 rows=25 width=20) (actual time=40.994..41.006 rows=25 loops=1)
  ->  Index Scan Backward using table1_start_date on table1 t1  (cost=0.42..20439.74 rows=393216 width=20) (actual time=0.078..36.251 rows=150025 loops=1)
Planning time: 0.097 ms
Execution time: 41.033 ms
But the query becomes very slow when I add a LEFT JOIN to fetch values from table2 too:
SELECT t1.*, t2.*
FROM table1 AS t1
LEFT JOIN table2 AS t2 ON t2.id = t1.table2_id
ORDER BY t1.start_date DESC
LIMIT 25 OFFSET 150000;
EXPLAIN ANALYZE result:
Limit (cost=33690.80..33696.42 rows=25 width=540) (actual time=191.282..191.320 rows=25 loops=1)
  ->  Nested Loop Left Join  (cost=0.57..88317.50 rows=393216 width=540) (actual time=0.028..184.537 rows=150025 loops=1)
        ->  Index Scan Backward using table1_start_date on table1 t1  (cost=0.42..20439.74 rows=393216 width=20) (actual time=0.018..35.196 rows=150025 loops=1)
        ->  Index Scan using table2_pkey on table2 t2  (cost=0.14..0.16 rows=1 width=520) (actual time=0.000..0.001 rows=1 loops=150025)
              Index Cond: (id = t1.table2_id)
Planning time: 0.210 ms
Execution time: 191.357 ms
Why did the query time increase from 41 ms to 191 ms? As I understand it, the LEFT JOIN cannot affect the result set, so PostgreSQL could select the 25 rows from table1 first (LIMIT 25) and only then join the rows from table2; the execution time shouldn't increase significantly. There are no tricky conditions that could prevent the use of the index, etc.
I don't completely understand the EXPLAIN ANALYZE output for the second query, but it seems the planner decided to "perform the join and then apply the limit" instead of "apply the limit and then join". Done this way, the query is too slow. What is the problem?
If you don't include columns from the left-joined table in the SELECT list, the LEFT JOIN can be faster than the same query with an INNER JOIN. If you do include columns from the left-joined table, the INNER JOIN version of the same query is equal to or faster than the LEFT JOIN.
The LEFT JOIN query is slower than the INNER JOIN query because it's doing more work.
Nested loop joins are particularly efficient if the outer relation is small, because then the inner loop won't be executed too often.
If PostgreSQL chooses the wrong strategy, query performance can suffer a lot. It helps to understand the join strategies, how you can support them with indexes, what can go wrong with them, and how you can tune your joins for better performance.
In such cases, a bad row-count estimate is often the cause of the problem. So while the join may be where most of the execution time is spent, the cause is a misestimate that happened earlier in the plan. Find out what the best join strategy is (perhaps PostgreSQL is doing the right thing anyway).
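One way to check this in practice (a sketch: ANALYZE refreshes the statistics the row-count estimates are based on, and enable_nestloop is a standard planner setting for ruling out a strategy during a session):

```sql
-- Refresh the statistics the planner bases its row-count estimates on.
ANALYZE table1;
ANALYZE table2;

-- Temporarily rule out nested loops for this session to see which
-- strategy the planner falls back to, and how that plan performs.
SET enable_nestloop = off;
EXPLAIN ANALYZE
SELECT t1.*, t2.*
FROM table1 AS t1
LEFT JOIN table2 AS t2 ON t2.id = t1.table2_id
ORDER BY t1.start_date DESC
LIMIT 25 OFFSET 150000;
RESET enable_nestloop;
```

Comparing the two EXPLAIN ANALYZE outputs shows whether the planner's choice was actually the cheaper one for this data.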
SQL queries are mostly declarative: you describe what data you would like to retrieve, Postgres figures out a plan for how to get it for you, then executes that plan. This planning process is similar to how you might plan a trip: what sights do you want to see? When are they open?
First, PostgreSQL scans the inner relation sequentially and builds a hash table, where the hash key consists of all join keys that use the = operator. Then it scans the outer relation sequentially and probes the hash for each row found to find matching join keys.
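For instance, without the ORDER BY/LIMIT the planner is free to pick exactly that strategy here, building the hash table on the 3-row table2 (a sketch; the actual plan choice depends on the statistics):

```sql
EXPLAIN
SELECT t1.*, t2.*
FROM table1 AS t1
LEFT JOIN table2 AS t2 ON t2.id = t1.table2_id;
-- Typically yields a plan along the lines of:
--   Hash Left Join
--     Hash Cond: (t1.table2_id = t2.id)
--     ->  Seq Scan on table1 t1
--     ->  Hash
--           ->  Seq Scan on table2 t2
```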
It just doesn't know that the limit should apply to table1 rather than to the result of the join, so it fetches the minimum required number of rows, which is 150,025, and then performs 150,025 loops over table2. If you do a subselect with the limit on table1 and join table2 to that subselect, you should get what you want:
SELECT t1.*, t2.*
FROM (SELECT *
FROM table1
ORDER BY start_date DESC
LIMIT 25 OFFSET 150000) AS t1
LEFT JOIN table2 AS t2 ON t2.id = t1.table2_id;
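The same effect can also be written with a CTE; in PostgreSQL 9.5 a CTE is an optimization fence, so the LIMIT/OFFSET is guaranteed to be applied to table1 before the join (a sketch equivalent to the subselect above; from v12 onward you would need MATERIALIZED to force this):

```sql
WITH first_page AS (
    SELECT *
    FROM table1
    ORDER BY start_date DESC
    LIMIT 25 OFFSET 150000
)
SELECT t1.*, t2.*
FROM first_page AS t1
LEFT JOIN table2 AS t2 ON t2.id = t1.table2_id;
```

Note that either form still has to walk 150,025 index entries to satisfy the OFFSET; the rewrite only avoids probing table2 for the rows that are skipped.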