Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why would a left join cause an optimizer to ignore an index?

Using postgres 9.6.11, I have a schema like:

owner:

id: BIGINT (PK)
dog_id: BIGINT NOT NULL (FK)
cat_id: BIGINT NULL (FK)

index DOG_ID_IDX (dog_id)
index CAT_ID_IDX (cat_id)

animal:

id: BIGINT (PK)
name: VARCHAR(50) NOT NULL

index NAME_IDX (name)

In some example data:

owner table:

| id | dog_id | cat_id |
| -- | ------ | ------ |
| 1  | 100    | 200    |
| 2  | 101    | NULL   |

animal table:

| id  | name     |
| --- | -------- |
| 100 | "fluffy" |
| 101 | "rex"    |
| 200 | "tom"    |

A common query I need to perform is to find owners by their pets name, which I thought to accomplish with a query like:

select *
from owner o
    join animal dog on o.dog_id = dog.id
    left join animal cat on o.cat_id = cat.id
where dog.name = "fluffy" or cat.name = "fluffy";

But the plan I get back from this I don't understand:

Hash Join  (cost=30304.51..77508.31 rows=3 width=899)
  Hash Cond: (dog.id = owner.dog_id)
  Join Filter: (((dog.name)::text = 'fluffy'::text) OR ((cat.name)::text = 'fluffy'::text))
  ->  Seq Scan on animal dog  (cost=0.00..17961.23 rows=116623 width=899)
  ->  Hash  (cost=28208.65..28208.65 rows=114149 width=19)
        ->  Hash Left Join  (cost=20103.02..28208.65 rows=114149 width=19)
              Hash Cond: (owner.cat_id = cat.id)
              ->  Seq Scan on owner o  (cost=0.00..5849.49 rows=114149 width=16)
              ->  Hash  (cost=17961.23..17961.23 rows=116623 width=19)
                    ->  Seq Scan on animal cat  (cost=0.00..17961.23 rows=116623 width=19)

I don't understand why the query plan is doing a sequential scan. I thought that the optimizer would be smart enough to scan the animal table once, or even twice using the name index, and join back to the owner table based on this result, but instead I wind up with a very unexpected query plan.

I took a simpler case where we only want to look up dog names and the query behaves as I'd expect:

select *
from owner o
    join animal dog on o.dog_id = dog.id
where dog.name = "fluffy";

This query produces an a plan I understand, using the index on animal.name:

Nested Loop  (cost=0.83..16.88 rows=1 width=1346)
  ->  Index Scan using DOG_ID_IDX on animal dog  (cost=0.42..8.44 rows=1 width=899)
        Index Cond: ((name)::text = 'fluffy'::text)
  ->  Index Scan using dog_id on owner o  (cost=0.42..8.44 rows=1 width=447)
        Index Cond: (dog_id = b.id)

Even doing the query with two inner joins produces a query plan I would expect:

select * 
from owner o
  join animal dog on o.dog_id = dog.id
  join animal cat on o.cat_id = cat.id
where dog.name = 'fluffy' or cat.name = 'fluffy';
Merge Join  (cost=35726.09..56215.53 rows=3 width=2245)
  Merge Cond: (owner.cat_id = cat.id)
  Join Filter: (((dog.name)::text = 'fluffy'::text) OR ((cat.name)::text = 'fluffy'::text))
  ->  Nested Loop  (cost=0.83..132348.38 rows=114149 width=1346)
        ->  Index Scan using CAT_ID_IDX on owner o  (cost=0.42..11616.07 rows=114149 width=447)
        ->  Index Scan using animal_pkey on animal dog  (cost=0.42..1.05 rows=1 width=899)
              Index Cond: (id = owner.dog_id)
  ->  Index Scan using animal_pkey on animal cat  (cost=0.42..52636.91 rows=116623 width=899)

So it looks like the left join to animal is causing the optimizer to ignore the index.

Why does doing the additional left join to animal seem to cause the optimizer to ignore the index?

EDIT: EXPLAIN (analyse, buffers) yields:

Hash Left Join  (cost=32631.95..150357.57 rows=3 width=2245) (actual time=6696.935..6696.936 rows=0 loops=1)
  Hash Cond: (o.cat_id = cat.id)
  Filter: (((dog.name)::text = 'fluffy'::text) OR ((cat.name)::text = 'fluffy'::text))
  Rows Removed by Filter: 114219
  Buffers: shared hit=170464 read=18028 dirtied=28, temp read=13210 written=13148
  ->  Merge Join  (cost=0.94..65696.37 rows=114149 width=1346) (actual time=1.821..860.643 rows=114219 loops=1)
        Merge Cond: (o.dog_id = dog.id)
        Buffers: shared hit=170286 read=1408 dirtied=28
        ->  Index Scan using DOG_ID_IDX on owner o  (cost=0.42..11402.48 rows=114149 width=447) (actual time=1.806..334.431 rows=114219 loops=1)
              Buffers: shared hit=84787 read=783 dirtied=13
        ->  Index Scan using animal_pkey on animal dog  (cost=0.42..52636.91 rows=116623 width=899) (actual time=0.006..300.507 rows=116977 loops=1)
              Buffers: shared hit=85499 read=625 dirtied=15
  ->  Hash  (cost=17961.23..17961.23 rows=116623 width=899) (actual time=5626.780..5626.780 rows=116977 loops=1)
        Buckets: 8192  Batches: 32  Memory Usage: 3442kB
        Buffers: shared hit=175 read=16620, temp written=12701
        ->  Seq Scan on animal cat  (cost=0.00..17961.23 rows=116623 width=899) (actual time=2.519..5242.106 rows=116977 loops=1)
              Buffers: shared hit=175 read=16620
Planning time: 1.245 ms
Execution time: 6697.357 ms
like image 369
And1 Avatar asked Apr 23 '19 17:04

And1


People also ask

Why are left joins slower than inner joins?

IS LEFT join slower than join? The LEFT JOIN query is slower than the INNER JOIN query because it's doing more work. From the EXPLAIN output, it looks like MySQL is doing nested loop join.

What is the purpose of LEFT join?

A left join is used when a user wants to extract the left table's data only. Left join not only combines the left table's rows but also the rows that match alongside the right table. 2.

Which is faster left or inner join?

If you dont include the items of the left joined table, in the select statement, the left join will be faster than the same query with inner join. If you do include the left joined table in the select statement, the inner join with the same query was equal or faster than the left join.

How does MySQL choose which index to use?

If there is a choice between multiple indexes, MySQL normally uses the index that finds the smallest number of rows (the most selective index). If the table has a multiple-column index, any leftmost prefix of the index can be used by the optimizer to look up rows.


1 Answers

The left join needs to keep all rows in the first table. Hence, it is going to generally scan that table, even where conditions filter other tables on those conditions.

The query plan produced by Postgres is not surprising.

like image 195
Gordon Linoff Avatar answered Sep 26 '22 16:09

Gordon Linoff