Why are nested loops chosen causing long execution time for "self join"

Question

Disclaimer: this is not a question of what to do to improve performance, but rather why is it bad in first place.

This following query is actually an essence of some bigger query, but small enough to demonstrate the issue I do not understand.

The tables involved are (skipping columns which are - I hope - irrelevant):

create table StanyJednostek (JednostkaID nchar(5), IndeksID nchar(18),
primary key (JednostkaID, IndeksID))

create table Jednostki (JednostkaID nchar(5),
primary key (JednostkaID))

StanyJednostek contains 29187 rows, while there are 1676 distinct IndeksID values in this table). Jednostki contains 94 rows.

And now, this query takes over two minutes to complete:

select
    StanyJednostek.JednostkaID, StanyJednostek.IndeksID
from StanyJednostek
    inner join
        (select distinct IndeksID from StanyJednostek) as Zmiany
        on StanyJednostek.IndeksID = Zmiany.IndeksID 
    inner join
        Jednostki on StanyJednostek.JednostkaID = Jednostki.JednostkaID

Here is execution plan:

enter image description here

What bothers me is this huge number of actual rows: 607147974. This obviously takes two minutes to complete. While I understand where this number comes from (this is 29187 times 20802, and 20802 is number of successful joins between StanyJednostek and Jednostki), I do not quite get why does query optimizer decide to choose nested loops here? Why is not Zmiany some kind of temporary set that is iterated over instead of the whole source table? What's also interesting is that while last two lines of the query seems irrelevant, if I remove those lines, execution plan changes and nested loops are replaced with hashes:

select
    StanyJednostek.JednostkaID, StanyJednostek.IndeksID
from StanyJednostek
    inner join
        (select distinct IndeksID from StanyJednostek) as Zmiany
        on StanyJednostek.IndeksID = Zmiany.IndeksID

Note that query optimizer also stops suggesting creating additional index on IndeksID in StanyJednostek.

enter image description here

Using HASH hint on either join leads to the following execution plan:

enter image description here

Code Different · Accepted Answer

SQL Server reorders the joins to what it thinks is most efficient. In this case, it guesses wrong. Notice from your first execution plan that the join order is as follows:

StanyJednostek
INNER JOIN Jednostki 
INNER JOIN (SELECT DISTINCT IndeksID FROM StanyJednostek)

The first join is hardly anything to write home about - 29187 to 94 rows is not a problem. But the Query Optimizer guessed wrong about the result set from this join. It thinks this temporary result set only has 1 row.

Consequently, it picks a Nested Loop and thought it would scan StanyJednostek only one (Estimated Number of Execution = 1). In reality, it would scan StanyJednostek 20,802 times (the number of rows in the first result set, see Number of Executions).

Note that the DISTINCT operator is no where to be found yet. It is applied after both joins have been executed. Of course, by then you are dealing with 607,147,974 rows.

Since IndeksID is part of a composite primary key (and not the left-most key either), SQL Server does not keep detailed stats on it alone. Hence the index suggestion.

Edit:

Is this wrong guess due to some out of date statistics? Not likely. The first join matches on JednostkaID. See how the column appears in the PK of both tables. SQL Server may think that because it is in the PKs, it must be unique. This is likely a bug in the Query Optimizer.
Why is SQL Server hoisting the DISTINCT operator? From its guess, it saw that the DISTINCT operator will be applied to 20,802 rows, before or after the join - there's no difference! So my guess is that it just picks one.

Some optimization suggestions:

The SELECT DISTINCT IndeksID subquery is not needed, at all! This probably brings the biggest improvement to performance.
If you really insist on keeping the SELECT DISTINCT for some reason that is not in this question, I would recommend materialize it into a temp table. It forces SQL Server to apply DISTINCT on a smaller set of rows (29,187)
You can force the join order by adding OPTION (FORCE ORDER) to the end of the query. But use this carefully and sparingly.
You can force a has join by INNER HASH JOIN, but again, be mindful of the unwanted effect that aren't immediately visible. Any kind of query hint carries risks.

Remus Rusanu · Answer

The second inner join expands the number of rows, since StanyJednostek.JednostkaID = Jednostki.JednostkaID is N:1. This increases the memory required by the hash join to above your system available, hence the hash join cannot be used.

As for why the missing index hint is gone: because a hash join does not need it. Is very likely that the missing index would improve perf.

Why are nested loops chosen causing long execution time for "self join"

Tags:

sql

sql-server

tsql

sql-server-2008

Kuba Wyrostek

2 Answers

Code Different

Remus Rusanu

Recent Activity

Donate For Us

Why are nested loops chosen causing long execution time for "self join"

Tags:

sql

sql-server

tsql

sql-server-2008

Kuba Wyrostek

2 Answers

Code Different

Remus Rusanu

Related questions

Recent Activity

Donate For Us