Disclaimer: this is not a question of what to do to improve performance, but rather why is it bad in first place.
This following query is actually an essence of some bigger query, but small enough to demonstrate the issue I do not understand.
The tables involved are (skipping columns which are - I hope - irrelevant):
create table StanyJednostek (JednostkaID nchar(5), IndeksID nchar(18),
primary key (JednostkaID, IndeksID))
create table Jednostki (JednostkaID nchar(5),
primary key (JednostkaID))
StanyJednostek
contains 29187 rows, while there are 1676 distinct IndeksID
values in this table). Jednostki
contains 94 rows.
And now, this query takes over two minutes to complete:
select
StanyJednostek.JednostkaID, StanyJednostek.IndeksID
from StanyJednostek
inner join
(select distinct IndeksID from StanyJednostek) as Zmiany
on StanyJednostek.IndeksID = Zmiany.IndeksID
inner join
Jednostki on StanyJednostek.JednostkaID = Jednostki.JednostkaID
Here is execution plan:
What bothers me is this huge number of actual rows: 607147974. This obviously takes two minutes to complete. While I understand where this number comes from (this is 29187 times 20802, and 20802 is number of successful joins between StanyJednostek
and Jednostki
), I do not quite get why does query optimizer decide to choose nested loops here? Why is not Zmiany
some kind of temporary set that is iterated over instead of the whole source table? What's also interesting is that while last two lines of the query seems irrelevant, if I remove those lines, execution plan changes and nested loops are replaced with hashes:
select
StanyJednostek.JednostkaID, StanyJednostek.IndeksID
from StanyJednostek
inner join
(select distinct IndeksID from StanyJednostek) as Zmiany
on StanyJednostek.IndeksID = Zmiany.IndeksID
Note that query optimizer also stops suggesting creating additional index on IndeksID
in StanyJednostek
.
Using HASH
hint on either join leads to the following execution plan:
SQL Server reorders the joins to what it thinks is most efficient. In this case, it guesses wrong. Notice from your first execution plan that the join order is as follows:
StanyJednostek
INNER JOIN Jednostki
INNER JOIN (SELECT DISTINCT IndeksID FROM StanyJednostek)
The first join is hardly anything to write home about - 29187 to 94 rows is not a problem. But the Query Optimizer guessed wrong about the result set from this join. It thinks this temporary result set only has 1 row.
Consequently, it picks a Nested Loop and thought it would scan StanyJednostek
only one (Estimated Number of Execution = 1). In reality, it would scan StanyJednostek
20,802 times (the number of rows in the first result set, see Number of Executions).
Note that the DISTINCT
operator is no where to be found yet. It is applied after both joins have been executed. Of course, by then you are dealing with 607,147,974 rows.
Since IndeksID
is part of a composite primary key (and not the left-most key either), SQL Server does not keep detailed stats on it alone. Hence the index suggestion.
Edit:
Is this wrong guess due to some out of date statistics? Not likely. The first join matches on JednostkaID
. See how the column appears in the PK of both tables. SQL Server may think that because it is in the PKs, it must be unique. This is likely a bug in the Query Optimizer.
Why is SQL Server hoisting the DISTINCT
operator? From its guess, it saw that the DISTINCT
operator will be applied to 20,802 rows, before or after the join - there's no difference! So my guess is that it just picks one.
Some optimization suggestions:
The SELECT DISTINCT IndeksID
subquery is not needed, at all! This probably brings the biggest improvement to performance.
If you really insist on keeping the SELECT DISTINCT
for some reason that is not in this question, I would recommend materialize it into a temp table. It forces SQL Server to apply DISTINCT
on a smaller set of rows (29,187)
You can force the join order by adding OPTION (FORCE ORDER)
to the end of the query. But use this carefully and sparingly.
You can force a has join by INNER HASH JOIN
, but again, be mindful of the unwanted effect that aren't immediately visible. Any kind of query hint carries risks.
The second inner join expands the number of rows, since StanyJednostek.JednostkaID = Jednostki.JednostkaID
is N:1. This increases the memory required by the hash join to above your system available, hence the hash join cannot be used.
As for why the missing index hint is gone: because a hash join does not need it. Is very likely that the missing index would improve perf.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With