I want to select rows from a table whose primary key appears in another table. I'm not sure whether I should use a JOIN or the IN operator in SQL Server 2005. Is there any significant performance difference between these two SQL queries on a large dataset (i.e. millions of rows)?
SELECT * FROM a WHERE a.c IN (SELECT d FROM b)
SELECT a.* FROM a JOIN b ON a.c = b.d
If the joining column is UNIQUE and marked as such, both of these queries yield the same plan in SQL Server. If it's not, then IN is faster than a JOIN on DISTINCT.
In most cases, EXISTS or JOIN will be much more efficient (and faster) than an IN subquery.
If all you need is to check for matching rows in the other table but don't need any columns from that table, use IN. If you do need columns from the second table, use INNER JOIN.
Update:
This article in my blog summarizes both my answer and my comments on the other answers, and shows actual execution plans:
SELECT * FROM a WHERE a.c IN (SELECT d FROM b)
SELECT a.* FROM a JOIN b ON a.c = b.d
These queries are not equivalent. They can yield different results if your table b is not key-preserved (i.e. the values of b.d are not unique).
The equivalent of the first query is the following:
SELECT a.* FROM a JOIN ( SELECT DISTINCT d FROM b ) bo ON a.c = bo.d
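The difference is easy to see on a tiny dataset. The following is a minimal sketch, not a SQL Server test: it assumes an in-memory SQLite database driven from Python's standard sqlite3 module, and the sample rows and the `rows` helper are made up for illustration. It only demonstrates result sets, not execution plans.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server here; only the result sets
# are compared, not the plans.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (c INTEGER PRIMARY KEY);
    CREATE TABLE b (d INTEGER);              -- d is deliberately NOT unique
    INSERT INTO a (c) VALUES (1), (2), (3);
    INSERT INTO b (d) VALUES (1), (1), (2);  -- the value 1 appears twice
""")

def rows(sql):
    """Hypothetical helper: run a query and return its first column, sorted."""
    return sorted(r[0] for r in conn.execute(sql))

# IN returns each matching row of a once, however many times it matches in b.
in_rows = rows("SELECT a.c FROM a WHERE a.c IN (SELECT d FROM b)")

# A plain JOIN returns one row per matching pair, so a.c = 1 shows up twice.
join_rows = rows("SELECT a.c FROM a JOIN b ON a.c = b.d")

# Joining to the DISTINCT values of b.d removes the duplicates again.
distinct_rows = rows(
    "SELECT a.c FROM a JOIN (SELECT DISTINCT d FROM b) bo ON a.c = bo.d"
)

print(in_rows, join_rows, distinct_rows)
```

With this sample data the plain JOIN returns one more row than the IN query, while the JOIN on the DISTINCT subquery matches the IN query exactly.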
If b.d is UNIQUE and marked as such (with a UNIQUE INDEX or UNIQUE CONSTRAINT), then these queries are identical and most probably will use identical plans, since SQL Server is smart enough to take this into account.
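As a rough illustration of why a declared uniqueness guarantee makes the two forms interchangeable, here is a small sketch under the same assumptions as before (SQLite through Python's sqlite3 module rather than SQL Server). The index name `ux_b_d` and the sample rows are hypothetical. It shows only that the result sets coincide once duplicates in b.d are impossible; it says nothing about which plan SQL Server picks.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; ux_b_d is a made-up index name.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (c INTEGER PRIMARY KEY);
    CREATE TABLE b (d INTEGER);
    CREATE UNIQUE INDEX ux_b_d ON b (d);     -- b.d is now marked as unique
    INSERT INTO a (c) VALUES (1), (2), (3);
    INSERT INTO b (d) VALUES (1), (2);
""")

# The engine itself now rejects a duplicate value in b.d.
try:
    conn.execute("INSERT INTO b (d) VALUES (1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# With duplicates ruled out, each row of a matches at most one row of b,
# so the IN form and the JOIN form return the same rows.
in_rows = sorted(
    r[0] for r in conn.execute("SELECT a.c FROM a WHERE a.c IN (SELECT d FROM b)")
)
join_rows = sorted(
    r[0] for r in conn.execute("SELECT a.c FROM a JOIN b ON a.c = b.d")
)

print(duplicate_rejected, in_rows, join_rows)
```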
SQL Server can employ one of the following methods to run this query:
If there is an index on a.c, d is UNIQUE and b is relatively small compared to a, then the condition is propagated into the subquery and a plain INNER JOIN is used (with b leading).
If there is an index on b.d and d is not UNIQUE, then the condition is also propagated and LEFT SEMI JOIN is used. It can also be used for the condition above.
If there is an index on both b.d and a.c and both tables are large, then MERGE SEMI JOIN is used.
If there is no index on either table, then a hash table is built on b and HASH SEMI JOIN is used.
None of these methods reevaluates the whole subquery each time.
See this entry in my blog for more detail on how this works:
There are links for all of the big four RDBMSs.