IN vs. JOIN with large rowsets

I want to select rows in a table where the primary key is in another table. I'm not sure whether I should use a JOIN or the IN operator in SQL Server 2005. Is there any significant performance difference between these two SQL queries with a large dataset (i.e. millions of rows)?

SELECT * FROM a WHERE a.c IN (SELECT d FROM b)

SELECT a.* FROM a JOIN b ON a.c = b.d

asked Jun 16 '09 13:06

macleojw

1 Answer

Update:

This article on my blog summarizes both my answer and my comments on other answers, and shows actual execution plans:

IN vs. JOIN vs. EXISTS

SELECT  *
FROM    a
WHERE   a.c IN (SELECT d FROM b)

SELECT  a.*
FROM    a
JOIN    b
ON      a.c = b.d

These queries are not equivalent. They can yield different results if your table b is not key-preserved (i.e. the values of b.d are not unique): the JOIN returns a row of a once for every matching row in b, while IN returns it at most once.

The equivalent of the first query is the following:

SELECT  a.*
FROM    a
JOIN    (
        SELECT  DISTINCT d
        FROM    b
        ) bo
ON      a.c = bo.d

If b.d is UNIQUE and marked as such (with a UNIQUE INDEX or UNIQUE CONSTRAINT), then these queries are identical and most probably will use identical plans, since SQL Server is smart enough to take this into account.
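The difference in results is easy to check on a tiny dataset. The following is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for SQL Server (SQLite and the sample rows are assumptions made for illustration; the row-count semantics of IN and JOIN are standard SQL). It runs the three queries above against a table b in which the value 1 appears twice:

```python
import sqlite3

# SQLite is only a convenient stand-in here (assumption, not SQL Server).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (c INTEGER PRIMARY KEY);
    CREATE TABLE b (d INTEGER);              -- d is NOT unique
    INSERT INTO a (c) VALUES (1), (2), (3);
    INSERT INTO b (d) VALUES (1), (1), (2);  -- value 1 appears twice
""")

def count(sql):
    """Run a COUNT(*) query and return the single number it yields."""
    return conn.execute(sql).fetchone()[0]

# IN: each row of a is returned at most once.
in_rows = count("SELECT COUNT(*) FROM a WHERE a.c IN (SELECT d FROM b)")

# Plain JOIN: a.c = 1 matches two rows of b, so it is returned twice.
join_rows = count("SELECT COUNT(*) FROM a JOIN b ON a.c = b.d")

# JOIN against DISTINCT d: equivalent to the IN query again.
distinct_rows = count(
    "SELECT COUNT(*) FROM a JOIN (SELECT DISTINCT d FROM b) bo ON a.c = bo.d"
)

print(in_rows, join_rows, distinct_rows)  # prints: 2 3 2
```

With a UNIQUE index on b.d the duplicate insert would be rejected, and all three counts would agree.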

SQL Server can employ one of the following methods to run the IN query:

If there is an index on a.c, d is UNIQUE and b is relatively small compared to a, then the condition is propagated into the subquery and a plain INNER JOIN is used (with b leading).
If there is an index on b.d and d is not UNIQUE, then the condition is also propagated and LEFT SEMI JOIN is used. It can also be used for the condition above.
If there is an index on both b.d and a.c and both tables are large, then MERGE SEMI JOIN is used.
If there is no index on either table, then a hash table is built on b and HASH SEMI JOIN is used.

None of these methods reevaluates the whole subquery for each row of a.
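To make the semi join idea concrete, here is a conceptual sketch in Python of the hash and merge variants. It is not SQL Server's implementation; the function names and the sample rows are hypothetical and chosen only for illustration. What it shows is that b is read once (or walked once in key order), and that a row of a is emitted at most once no matter how many duplicates b contains:

```python
def hash_semi_join(a_rows, b_values):
    """Return the rows of a whose key occurs in b, each at most once."""
    seen = set(b_values)  # build phase: b is read exactly once
    # Probe phase: one hash lookup per row of a.
    return [row for row in a_rows if row["c"] in seen]


def merge_semi_join(a_sorted, b_sorted):
    """Same result for inputs already sorted on the join key,
    which is the order an index on a.c and b.d would provide."""
    out, j = [], 0
    for row in a_sorted:
        # Advance through b until its key catches up with the current a key.
        while j < len(b_sorted) and b_sorted[j] < row["c"]:
            j += 1
        if j < len(b_sorted) and b_sorted[j] == row["c"]:
            out.append(row)  # emitted once; duplicates in b are skipped later
    return out


a = [{"c": 1}, {"c": 2}, {"c": 3}]
b = [1, 1, 2]  # value 1 is duplicated, as in a non-unique b.d

hash_result = hash_semi_join(a, b)
merge_result = merge_semi_join(a, sorted(b))
print(hash_result)   # prints: [{'c': 1}, {'c': 2}]
print(merge_result)  # prints: [{'c': 1}, {'c': 2}]
```

A plain inner join would differ only in the emit step: it would output one row per matching row of b instead of stopping at the first match.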

See this entry in my blog for more detail on how this works:

Counting missing rows: SQL Server

That entry also links to the corresponding articles for each of the big four RDBMSs.

answered Sep 16 '22 15:09

Quassnoi

Tags: performance, sql, join, sql-server-2005
