 

What is better - SELECT TOP (1) or INNER JOIN?

Tags:

sql

sql-server

Let's say I have following query:

SELECT Id, Name, ForeignKeyId, 
(SELECT TOP (1) FtName FROM ForeignTable WHERE FtId = ForeignKeyId) 
FROM Table

Would that query execute faster if it is written with JOIN:

SELECT Id, Name, ForeignKeyId, FtName
FROM Table t
LEFT OUTER JOIN ForeignTable ft
ON ft.FtId = t.ForeignKeyId

Just curious... also, if JOINs are faster, will they be faster in all cases (tables with lots of columns, large numbers of rows)?

EDIT: The queries I wrote just illustrate the concept of TOP (1) vs JOIN. Yes, I know about query execution plans in SQL Server, but I'm not looking to optimize a single query. I'm trying to understand whether there is a theory behind SELECT TOP (1) vs JOIN, and whether one approach is preferred because of speed (not because of personal preference or readability).

EDIT2: I would like to thank Aaron for his detailed answer and encourage people to check out SQL Sentry Plan Explorer, the free tool from his company that he mentions in his answer.

nikib3ro asked Aug 29 '11 19:08


1 Answer

Originally, I wrote:

The first version of the query is MUCH less readable to me. Especially since you don't bother aliasing the matched column inside the correlated subquery. JOINs are much clearer.

I still stand by those statements, but I'd like to add to my original response based on the new information in the question. You asked whether there are general rules or theories about what performs better, a TOP (1) or a JOIN, leaving readability and preference aside. As I said in my comment: no, there are no general rules or theories. When you have a specific example, though, it is very easy to prove which works better. Let's take these two queries, similar to yours but running against system objects that we can all verify:

-- query 1:

SELECT name,
   (SELECT TOP (1) [object_id] 
        FROM sys.all_sql_modules 
        WHERE [object_id] = o.[object_id]
   )
FROM sys.all_objects AS o;

-- query 2:

SELECT o.name, m.[object_id]
    FROM sys.all_objects AS o
    LEFT OUTER JOIN sys.all_sql_modules AS m
    ON o.[object_id] = m.[object_id];
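
If you want to confirm on your own system that the two queries produce the same data, one option is to compare them with EXCEPT in both directions. This is only a sketch and not part of the original test; the alias module_object_id is made up here for illustration. EXCEPT compares distinct rows, so it will not reveal a difference in duplicate rows; compare the row counts separately for that.

```sql
-- Rows returned by query 1 but not by query 2 (expect an empty result).
SELECT o.name,
   (SELECT TOP (1) m.[object_id]
        FROM sys.all_sql_modules AS m
        WHERE m.[object_id] = o.[object_id]
   ) AS module_object_id  -- hypothetical alias, added for illustration
FROM sys.all_objects AS o
EXCEPT
SELECT o.name, m.[object_id]
    FROM sys.all_objects AS o
    LEFT OUTER JOIN sys.all_sql_modules AS m
    ON o.[object_id] = m.[object_id];

-- Rows returned by query 2 but not by query 1 (expect an empty result).
SELECT o.name, m.[object_id]
    FROM sys.all_objects AS o
    LEFT OUTER JOIN sys.all_sql_modules AS m
    ON o.[object_id] = m.[object_id]
EXCEPT
SELECT o.name,
   (SELECT TOP (1) m.[object_id]
        FROM sys.all_sql_modules AS m
        WHERE m.[object_id] = o.[object_id]
   )
FROM sys.all_objects AS o;
```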

These return the same results (3,179 rows on my system), by which I mean the same data and the same number of rows. One clue that they're not really the same query (or at least not following the same execution plan) is that the results come back in a different order. While I wouldn't expect a certain order to be maintained or obeyed, because I didn't include an ORDER BY anywhere, I would expect SQL Server to choose the same ordering if they were, in fact, using the same plan.

But they're not. We can see this by inspecting the plans and comparing them. In this case I'll be using SQL Sentry Plan Explorer, a free execution plan analysis tool from my company. You can get some of this information from Management Studio, but other parts (such as actual duration and CPU) are much more readily available in Plan Explorer. The subquery version is on top and the join version is on the bottom:

[Image: execution plan for the subquery version]

[Image: execution plan for the join version]

According to the actual execution plans, 85% of the combined cost of running the two queries is in the subquery version, which makes it more than 5 times as expensive as the join. Both CPU and I/O are much higher with the subquery version. Look at all those reads! It takes 6,600+ pages to return ~3,000 rows, whereas the join version returns the same data using much less I/O: only 110 pages.
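
If you don't have Plan Explorer at hand, a rough way to see comparable read and CPU numbers in Management Studio is to turn on SET STATISTICS IO and SET STATISTICS TIME around the two queries. This is a minimal sketch; the figures you get will differ from the ones quoted above, and because these are system views the reads are reported against the underlying system tables.

```sql
SET STATISTICS IO ON;    -- logical/physical reads per table, in the Messages tab
SET STATISTICS TIME ON;  -- CPU time and elapsed time per statement

-- query 1: correlated subquery
SELECT name,
   (SELECT TOP (1) [object_id]
        FROM sys.all_sql_modules
        WHERE [object_id] = o.[object_id]
   )
FROM sys.all_objects AS o;

-- query 2: join
SELECT o.name, m.[object_id]
    FROM sys.all_objects AS o
    LEFT OUTER JOIN sys.all_sql_modules AS m
    ON o.[object_id] = m.[object_id];

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```

Compare the "logical reads" totals and the "CPU time" lines that each statement prints.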

But why? Because the subquery version works essentially like a scalar function, where you're going and grabbing the TOP matching row from the other table, but doing it for every row in the original query. We can see that the operation occurs 3,179 times by looking at the Top Operations tab, which shows number of executions for each operation. Once again, the more expensive subquery version is on top, and the join version follows:

[Image: Top Operations tab for the subquery version]

[Image: Top Operations tab for the join version]
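
Per-operator execution counts can also be seen without Plan Explorer. One way, sketched here, is SET STATISTICS PROFILE, which returns an extra result set after each query with Rows and Executes columns for every operator in the plan. The operators and counts you see depend on your SQL Server version and on how many objects your system has.

```sql
SET STATISTICS PROFILE ON;

-- Subquery version: look for operators under the Nested Loops join
-- whose Executes value is roughly the number of rows in sys.all_objects.
SELECT name,
   (SELECT TOP (1) [object_id]
        FROM sys.all_sql_modules
        WHERE [object_id] = o.[object_id]
   )
FROM sys.all_objects AS o;

-- Join version: compare the Executes values for the same tables.
SELECT o.name, m.[object_id]
    FROM sys.all_objects AS o
    LEFT OUTER JOIN sys.all_sql_modules AS m
    ON o.[object_id] = m.[object_id];

SET STATISTICS PROFILE OFF;
```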

I'll spare you a more thorough analysis, but by and large, the optimizer knows what it's doing. State your intent (a join of this type between these tables) and 99% of the time it will work out on its own the best underlying way to do it (i.e., the execution plan). If you try to outsmart the optimizer, keep in mind that you're venturing into quite advanced territory.

There are exceptions to every rule, but in this specific case, the subquery is definitely a bad idea. Does that mean the proposed syntax in the first query is always a bad idea? Absolutely not. There may be obscure cases where the subquery version works just as well as the join. I can't think of many where the subquery will work better. So I would err on the side of the one that is more likely to be as good or better and that is more readable. I see no advantages to the subquery version, even if you find it more readable, because it is most likely going to result in worse performance.
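
One thing to keep in mind when weighing the two forms: they are only interchangeable when the lookup column is unique. The subquery with TOP (1) returns at most one value per outer row, while a join returns one row per match. The sketch below uses made-up temp tables (#Parent and #Lookup are hypothetical names, not from the question) in which FtId is deliberately not unique.

```sql
-- Hypothetical temp tables, for illustration only.
CREATE TABLE #Parent (Id int PRIMARY KEY, ForeignKeyId int NULL);
CREATE TABLE #Lookup (FtId int NOT NULL, FtName varchar(20) NOT NULL); -- FtId is NOT unique

INSERT #Parent (Id, ForeignKeyId) VALUES (1, 10), (2, 20), (3, NULL);
INSERT #Lookup (FtId, FtName) VALUES (10, 'a'), (10, 'b'), (20, 'c');

-- Subquery form: exactly one row per #Parent row (3 rows).
SELECT p.Id,
   (SELECT TOP (1) l.FtName
        FROM #Lookup AS l
        WHERE l.FtId = p.ForeignKeyId
        ORDER BY l.FtName   -- makes the choice of row deterministic
   ) AS FtName
FROM #Parent AS p;

-- Join form: Id = 1 matches two lookup rows, so it appears twice (4 rows).
SELECT p.Id, l.FtName
    FROM #Parent AS p
    LEFT OUTER JOIN #Lookup AS l
    ON l.FtId = p.ForeignKeyId;

DROP TABLE #Lookup;
DROP TABLE #Parent;
```

When the lookup column is a primary key or has a unique constraint, as in the system-view example above, this difference does not arise.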

In general, I highly advise you to stick to the more readable, self-documenting syntax unless you find a case where the optimizer is not doing it right (and I would bet that in 99% of those cases the issue is bad statistics or parameter sniffing, not a query syntax issue). I would suspect that, outside of those cases, repros in which convoluted queries work better than their more direct and logical equivalents would be quite rare. Your motivation for trying to find those cases should be about the same as your preference for the unintuitive syntax over generally accepted "best practice" syntax.

Aaron Bertrand answered Oct 12 '22 05:10
