Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL Sever Query Join Optimization

I have looked for answers online but cant find a definitive answer. For example you have 2 join clauses:

1.

JOIN T2 ON T1.[ID] = T2.[ID]

2.

JOIN T2 ON T1.[ID] = REPLACE(T2.[ID],'A', '')

Now the 2nd one performs worse due to the function on the join clause. What is the exact reason for this?

And for example, if this code was in a stored procedure what would be best to optimise it? To remove the replace function and add it to the table level so all of this is completed before any joins?

Any advice or links to further information would be great. Thanks

like image 268
Ryan Gadsdon Avatar asked Mar 09 '26 16:03

Ryan Gadsdon


2 Answers

In your second example, you are attempting to find a record in T2 - but instead of the value being the T1.ID value, you are applying a function to T2.ID - REPLACE(T2.[ID],'A', '')

If you had an index on T2.ID - at best it would scan the index and not seek it - thus causing a performance difference.

This is where it get's harder to explain - the index is stored as a b+tree, of the values for the T2.ID on the table. The index understands that field and can search / sort by it, but it doesn't understand any logic applied to it.

It does not know if REPLACE('A123','A', '') = 123 - without in executing the function on the value in the index and checking the resulting equality.

AAA123 would also be equal, 1A23, 12A3, 123A etc, there is a never ending amount of combinations that would in fact match - but the only way in which it can figure out if a single index entry matches is to run that value through the function and then check the equality.

If it can only figure that out when running the index value through the function - it can only properly answer the query correctly if it does that for every entry in the index - e.g. an index scan of every entry, being passed into the function and the output being checked.

As Jeroen mentions the term is SARGable or SARGability, Search ARGumentABLE, although I personally prefer to explain it as Seek ARGumentABLE since that is a closer match to the query plan operator.

It should be noted that this concept has nothing to do with it being a join, any predicate within SQL has this restriction - a single table query with a where predicate can have the same issue.

Can this problem be avoided? It can but only in some instances, where you can reverse the operation.

Consider a table with an ID column, I could construct a predicate such as this : WHERE ID * 2 = @paramValue

The only way SQL Server would know if an ID entry multiplied by 2 is the passed in value is to process every entry, double it and check. So that is the index scan scenario again.

In this instance we can re-write it: WHERE ID = @paramValue / 2.0

Now SQL Server will perform the mathematics once, divide the passed in value and it can then check that against the index in a seekable manner. The difference in the SQL written looks a potentially trivial difference of stating the problem, but makes a very large difference to how the database can resolve the predicate.

like image 80
Andrew Avatar answered Mar 12 '26 06:03

Andrew


SQL Server has four basic methods for handling joins (as do other databases):

  1. Nested loop without an index. This is like two nested for loops and is usually the slowest method.
  2. Index looped (nested loop with an index). This a scan of one table with a lookup in the second.
  3. Merge join. This assumes that the two tables are ordered and loops through the two tables at the same time (this can also be accomplished using indexes).
  4. Hash join. The keys for the two tables are hashed and hash-tables are used for matching.

In general, the first of these is the slowestthe second of these -- using an index -- is the fastest. (There are exceptions). The second is often the fastest.

When you use an equality comparison between two columns in the table, SQL Server has a lot of information for deciding on the best join algorithm to use:

  • It has information on indexes.
  • It has statistics on the column.

Without this information, SQL Server often defaults to the nested-loop join. I find that it does this even when it could use the expression for a merge- or hash- based join.

As a note, you can work around this by using a computed column:

alter table t2 add id_no_a as (replace(id, 'A', '')) persisted;

create index idx_t2_id_no_a on t2(id_no_a);

Then phrase

on T1.[ID] = t2.id_no_a
like image 29
Gordon Linoff Avatar answered Mar 12 '26 05:03

Gordon Linoff



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!