
Can this simple SQL query be optimized?

Tags: sql, sql-server

I have the following query:

SELECT COUNT(*) 
FROM Address adr INNER JOIN 
     Audit a on adr.UniqueId = a.UniqueId
  • on a database (1.3 million addresses, more than 4 million audits)
  • both UniqueId columns are clustered primary keys

The query takes quite a long time to complete. I feel dumb, but is there any way to optimize it? I want to count all the address entries that have an underlying auditable.

EDIT: all your inputs are much appreciated, here are some more details:

  • The query will not be run often (it is only for validating), but thanks for the indexed view tip, I will add that to my knowledge for sure.
  • All Addresses have an associated 1-to-1 audit. Not all audits are addresses.
  • The query takes more than 1 minute to finish. I find this too long for a simple count.
Asked by Bruno on May 12 '10

1 Answer

Since you have two sets of data ordered by the same value, have you tried a merge join instead of the nested loop join?

SET STATISTICS IO ON
SET STATISTICS TIME ON

SELECT COUNT(*)  
FROM Address adr INNER JOIN  
     Auditable a on adr.UniqueId = a.UniqueId 
OPTION (LOOP JOIN)

SELECT COUNT(*)  
FROM Address adr INNER JOIN  
     Auditable a on adr.UniqueId = a.UniqueId 
OPTION (MERGE JOIN)

SELECT COUNT(*)  
FROM Address adr INNER JOIN  
     Auditable a on adr.UniqueId = a.UniqueId 
OPTION (HASH JOIN)

Edit:

These explanations are conceptual. SQL Server may be doing more sophisticated operations than my examples show. This conceptual understanding, combined with measuring time and logical IO via the SET STATISTICS commands and examining query execution plans, forms the basis of my query-optimizing technique (grown over four years). May it serve you as well as it has me.

Setup

  • Get 5 decks of cards.
  • Take 1 deck and produce a parent data set.
  • Take the other 4 decks and produce the child data set.
  • Order each data set by card value.
  • Let m be the number of cards in the parent data set.
  • Let n be the number of cards in the child data set.

NestedLoop

  • Take a card off the top of the parent data set.
  • Search (using binary search) within the child data set for the first occurrence of a match.
  • Seek forward in the child data set from the first match until a non-match is found. You've now found all the matches.
  • Repeat this for each card in the parent data set.

The nested loop algorithm iterates the parent data set, and then searches the child data set once for each parent, making it cost: m * log(n)
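As an illustrative sketch (not what SQL Server literally executes), the card-deck steps above look like this in Python, using a binary search into the sorted child set for each parent key:

```python
from bisect import bisect_left

def nested_loop_join(parent, child):
    """Conceptual nested-loop join: for each parent key, binary-search
    the sorted child list for the first match, then scan forward
    through the run of matches. Cost: m * log(n) searches."""
    child = sorted(child)
    pairs = []
    for p in parent:
        i = bisect_left(child, p)          # binary search: O(log n)
        while i < len(child) and child[i] == p:
            pairs.append((p, child[i]))    # emit every matching pair
            i += 1
    return pairs
```

For example, `nested_loop_join([1, 2, 3], [2, 2, 3, 4])` yields `[(2, 2), (2, 2), (3, 3)]`.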

Merge

  • Take a card off the top of the parent data set.
  • Take a card off the top of the child data set.
  • If the cards match, pull cards from the top of each deck until a non-match is found on each side. Produce each matching pair between the parent and child runs.
  • If the cards do not match, find the smaller between the parent and child cards, and take a card off the top of that data set.

The merge join algorithm iterates the parent data set once and the child data set once, making it cost: m + n. It relies on the data being ordered. If you ask for a merge join on un-ordered data, you will incur an ordering operation! This brings the cost to (m * log(m)) + (n * log(n)) + m + n. Even that might be better than nested loop in some cases.
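The same card-walking procedure can be sketched in Python (again, a conceptual model, not SQL Server's actual implementation). Both inputs must already be sorted, and the two lists are each walked once:

```python
def merge_join(parent, child):
    """Conceptual merge join over two sorted lists: advance whichever
    side holds the smaller key; on a match, emit the run of matching
    child keys. Cost: m + n (plus a sort if either input is unordered)."""
    pairs = []
    i = j = 0
    while i < len(parent) and j < len(child):
        if parent[i] == child[j]:
            # Emit the run of equal child keys for this parent key.
            # j is not advanced, so a duplicate parent key re-scans the run.
            k = j
            while k < len(child) and child[k] == parent[i]:
                pairs.append((parent[i], child[k]))
                k += 1
            i += 1
        elif parent[i] < child[j]:
            i += 1                 # parent card is smaller: draw from parent
        else:
            j += 1                 # child card is smaller: draw from child
    return pairs
```

For example, `merge_join([1, 2, 3], [2, 2, 3, 4])` yields `[(2, 2), (2, 2), (3, 3)]` in a single pass over each list.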

Hash

  • Get a card table.
  • Take each card from the parent data set and place it on the card table where you can find it (does not have to be anything to do with card value, just has to be convenient for you).
  • Take each card from the child data set, locate its matching parent on the card table and produce the matching pair.

The hash join algorithm iterates the parent data set once and the child data set once, making it cost: m + n. It relies on having a big enough card table to hold the entire contents of the parent data set.
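A conceptual Python sketch of the build-and-probe steps above (the dictionary plays the role of the card table; SQL Server's real hash join is more elaborate):

```python
from collections import defaultdict

def hash_join(parent, child):
    """Conceptual hash join: build a hash table on the parent keys
    (the 'card table'), then probe it once per child row.
    Cost: m + n; no ordering required, but the table must fit in memory."""
    table = defaultdict(list)      # build phase: O(m)
    for p in parent:
        table[p].append(p)
    pairs = []
    for c in child:                # probe phase: O(n)
        for p in table.get(c, []):
            pairs.append((p, c))
    return pairs
```

For example, `hash_join([1, 2, 3], [2, 2, 3, 4])` yields `[(2, 2), (2, 2), (3, 3)]`, and neither input needs to be sorted.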

Answered by Amy B on Sep 21 '22