Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Indexes for inner joins with where clause

If I had the following query:

select some cols 
   from tbl_a
INNER JOIN tbl_b ON tbl_a.orderNumber = tbl_b.orderNumber
   where tlb_b.status = 'XX'

Assuming both tables have clustered indexes on order number only, would it be better from a performance perspective to extend the clustered index on table b to include the status column referenced in the where clause?

like image 958
SuperCoolMoss Avatar asked Jan 04 '10 21:01

SuperCoolMoss


2 Answers

  1. You extend tbl_b to add status after the orderNumber: create clustered index ... on tbl_b(orderNumber, status). For the query above there will be no noticeable difference. The plan will still have to scan tbl_b end to end and match every order number in tbl_a (probably a merge join).

  2. You extend tbl_b to add status before the orderNumber: create clustered index ... on tbl_b (status, orderNumber). Now there is a HUGE difference. The plan can do a range scan on tbl_b to get only those with Status 'xx' and only match tbl_a for the corrsponding orderNumber, using a nested loop join.

Placing a low selectivity column (like 'status' usually is) as the leftmost key in an index is usually a good thing. And making a row like 'status' the leftmost column in a clustered index is also usually a good thing, because it groups records with same status together physically. Note that doing so will have an impact on all queries. You also loose the direct access by orderNumber if status is not specified, you'll have to add a non-clustered index on orderNumber alone to cover that (which is usualy the PK non-clustered index).

I made all these comments w/o knowing your actual data cardinality and selectivity. If the cardinality of tbl_a and tbl_b is very skewed then things may be different. Eg. if tbl_a has 10 records with 10 distinct order numbers and tbl_b has 10M records with 10M order numbers than my advice the option 2. would make little difference, since the plan will always choose a scan of tbl_a a seek range lookups in tbl_b 10 times.

like image 196
Remus Rusanu Avatar answered Sep 19 '22 16:09

Remus Rusanu


Yes, quite possibly. This is called a covering index. The entire query can be served from the index, without accessing tbl_b at all.

However, you should consider the impact on performance of other queries, particularly ones that update the status column.

like image 33
David M Avatar answered Sep 19 '22 16:09

David M