Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL Server why is index not used with OR

I've been studying indexes and trying to understand how they work and how I can use them to boost performance, but I'm missing something.

I have the following table:

Person:

| Id | Name | Email | Phone |
| 1  | John |  E1   |  P1   |
| 2  | Max  |  E2   |  P2   |

I'm trying to find the best way to index the columns Email and Phone considering that the queries will (most of the time) be of the form

[1] SELECT * FROM Person WHERE Email = '...' OR Phone = '...'
[2] SELECT * FROM Person WHERE Email = ...
[3] SELECT * FROM Person WHERE Phone = ...

I thought the best approach would be to create a single index using both columns:

CREATE NONCLUSTERED INDEX [IX_EmailPhone]
ON [dbo].[Person]([Email], [PhoneNumber]);

However, with the index above, only the query [2] benefits from an index seek, the others use index scan.

I also tried to create multiple index: one with both columns, one for email, and one for email. In this case, [2] and [3] use seek, but [1] continues to use scan.

Why can't the database use index with an or? What would be the best indexing approach for this table considering the queries?

like image 790
victor Avatar asked Nov 29 '16 19:11

victor


2 Answers

Use two separate indexes, one on (email) and one on (phone, email).

The OR is rather difficult. If your conditions were connected by AND rather than OR, then your index would be used for the first query (but not the third, because phone is not the first key in the index).

You can write the query as:

SELECT *
FROM Person 
WHERE Email = '...' 
UNION ALL
SELECT *
FROM Person 
WHERE Email <> '...' AND Phone = '...';

SQL Server should use the appropriate index for each subquery.

like image 86
Gordon Linoff Avatar answered Oct 09 '22 19:10

Gordon Linoff


Create a separate index for each column.
By using hints we can force the optimizer to use/not use the indexes, so you can check the execution plan, get a feeling of the performance involved and understand the meaning of each path.

Go through my demo and consider the work involved in each path for the following scenarios -

  1. Only few rows satisfy the condition j=123.
    Only few rows satisfy the condition k=456.

  2. Most of the rows satisfy the condition j=123.
    Most of the rows satisfy the condition k=456.

  3. Only few rows satisfy the condition j=123.
    Most of the rows satisfy the condition k=456.

Try to think what path you would have chosen for each scenario.
Please feel free to ask questions.


Demo

;with t(n) as (select 0 union all select n+1 from t where n < 999)

select      1+t0.n+1000*t1.n                                as i
           ,floor(rand(cast (newid() as varbinary))*1000)   as j
           ,floor(rand(cast (newid() as varbinary))*1000)   as k 

into        t

from        t t0,t t1 

option       (maxrecursion 0)
;

create index t_j on t (j);
create index t_k on t (k);

update statistics t (t_j)
update statistics t (t_k)

Scan

select      *
from        t (forcescan)
where       j = 123
        or  k = 456
  • This is straightforward.

enter image description here

Seek

select      *
from        t (forceseek)
where       j = 123
        or  k = 456
  • "Index Seek": Each index is being seeked for the relevant values (123 and 456)
  • "Merge Join": The results (row IDs) are being concatenated (as in UNION ALL)
  • "Stream Aggregate": Duplicate row IDs are being eliminated
  • "Rid Lookup" + "Nested Loops": The row IDs are being used to retrieve the rows from the table (t)

enter image description here

like image 45
David דודו Markovitz Avatar answered Oct 09 '22 17:10

David דודו Markovitz