Indexing simple query in a huge database

Question

I got a table contains nearly 850,000,000 rows.

The table has the following fields:

[ID] [bigint] IDENTITY(1,1) NOT NULL,
[D1] [int] NOT NULL,
[D2] [int] NOT NULL,
[D3] [int] NOT NULL,
[D4] [int] NOT NULL,
[D5] [int] NOT NULL,
[D6] [int] NOT NULL,
[D7] [int] NOT NULL,
[D8] [int] NOT NULL,
[D9] [int] NOT NULL,
[A] [int] NOT NULL,
[Hb] [bit] NOT NULL,

All my queries for this table are quite the same -

Select [D1-D9], [A] Where [Hb] = 0 AND [D1] <> x AND [D2] <> y AND [D3] = z,

etc....

Each query will ALWAYS query ALL [D1-D9] fields and always ask for [Hb] = 0

Example for a query:

SELECT [D1], [D2], [D3], [D4], [D5], [D6],[D7], [D8],[D9], [A] 
  from [myTable] 
 WHERE [D1] <> 8 AND [D2] <> 2 AND [D3] <> 5 AND [D4] = 8 AND [D5] = 2 
   AND [D6] = 5 AND [D7] = 5 AND [D8] = 3 AND [D9] = 4 AND [A] = 0 AND [Hb] = 0

How should I index this table for the fastest results?

Thanks a lot

gbn · Accepted Answer

The best you can do is have your index do equality checks first followed by a residual non-equality lookup. That is, = before <>.

Rearranging the WHERE clause:

WHERE
--Equality
D4 = 8 AND D5 = 2 AND D6 = 5 AND D7 = 5 AND D8 = 3 AND D9 = 4 AND A = 0 
--in the middle    
AND Hb = 0
--Non-Equality
D1 <> 8 AND D2 <> 2 AND D3 <> 5

So, first draft is this:

CREATE .. INDEX ... ON (D4, D5, D6, D7, D8, D9, A, Hb, D1, D2, D3)

The order of D4 to D9 should be based on selectivity. Highest numbers first. Hb should always go last in the equality columns because it's bit

SELECT
   COUNT(DISTINCT D4) AS D4COunt,
   COUNT(DISTINCT D5) AS D5COunt,
   COUNT(DISTINCT D6) AS D6COunt,
   COUNT(DISTINCT D7) AS D7COunt,
   COUNT(DISTINCT D8) AS D8COunt,
   COUNT(DISTINCT D9) AS D9COunt,
   COUNT(DISTINCT A) AS ACOunt
FROM
    Mytable

Finally, this can be clustered or non-clustered. If you have no other indexes or no FKs then I'd consider make this the clustered PK. Otherwise, just create a clustered surrogate key and make this index NONCLUSTERED

Edit:

An article that (hopefully) explains why column order matters for mulitple column indexes: Craig Freedman's Seek Predicates. And his Scans and Seeks too

Edit2:

I asked if the = before <> are on the same columns: it appeared "yes". OP's comment to this answer says "no" so everything I've said here is pointless

The answer from Damien_The_Unbeliever's suggested index intersections to try and get around this the equality/nonequality mix.

Damien_The_Unbeliever · Answer

You may find (if the individual equality/inequality tests are different for the ten columns in each query) that the best you can do is build a narrow index on each column individually, and then hope that the optimizer will apply index intersection, where it will use the indexes on each column where it makes sense to do so.

Quassnoi · Answer

Extending @gbn's answer.

For a table of this size, you definitely need an index which would cover all columns selected.

However, for each column you should decide whether you want it to be a key column or an included column in the index.

To do this, run this query:

SELECT  SUM(CASE D1 WHEN 8 THEN 0 ELSE 1 END) / COUNT(*) AS D1Card,
        SUM(CASE D2 WHEN 2 THEN 0 ELSE 1 END) / COUNT(*) / COUNT(DISTINCT D2) AS D2Card,
        SUM(CASE D3 WHEN 5 THEN 0 ELSE 1 END) / COUNT(*) / COUNT(DISTINCT D3) AS D3Card,
        SUM(CASE d4 WHEN 8 THEN 1 ELSE 0 END) / COUNT(DISTINCT D4) AS D4Card,
        SUM(CASE d5 WHEN 2 THEN 1 ELSE 0 END) / COUNT(DISTINCT D5) AS D5Card,
        SUM(CASE d6 WHEN 5 THEN 1 ELSE 0 END) / COUNT(DISTINCT D6) AS D6Card,
        SUM(CASE d7 WHEN 5 THEN 1 ELSE 0 END) / COUNT(DISTINCT D7) AS D7Card,
        SUM(CASE d8 WHEN 3 THEN 1 ELSE 0 END) / COUNT(DISTINCT D8) AS D8Card,
        SUM(CASE d9 WHEN 4 THEN 1 ELSE 0 END) / COUNT(DISTINCT D9) AS D9Card,
        SUM(CASE a WHEN 0 THEN 1 ELSE 0 END) / COUNT(DISTINCT A) AS ACard,
        SUM(CASE Hb WHEN 0 THEN 1 ELSE 0 END) / COUNT(DISTINCT Hb) AS HbCard
FROM    Mytable

You should create a list of the least selective columns (those with the highest values of *Card) which (together) comprise more than 25% of your records.

Say, the selectivity chart on the columns looks like this:

Column  Selectivity  Cumulative selectivity
D4      0.96         0.96
D8      0.87         0.84
D9      0.85         0.70
D7      0.72         0.51
D6      0.65         0.33 -- here
D5      0.20         0.07
A       0.02         0.00
Hb      0.01         0.00

This means that the conditions on columns d4, d8, d9, d7, d6 together match about 33% of your records.

In this case, there is no need to use them as key columns. You should create an index on the other, selective, columns and include the non-selective ones into the index.

CREATE INDEX ix_mytable_filter ON (Hb, A, D5) INLCUDE (D1, D2, D3, D4, D6, D7, D8, D9)

The columns with the non-equality filter always go to the INCLUDE section.

Note that it will only improve the current query, with the given values of the filters. If your filters are arbitrary, you would need to use all equality filtered columns as the keys of the index.

It may also be case that the conditions like [D1] <> 8 involve magic numbers, and there are few records for which this condition holds.

In this case, you can add a computed column into your table's definition:

ALTER TABLE mytable ADD d1_ne_8 AS CASE D1 WHEN 8 THEN 0 ELSE 1 END

and add this expression to the index (with regard to the rules above).

If you do this, you will have to use d1_ne_8 = 1 instead of d1 <> 8.

Indexing simple query in a huge database

Tags:

sql

sql-server

indexing

data-structures

database-design

Shay

3 Answers

gbn

Damien_The_Unbeliever

Quassnoi

Recent Activity

Donate For Us

Indexing simple query in a huge database

Tags:

sql

sql-server

indexing

data-structures

database-design

Shay

3 Answers

gbn

Damien_The_Unbeliever

Quassnoi

Related questions

Recent Activity

Donate For Us