I got a table contains nearly 850,000,000 rows.
The table has the following fields:
[ID] [bigint] IDENTITY(1,1) NOT NULL,
[D1] [int] NOT NULL,
[D2] [int] NOT NULL,
[D3] [int] NOT NULL,
[D4] [int] NOT NULL,
[D5] [int] NOT NULL,
[D6] [int] NOT NULL,
[D7] [int] NOT NULL,
[D8] [int] NOT NULL,
[D9] [int] NOT NULL,
[A] [int] NOT NULL,
[Hb] [bit] NOT NULL,
All my queries for this table are quite the same -
Select [D1-D9], [A] Where [Hb] = 0 AND [D1] <> x AND [D2] <> y AND [D3] = z,
etc....
Each query will ALWAYS query ALL [D1-D9] fields and always ask for [Hb] = 0
Example for a query:
SELECT [D1], [D2], [D3], [D4], [D5], [D6],[D7], [D8],[D9], [A]
from [myTable]
WHERE [D1] <> 8 AND [D2] <> 2 AND [D3] <> 5 AND [D4] = 8 AND [D5] = 2
AND [D6] = 5 AND [D7] = 5 AND [D8] = 3 AND [D9] = 4 AND [A] = 0 AND [Hb] = 0
How should I index this table for the fastest results?
Thanks a lot
The best you can do is have your index do equality checks first followed by a residual non-equality lookup. That is, =
before <>
.
Rearranging the WHERE clause:
WHERE
--Equality
D4 = 8 AND D5 = 2 AND D6 = 5 AND D7 = 5 AND D8 = 3 AND D9 = 4 AND A = 0
--in the middle
AND Hb = 0
--Non-Equality
D1 <> 8 AND D2 <> 2 AND D3 <> 5
So, first draft is this:
CREATE .. INDEX ... ON (D4, D5, D6, D7, D8, D9, A, Hb, D1, D2, D3)
The order of D4
to D9
should be based on selectivity. Highest numbers first. Hb
should always go last in the equality columns because it's bit
SELECT
COUNT(DISTINCT D4) AS D4COunt,
COUNT(DISTINCT D5) AS D5COunt,
COUNT(DISTINCT D6) AS D6COunt,
COUNT(DISTINCT D7) AS D7COunt,
COUNT(DISTINCT D8) AS D8COunt,
COUNT(DISTINCT D9) AS D9COunt,
COUNT(DISTINCT A) AS ACOunt
FROM
Mytable
Finally, this can be clustered or non-clustered. If you have no other indexes or no FKs then I'd consider make this the clustered PK. Otherwise, just create a clustered surrogate key and make this index NONCLUSTERED
Edit:
An article that (hopefully) explains why column order matters for mulitple column indexes: Craig Freedman's Seek Predicates. And his Scans and Seeks too
Edit2:
I asked if the =
before <>
are on the same columns: it appeared "yes".
OP's comment to this answer says "no" so everything I've said here is pointless
The answer from Damien_The_Unbeliever's suggested index intersections to try and get around this the equality/nonequality mix.
You may find (if the individual equality/inequality tests are different for the ten columns in each query) that the best you can do is build a narrow index on each column individually, and then hope that the optimizer will apply index intersection, where it will use the indexes on each column where it makes sense to do so.
Extending @gbn's answer.
For a table of this size, you definitely need an index which would cover all columns selected.
However, for each column you should decide whether you want it to be a key column or an included column in the index.
To do this, run this query:
SELECT SUM(CASE D1 WHEN 8 THEN 0 ELSE 1 END) / COUNT(*) AS D1Card,
SUM(CASE D2 WHEN 2 THEN 0 ELSE 1 END) / COUNT(*) / COUNT(DISTINCT D2) AS D2Card,
SUM(CASE D3 WHEN 5 THEN 0 ELSE 1 END) / COUNT(*) / COUNT(DISTINCT D3) AS D3Card,
SUM(CASE d4 WHEN 8 THEN 1 ELSE 0 END) / COUNT(DISTINCT D4) AS D4Card,
SUM(CASE d5 WHEN 2 THEN 1 ELSE 0 END) / COUNT(DISTINCT D5) AS D5Card,
SUM(CASE d6 WHEN 5 THEN 1 ELSE 0 END) / COUNT(DISTINCT D6) AS D6Card,
SUM(CASE d7 WHEN 5 THEN 1 ELSE 0 END) / COUNT(DISTINCT D7) AS D7Card,
SUM(CASE d8 WHEN 3 THEN 1 ELSE 0 END) / COUNT(DISTINCT D8) AS D8Card,
SUM(CASE d9 WHEN 4 THEN 1 ELSE 0 END) / COUNT(DISTINCT D9) AS D9Card,
SUM(CASE a WHEN 0 THEN 1 ELSE 0 END) / COUNT(DISTINCT A) AS ACard,
SUM(CASE Hb WHEN 0 THEN 1 ELSE 0 END) / COUNT(DISTINCT Hb) AS HbCard
FROM Mytable
You should create a list of the least selective columns (those with the highest values of *Card
) which (together) comprise more than 25%
of your records.
Say, the selectivity chart on the columns looks like this:
Column Selectivity Cumulative selectivity
D4 0.96 0.96
D8 0.87 0.84
D9 0.85 0.70
D7 0.72 0.51
D6 0.65 0.33 -- here
D5 0.20 0.07
A 0.02 0.00
Hb 0.01 0.00
This means that the conditions on columns d4, d8, d9, d7, d6
together match about 33%
of your records.
In this case, there is no need to use them as key columns. You should create an index on the other, selective, columns and include the non-selective ones into the index.
CREATE INDEX ix_mytable_filter ON (Hb, A, D5) INLCUDE (D1, D2, D3, D4, D6, D7, D8, D9)
The columns with the non-equality filter always go to the INCLUDE
section.
Note that it will only improve the current query, with the given values of the filters. If your filters are arbitrary, you would need to use all equality filtered columns as the keys of the index.
It may also be case that the conditions like [D1] <> 8
involve magic numbers, and there are few records for which this condition holds.
In this case, you can add a computed column into your table's definition:
ALTER TABLE mytable ADD d1_ne_8 AS CASE D1 WHEN 8 THEN 0 ELSE 1 END
and add this expression to the index (with regard to the rules above).
If you do this, you will have to use d1_ne_8 = 1
instead of d1 <> 8
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With