Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Type 2 Effective Dated Index Performance with Multiple Dimensions in TSQL

How do composite indices work on effective dated tables?

Utilizing T-SQL, let's say I have a table that is effective dated with an EffectiveStartDate and EffectiveEndDate associated with a product to record historic price fluctuations so my table would take the form:

MyTable := (EffStartDate date, EffEndDate date, ProductID int, ProductPrice money) where EffEndDate = '12/31/9999' when the record is valid presently.

Let's further assume I implement two indices on this table in the form: Clustered on (EffEndDate, EffStartDate, ProductID) Nonclustered on (EffEndDate, ProductID)

From my understanding, the index creation for clustered indices stores the information in a B-tree (potentially B+) ordered by the column specification order of the index creation statement. Therefore, I envision the table sorting on EffEndDate, then EffStartDate, then ProductID. Most of the time I want to query historically out of this table with a query similar to this: select * from MyTable where ProductID = @ProductID and @MyDate between EffStartDate and EffEndDate.

I am trying to visualize how the B-tree actually stores the information related to these three columns. Does it store it as a tuple object like you could find in Python, or does it add more dimensions to the B tree when the index is composite? For instance, for a given EffEndDate does the B-tree have multiple splitting trees relating to EffStartDates and then multiple splitting trees relating to ProductIDs or is each split based on a tuple? This response seems to believe it takes the tuple approach: Question.

If it takes the single dimension approach, I find it difficult to conceptualize how these types of indices provide holistic value for the date range look-up between two columns. For instances, I see it as happening such that, given a date (@MyDate), we can use the EffEndDate component of the index to limit our search to only EffEndDates >= @MyDate, then use the EffStartDate component to limit our search to only EffStartDate <= @MyDate, and then search for the ProductID within this remaining range. Is this how the index would be used?

The problem I foresee with this is that, if we have about 100k products that get updated non-uniformly every week, we would end up utilizing this clustered index to generate a giant set of capable date ranges, and then would have to search each date range for an instance of our desired ProductID. Is there a better index to implement on this type of query?

I believe the nonclustered index exists to quickly search out current ProductID prices as we only need two pieces of the puzzle for this since EffEndDate would be set to '12/31/9999'.

Alternatively, is there a way to implement a multidimensional index spanning two columns to improve query performance in T-SQL?

Thank you!

like image 302
PkmnBugCatcher Avatar asked May 15 '15 19:05

PkmnBugCatcher


2 Answers

This is an application that really calls for a 2D, or spatial, index, as you correctly note, since you are effectively composing two separate inequality searches together. Without jamming your tables into a form where you can use SQL Server's spatial indexes, your options are limited.

The best approach, if it's possible, is to find some sort of business relationship between EffStartDate and EffEndDate. If there is a rule that these values can't be farther apart than a year, for example, then this is something that could be coded in to your WHERE clause to give you additional selectivity on the indexes you might otherwise take large scans over.

Something like:

SELECT *
FROM Table
WHERE @date BETWEEN EffStartDate and EffEndDate
    AND DATEADD(year, -1, @date) < EffStartDate

...where you are adding an additional business constraint to reduce the search space the query needs to traverse.

Two articles that might be of interest to you are these:

Quassnoi's answer to a similar question, which talks about how to force-fit this type of data into a format that can be spatially indexed, and also has a link to his blog that details a recursive CTE method that can be used to accelerate these types of queries without schema modification.

Michael Asher's article on using business knowledge to boost performance over similar types of queries.

like image 136
mwigdahl Avatar answered Oct 18 '22 20:10

mwigdahl


Simulate the real data. Generate large table (the size of the final table should be the same as you expect in real life) with distribution of products and dates as you'd expect in real life. Start with adding three separate independent indexes on products, start date, end date. Try to run the query. Analyze the execution plan. Try other combinations of indexes. Compare plans and performance. If nothing gives acceptable performance, return here with a script that generates sample data and your query.

In my test the optimizer was inner joining results of three independent index seeks.

Create table

plus three independent indexes for each column:

CREATE TABLE [dbo].[Test](
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [ProductID] [int] NOT NULL,
    [StartDate] [date] NOT NULL,
    [EndDate] [date] NOT NULL,
 CONSTRAINT [PK_Test] PRIMARY KEY CLUSTERED 
(
    [ID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

CREATE NONCLUSTERED INDEX [IX_EndDate] ON [dbo].[Test] 
(
    [EndDate] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]

CREATE NONCLUSTERED INDEX [IX_ProductID] ON [dbo].[Test] 
(
    [ProductID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]

CREATE NONCLUSTERED INDEX [IX_StartDate] ON [dbo].[Test] 
(
    [StartDate] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]

Generate test data

  • 1M rows in total.
  • Up to 100 different product IDs with uniform distribution.
  • Start dates are within 10,000 days from 2000-01-01 (~27 years time span)
  • End dates are within 1000 days from the start date (duration up to ~3 years)

query:

INSERT INTO Test(ProductID, StartDate, EndDate)
SELECT TOP(1000000)
    CA.ProductID
    ,DATEADD(day, StartOffset, '2000-01-01') AS StartDate
    ,DATEADD(day, StartOffset+DurationDays, '2000-01-01') AS EndDate
FROM
sys.all_objects AS o1
cross join sys.all_objects AS o2
cross apply
(
    SELECT
        cast((cast(CRYPT_GEN_RANDOM(4) as int) / 4294967295.0 + 0.5) * 100 + 1 as int) AS ProductID
        ,cast((cast(CRYPT_GEN_RANDOM(4) as int) / 4294967295.0 + 0.5) * 10000 as int) AS StartOffset
        ,cast((cast(CRYPT_GEN_RANDOM(4) as int) / 4294967295.0 + 0.5) * 1000 as int) AS DurationDays
) AS CA

The query to optimize:

DECLARE @VarDate date = '2004-01-01';
SELECT *
FROM Test
WHERE 
    ProductID = 1
    AND @VarDate >= StartDate
    AND @VarDate <= EndDate
;

It returns ~500 rows.

Execution plan

plan

The server suggested the following index:

CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[Test] ([ProductID],[StartDate],[EndDate])
INCLUDE ([ID])

but having such index is silly, IMHO.

If you had 1M rows in total and 100K different product IDs, rather than 100; in other words, if searching by specific product ID eliminates vast majority of rows, then the best option is likely to have one index on ProductID and include other columns into it:

CREATE NONCLUSTERED INDEX IX_Product
ON [dbo].[Test] ([ProductID])
INCLUDE ([StartDate],[EndDate])

OR

CREATE NONCLUSTERED INDEX IX_Product
ON [dbo].[Test] ([ProductID], [StartDate])
INCLUDE ([EndDate])

OR

CREATE NONCLUSTERED INDEX IX_Product
ON [dbo].[Test] ([ProductID],[EndDate])
INCLUDE ([StartDate])

If one of the dates gives good selectivity, then have an index on it instead of ProductID.

If none of the columns have good selectivity, then it is tough.

Edit

It is silly to blindly make an index as suggested by the optimizer, because you know that you will search for a particular ProductID, but then for a range of StartDates and then range of EndDates. So, the third column EndDate will never be used for the search itself. It this case it is better to INCLUDE this column in the index, rather than make it part of the index, as I've shown above.

If the query was for particular ProductID and for particular StartDate (not a range) and then for a range of EndDate (or particular EndDate), then having the EndDate as part of the index would help.

like image 29
Vladimir Baranov Avatar answered Oct 18 '22 19:10

Vladimir Baranov