My scenario started out similar to a gaps-and-islands problem, where I needed to find consecutive days of work. My current SQL query answers "ProductA was produced at LocationA from DateA through DateB, totaling X quantity".
However, this is not sufficient once prices enter the mix. Prices live in a separate table and are applied in C# after the fact. Price changes are essentially a list of records that say "ProductA from LocationA is now Y value per unit effective DateC".
The end result works as long as an island does not span a price-change date. If it does, I get a "close" answer, but not a precise one.
The C# code can apply the prices efficiently; what I need is to split the islands at price changes. My goal is to make the SQL's partitioning take into account the ranking of dates from the other table, but I'm having trouble expressing that.
The current SQL that generates my islands is as follows:
SELECT MIN(ScheduledDate) AS StartDate, MAX(ScheduledDate) AS EndDate,
       ProductId, DestinationId, SUM(Quantity) AS TotalQuantity
FROM (
    SELECT ScheduledDate, DestinationId, ProductId,
           PartitionGroup = DATEADD(DAY, -1 * DENSE_RANK() OVER (ORDER BY ScheduledDate), ScheduledDate),
           Quantity
    FROM History
) tmp
GROUP BY PartitionGroup, DestinationId, ProductId;
The current SQL that reads the PriceChange table and ranks the dates is as follows:
DECLARE @PriceChangeDates TABLE(Rank int, SplitDate Date);
INSERT INTO @PriceChangeDates
SELECT DENSE_RANK() over (ORDER BY EffectiveDate) as Rank, EffectiveDate as SplitDate
FROM ProductPriceChange
GROUP BY EffectiveDate;
My thought is to update the first query's inner SELECT statement so that it takes advantage of the @PriceChangeDates table created by the second query. I would think we can multiply the DATEADD's increment parameter by the rank from the declared table, but I am struggling to write it.
If I were to do this with loops, I would determine the rank of each ScheduledDate from the @PriceChangeDates table, where its rank is the rank of the closest SplitDate that is earlier than the ScheduledDate. I would then take that rank and, I would think, multiply it by the increment parameter being passed in (or do some other math, for example multiplying the existing parameter by @PriceChangeDates.Count() and then adding the new rank to avoid collisions). However, that's "loop" logic, not "set" logic, and in SQL I need to think in sets.
Any and all help/advice is greatly appreciated. Thank you :)
UPDATE:
Sample data & example on SQLFiddle: http://www.sqlfiddle.com/#!18/af568/1
Where the data is:
CREATE TABLE History
(
ProductId int,
DestinationId int,
ScheduledDate date,
Quantity float
);
INSERT INTO History (ProductId, DestinationId, ScheduledDate, Quantity)
VALUES
(0, 1000, '20180401', 5),
(0, 1000, '20180402', 10),
(0, 1000, '20180403', 7),
(3, 5000, '20180507', 15),
(3, 5000, '20180508', 23),
(3, 5000, '20180509', 52),
(3, 5000, '20180510', 12),
(3, 5000, '20180511', 14);
CREATE TABLE PriceChange
(
ProductId int,
DestinationId int,
EffectiveDate date,
Price float
);
INSERT INTO PriceChange (ProductId, DestinationId, EffectiveDate, Price)
VALUES
(0, 1000, '20180201', 1),
(0, 1000, '20180402', 2),
(3, 5000, '20180101', 5),
(3, 5000, '20180510', 20);
The desired outcome is a SQL statement that generates this result:
StartDate   EndDate     ProductId  DestinationId  TotalQuantity
2018-04-01  2018-04-01  0          1000           5
2018-04-02  2018-04-03  0          1000           17
2018-05-07  2018-05-09  3          5000           90
2018-05-10  2018-05-11  3          5000           26
To clarify, the end result does need the TotalQuantity of each split amount, so that the procedural code that manipulates the results and applies the pricing knows how much of each product fell on each side of the price change and can accurately determine the values.
Here is one more variant that is likely to perform better than my first answer. I decided to put it as a second answer, because the approach is rather different and the answer would be too long. You should compare performance of all variants with your real data on your hardware, and don't forget about indexes.
In the first variant I was using APPLY to pick a relevant price for each row in the History table. For each row from the History table the engine searches for a relevant row in the PriceChange table. Even with an appropriate index on the PriceChange table, where each lookup is a single seek, it still means 3.7 million seeks in a loop join.
We can simply join the History and PriceChange tables together, and with appropriate indexes on both tables it will be an efficient merge join.
Here I'm also using an extended sample data set to illustrate the gaps. I added these rows to the sample data from the question.
INSERT INTO History (ProductId, DestinationId, ScheduledDate, Quantity)
VALUES
(0, 1000, '20180601', 5),
(0, 1000, '20180602', 10),
(0, 1000, '20180603', 7),
(3, 5000, '20180607', 15),
(3, 5000, '20180608', 23),
(3, 5000, '20180609', 52),
(3, 5000, '20180610', 12),
(3, 5000, '20180611', 14);
Intermediate query
We do a FULL JOIN here, not a LEFT JOIN, because it is possible that the date on which the price changed doesn't appear in the History table at all.
WITH
CTE_Join
AS
(
SELECT
ISNULL(History.ProductId, PriceChange.ProductID) AS ProductID
,ISNULL(History.DestinationId, PriceChange.DestinationId) AS DestinationId
,ISNULL(History.ScheduledDate, PriceChange.EffectiveDate) AS ScheduledDate
,History.Quantity
,PriceChange.Price
FROM
History
FULL JOIN PriceChange
ON PriceChange.ProductID = History.ProductID
AND PriceChange.DestinationId = History.DestinationId
AND PriceChange.EffectiveDate = History.ScheduledDate
)
,CTE2
AS
(
SELECT
ProductID
,DestinationId
,ScheduledDate
,Quantity
,Price
,MAX(CASE WHEN Price IS NOT NULL THEN ScheduledDate END)
OVER (PARTITION BY ProductID, DestinationId ORDER BY ScheduledDate
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS grp
FROM CTE_Join
)
SELECT *
FROM CTE2
ORDER BY
ProductID
,DestinationId
,ScheduledDate
Create the following indexes
CREATE UNIQUE NONCLUSTERED INDEX [IX_History] ON [dbo].[History]
(
[ProductId] ASC,
[DestinationId] ASC,
[ScheduledDate] ASC
)
INCLUDE ([Quantity]);
CREATE UNIQUE NONCLUSTERED INDEX [IX_Price] ON [dbo].[PriceChange]
(
[ProductId] ASC,
[DestinationId] ASC,
[EffectiveDate] ASC
)
INCLUDE ([Price]);
and the join will be an efficient MERGE join in the execution plan (not a LOOP join).
Intermediate result
+-----------+---------------+---------------+----------+-------+------------+
| ProductID | DestinationId | ScheduledDate | Quantity | Price | grp |
+-----------+---------------+---------------+----------+-------+------------+
| 0 | 1000 | 2018-02-01 | NULL | 1 | 2018-02-01 |
| 0 | 1000 | 2018-04-01 | 5 | NULL | 2018-02-01 |
| 0 | 1000 | 2018-04-02 | 10 | 2 | 2018-04-02 |
| 0 | 1000 | 2018-04-03 | 7 | NULL | 2018-04-02 |
| 0 | 1000 | 2018-06-01 | 5 | NULL | 2018-04-02 |
| 0 | 1000 | 2018-06-02 | 10 | NULL | 2018-04-02 |
| 0 | 1000 | 2018-06-03 | 7 | NULL | 2018-04-02 |
| 3 | 5000 | 2018-01-01 | NULL | 5 | 2018-01-01 |
| 3 | 5000 | 2018-05-07 | 15 | NULL | 2018-01-01 |
| 3 | 5000 | 2018-05-08 | 23 | NULL | 2018-01-01 |
| 3 | 5000 | 2018-05-09 | 52 | NULL | 2018-01-01 |
| 3 | 5000 | 2018-05-10 | 12 | 20 | 2018-05-10 |
| 3 | 5000 | 2018-05-11 | 14 | NULL | 2018-05-10 |
| 3 | 5000 | 2018-06-07 | 15 | NULL | 2018-05-10 |
| 3 | 5000 | 2018-06-08 | 23 | NULL | 2018-05-10 |
| 3 | 5000 | 2018-06-09 | 52 | NULL | 2018-05-10 |
| 3 | 5000 | 2018-06-10 | 12 | NULL | 2018-05-10 |
| 3 | 5000 | 2018-06-11 | 14 | NULL | 2018-05-10 |
+-----------+---------------+---------------+----------+-------+------------+
You can see that the Price column has a lot of NULL values. We need to "fill" these NULL values with the preceding non-NULL value.
Itzik Ben-Gan wrote a nice article showing how to solve this efficiently: "The Last non NULL Puzzle". Also see "Best way to replace NULL with most recent non-null value".
This is done in CTE2 using the MAX window function, and you can see how it populates the grp column. This requires SQL Server 2012+. After the groups are determined we should remove rows where Quantity is NULL, because these rows are not from the History table.
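To see the fill-down step in isolation, here is a minimal sketch on a hypothetical inline data set. The derived table d and its four rows are made up for illustration and mirror the first product in the sample data. The window MAX returns the most recent date that carried a price, and that date serves as the group key.

```sql
-- Hypothetical inline data: four dated rows, only two of which carry a price.
SELECT
    d.ScheduledDate
    ,d.Price
    -- Latest date at or before the current row that has a non-NULL price.
    ,MAX(CASE WHEN d.Price IS NOT NULL THEN d.ScheduledDate END)
        OVER (ORDER BY d.ScheduledDate
              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS grp
FROM
    (VALUES
        (CAST('20180201' AS date), CAST(1 AS float))
        ,(CAST('20180401' AS date), NULL)
        ,(CAST('20180402' AS date), 2)
        ,(CAST('20180403' AS date), NULL)
    ) AS d(ScheduledDate, Price)
ORDER BY d.ScheduledDate;
```

With these four rows, grp should be 2018-02-01 for the first two rows and 2018-04-02 for the last two, which matches the intermediate result above.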
Now we can do the same gaps-and-islands step using the grp column as an additional partitioning.
The rest of the query is pretty much the same as in the first variant.
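As a reminder of how that step works, here is a small sketch on hypothetical inline dates that all belong to one price group (the derived table d and the island column name are made up for illustration). Subtracting a row number from a day count gives a value that stays constant within a run of consecutive dates and changes after every gap.

```sql
-- Hypothetical inline data: two runs of consecutive dates inside a single price group.
SELECT
    d.ScheduledDate
    ,ROW_NUMBER() OVER (ORDER BY d.ScheduledDate) AS rn1
    ,DATEDIFF(day, '20000101', d.ScheduledDate) AS rn2
    -- Constant within a run of consecutive dates, different after a gap.
    ,DATEDIFF(day, '20000101', d.ScheduledDate)
        - ROW_NUMBER() OVER (ORDER BY d.ScheduledDate) AS island
FROM
    (VALUES
        (CAST('20180402' AS date))
        ,(CAST('20180403' AS date))
        ,(CAST('20180601' AS date))
        ,(CAST('20180602' AS date))
        ,(CAST('20180603' AS date))
    ) AS d(ScheduledDate)
ORDER BY d.ScheduledDate;
```

The two April rows share one island value and the three June rows share another, so grouping by that value (together with grp) yields one row per island.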
Final query
WITH
CTE_Join
AS
(
SELECT
ISNULL(History.ProductId, PriceChange.ProductID) AS ProductID
,ISNULL(History.DestinationId, PriceChange.DestinationId) AS DestinationId
,ISNULL(History.ScheduledDate, PriceChange.EffectiveDate) AS ScheduledDate
,History.Quantity
,PriceChange.Price
FROM
History
FULL JOIN PriceChange
ON PriceChange.ProductID = History.ProductID
AND PriceChange.DestinationId = History.DestinationId
AND PriceChange.EffectiveDate = History.ScheduledDate
)
,CTE2
AS
(
SELECT
ProductID
,DestinationId
,ScheduledDate
,Quantity
,Price
,MAX(CASE WHEN Price IS NOT NULL THEN ScheduledDate END)
OVER (PARTITION BY ProductID, DestinationId ORDER BY ScheduledDate
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS grp
FROM CTE_Join
)
,CTE_RN
AS
(
SELECT
ProductID
,DestinationId
,ScheduledDate
,grp
,Quantity
,ROW_NUMBER() OVER (PARTITION BY ProductId, DestinationId, grp ORDER BY ScheduledDate) AS rn1
,DATEDIFF(day, '20000101', ScheduledDate) AS rn2
FROM CTE2
WHERE Quantity IS NOT NULL
)
SELECT
ProductId
,DestinationId
,MIN(ScheduledDate) AS StartDate
,MAX(ScheduledDate) AS EndDate
,SUM(Quantity) AS TotalQuantity
FROM
CTE_RN
GROUP BY
ProductId
,DestinationId
,grp
,rn2-rn1
ORDER BY
ProductID
,DestinationId
,StartDate
;
Final result
+-----------+---------------+------------+------------+---------------+
| ProductId | DestinationId | StartDate | EndDate | TotalQuantity |
+-----------+---------------+------------+------------+---------------+
| 0 | 1000 | 2018-04-01 | 2018-04-01 | 5 |
| 0 | 1000 | 2018-04-02 | 2018-04-03 | 17 |
| 0 | 1000 | 2018-06-01 | 2018-06-03 | 22 |
| 3 | 5000 | 2018-05-07 | 2018-05-09 | 90 |
| 3 | 5000 | 2018-05-10 | 2018-05-11 | 26 |
| 3 | 5000 | 2018-06-07 | 2018-06-11 | 116 |
+-----------+---------------+------------+------------+---------------+
Unlike the first variant, this variant doesn't output the relevant price, because I simplified the "last non-null" query. It wasn't required in the question. In any case, it is pretty easy to add the price if needed.
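One possible way to add the price, sketched here as an assumption rather than as the approach from the first variant: since grp holds the effective date of the price in force, the final SELECT can join back to PriceChange on that date. This relies on (ProductId, DestinationId, EffectiveDate) being unique, which the IX_Price index above enforces. Islands that start before any price change get a NULL price.

```sql
WITH
CTE_Join
AS
(
    SELECT
        ISNULL(History.ProductId, PriceChange.ProductId) AS ProductId
        ,ISNULL(History.DestinationId, PriceChange.DestinationId) AS DestinationId
        ,ISNULL(History.ScheduledDate, PriceChange.EffectiveDate) AS ScheduledDate
        ,History.Quantity
        ,PriceChange.Price
    FROM
        History
        FULL JOIN PriceChange
            ON PriceChange.ProductId = History.ProductId
            AND PriceChange.DestinationId = History.DestinationId
            AND PriceChange.EffectiveDate = History.ScheduledDate
)
,CTE2
AS
(
    SELECT
        ProductId
        ,DestinationId
        ,ScheduledDate
        ,Quantity
        -- Effective date of the most recent price change at or before this row.
        ,MAX(CASE WHEN Price IS NOT NULL THEN ScheduledDate END)
            OVER (PARTITION BY ProductId, DestinationId ORDER BY ScheduledDate
                  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS grp
    FROM CTE_Join
)
,CTE_RN
AS
(
    SELECT
        ProductId
        ,DestinationId
        ,ScheduledDate
        ,grp
        ,Quantity
        ,ROW_NUMBER() OVER (PARTITION BY ProductId, DestinationId, grp ORDER BY ScheduledDate) AS rn1
        ,DATEDIFF(day, '20000101', ScheduledDate) AS rn2
    FROM CTE2
    WHERE Quantity IS NOT NULL
)
SELECT
    CTE_RN.ProductId
    ,CTE_RN.DestinationId
    ,MIN(CTE_RN.ScheduledDate) AS StartDate
    ,MAX(CTE_RN.ScheduledDate) AS EndDate
    ,SUM(CTE_RN.Quantity) AS TotalQuantity
    ,PriceChange.Price
FROM
    CTE_RN
    -- Assumption: look the price up again by the group's effective date.
    LEFT JOIN PriceChange
        ON PriceChange.ProductId = CTE_RN.ProductId
        AND PriceChange.DestinationId = CTE_RN.DestinationId
        AND PriceChange.EffectiveDate = CTE_RN.grp
GROUP BY
    CTE_RN.ProductId
    ,CTE_RN.DestinationId
    ,CTE_RN.grp
    ,CTE_RN.rn2 - CTE_RN.rn1
    ,PriceChange.Price
ORDER BY
    CTE_RN.ProductId
    ,CTE_RN.DestinationId
    ,StartDate
;
```

The extra join touches PriceChange a second time, so it is worth comparing its plan and timing against the first variant on real data.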