Given the following dataset paired with a dates table:
MembershipId | ValidFromDate | ValidToDate
==========================================
0001 | 1997-01-01 | 2006-05-09
0002 | 1997-01-01 | 2017-05-12
0003 | 2005-06-02 | 2009-02-07
How many Memberships
were open on any given day or timeseries of days?
Following this question being asked here, this answer provided the necessary functionality:
select d.[Date]
,count(m.MembershipID) as MembershipCount
from DIM.[Date] as d
left join Memberships as m
on(d.[Date] between m.ValidFromDateKey and m.ValidToDateKey)
where d.CalendarYear = 2016
group by d.[Date]
order by d.[Date];
though a commenter remarked that There are other approaches when the non-equijoin takes too long.
As such, what would the equijoin only logic look like to replicate the output of the query above?
From the answers provided so far I have come up with the below, which outperforms on the hardware I am working with across 3.2 million Membership
records:
declare @s date = '20160101';
declare @e date = getdate();
with s as
(
select d.[Date] as d
,count(s.MembershipID) as s
from dbo.Dates as d
join dbo.Memberships as s
on d.[Date] = s.ValidFromDateKey
group by d.[Date]
)
,e as
(
select d.[Date] as d
,count(e.MembershipID) as e
from dbo.Dates as d
join dbo.Memberships as e
on d.[Date] = e.ValidToDateKey
group by d.[Date]
),c as
(
select isnull(s.d,e.d) as d
,sum(isnull(s.s,0) - isnull(e.e,0)) over (order by isnull(s.d,e.d)) as c
from s
full join e
on s.d = e.d
)
select d.[Date]
,c.c
from dbo.Dates as d
left join c
on d.[Date] = c.d
where d.[Date] between @s and @e
order by d.[Date]
;
Following on from that, to split this aggregate into constituent groups per day I have the following, which is also performing well:
declare @s date = '20160101';
declare @e date = getdate();
with s as
(
select d.[Date] as d
,s.MembershipGrouping as g
,count(s.MembershipID) as s
from dbo.Dates as d
join dbo.Memberships as s
on d.[Date] = s.ValidFromDateKey
group by d.[Date]
,s.MembershipGrouping
)
,e as
(
select d.[Date] as d
,e..MembershipGrouping as g
,count(e.MembershipID) as e
from dbo.Dates as d
join dbo.Memberships as e
on d.[Date] = e.ValidToDateKey
group by d.[Date]
,e.MembershipGrouping
),c as
(
select isnull(s.d,e.d) as d
,isnull(s.g,e.g) as g
,sum(isnull(s.s,0) - isnull(e.e,0)) over (partition by isnull(s.g,e.g) order by isnull(s.d,e.d)) as c
from s
full join e
on s.d = e.d
and s.g = e.g
)
select d.[Date]
,c.g
,c.c
from dbo.Dates as d
left join c
on d.[Date] = c.d
where d.[Date] between @s and @e
order by d.[Date]
,c.g
;
Can anyone improve on the above?
GROUP BY in SQL, Explained And data aggregation is impossible without GROUP BY! Therefore, it is important to master GROUP BY to easily perform all types of data transformations and aggregations. In SQL, GROUP BY is used for data aggregation, using aggregate functions.
There are five aggregate functions, which are: MIN, MAX, COUNT, SUM, and AVG.
You can use the date and time data types with the MIN() , MAX() , COUNT() functions, the DISTINCT argument to those functions, and the GROUP BY argument to the SELECT() function.
If most of your membership validity intervals are longer than few days, have a look at an answer by Martin Smith. That approach is likely to be faster.
When you take calendar table (DIM.[Date]
) and left join it with Memberships
, you may end up scanning the Memberships
table for each date of the range. Even if there is an index on (ValidFromDate, ValidToDate)
, it may not be super useful.
It is easy to turn it around.
Scan the Memberships
table only once and for each membership find those dates that are valid using CROSS APPLY
.
Sample data
DECLARE @T TABLE (MembershipId int, ValidFromDate date, ValidToDate date);
INSERT INTO @T VALUES
(1, '1997-01-01', '2006-05-09'),
(2, '1997-01-01', '2017-05-12'),
(3, '2005-06-02', '2009-02-07');
DECLARE @RangeFrom date = '2006-01-01';
DECLARE @RangeTo date = '2006-12-31';
Query 1
SELECT
CA.dt
,COUNT(*) AS MembershipCount
FROM
@T AS Memberships
CROSS APPLY
(
SELECT dbo.Calendar.dt
FROM dbo.Calendar
WHERE
dbo.Calendar.dt >= Memberships.ValidFromDate
AND dbo.Calendar.dt <= Memberships.ValidToDate
AND dbo.Calendar.dt >= @RangeFrom
AND dbo.Calendar.dt <= @RangeTo
) AS CA
GROUP BY
CA.dt
ORDER BY
CA.dt
OPTION(RECOMPILE);
OPTION(RECOMPILE)
is not really needed, I include it in all queries when I compare execution plans to be sure that I'm getting the latest plan when I play with the queries.
When I looked at the plan of this query I saw that the seek in the Calendar.dt
table was using only ValidFromDate
and ValidToDate
, the @RangeFrom
and @RangeTo
were pushed to the residue predicate. It is not ideal. The optimiser is not smart enough to calculate maximum of two dates (ValidFromDate
and @RangeFrom
) and use that date as a starting point of the seek.
It is easy to help the optimiser:
Query 2
SELECT
CA.dt
,COUNT(*) AS MembershipCount
FROM
@T AS Memberships
CROSS APPLY
(
SELECT dbo.Calendar.dt
FROM dbo.Calendar
WHERE
dbo.Calendar.dt >=
CASE WHEN Memberships.ValidFromDate > @RangeFrom
THEN Memberships.ValidFromDate
ELSE @RangeFrom END
AND dbo.Calendar.dt <=
CASE WHEN Memberships.ValidToDate < @RangeTo
THEN Memberships.ValidToDate
ELSE @RangeTo END
) AS CA
GROUP BY
CA.dt
ORDER BY
CA.dt
OPTION(RECOMPILE)
;
In this query the seek is optimal and doesn't read dates that may be discarded later.
Finally, you may not need to scan the whole Memberships
table.
We need only those rows where the given range of dates intersects with the valid range of the membership.
Query 3
SELECT
CA.dt
,COUNT(*) AS MembershipCount
FROM
@T AS Memberships
CROSS APPLY
(
SELECT dbo.Calendar.dt
FROM dbo.Calendar
WHERE
dbo.Calendar.dt >=
CASE WHEN Memberships.ValidFromDate > @RangeFrom
THEN Memberships.ValidFromDate
ELSE @RangeFrom END
AND dbo.Calendar.dt <=
CASE WHEN Memberships.ValidToDate < @RangeTo
THEN Memberships.ValidToDate
ELSE @RangeTo END
) AS CA
WHERE
Memberships.ValidToDate >= @RangeFrom
AND Memberships.ValidFromDate <= @RangeTo
GROUP BY
CA.dt
ORDER BY
CA.dt
OPTION(RECOMPILE)
;
Two intervals [a1;a2]
and [b1;b2]
intersect when
a2 >= b1 and a1 <= b2
These queries assume that Calendar
table has an index on dt
.
You should try and see what indexes are better for the Memberships
table.
For the last query, if the table is rather large, most likely two separate indexes on ValidFromDate
and on ValidToDate
would be better than one index on (ValidFromDate, ValidToDate)
.
You should try different queries and measure their performance on the real hardware with real data. Performance may depend on the data distribution, how many memberships there are, what are their valid dates, how wide or narrow is the given range, etc.
I recommend to use a great tool called SQL Sentry Plan Explorer to analyse and compare execution plans. It is free. It shows a lot of useful stats, such as execution time and number of reads for each query. The screenshots above are from this tool.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With