
Aggregate for each day over time series, without using non-equijoin logic

Initial Question

Given the following dataset paired with a dates table:

MembershipId | ValidFromDate | ValidToDate
==========================================
0001         | 1997-01-01    | 2006-05-09
0002         | 1997-01-01    | 2017-05-12
0003         | 2005-06-02    | 2009-02-07

How many Memberships were open on any given day or timeseries of days?

Initial Answer

When this question was asked previously, the following answer provided the necessary functionality:

select d.[Date]
      ,count(m.MembershipID) as MembershipCount
from DIM.[Date] as d
    left join Memberships as m
        on(d.[Date] between m.ValidFromDateKey and m.ValidToDateKey)
where d.CalendarYear = 2016
group by d.[Date]
order by d.[Date];

though a commenter remarked that "there are other approaches when the non-equijoin takes too long."

Followup

As such, what would equijoin-only logic look like that replicates the output of the query above?


Progress So Far

From the answers provided so far I have come up with the query below, which outperforms the original on the hardware I am working with, across 3.2 million Membership records:

declare @s date = '20160101';
declare @e date = getdate();

with s as
(
    select d.[Date] as d
        ,count(s.MembershipID) as s
    from dbo.Dates as d
        join dbo.Memberships as s
            on d.[Date] = s.ValidFromDateKey
    group by d.[Date]
)
,e as
(
    select d.[Date] as d
        ,count(e.MembershipID) as e
    from dbo.Dates as d
        join dbo.Memberships as e
            on d.[Date] = e.ValidToDateKey
    group by d.[Date]
),c as
(
    select isnull(s.d,e.d) as d
            ,sum(isnull(s.s,0) - isnull(e.e,0)) over (order by isnull(s.d,e.d)) as c
    from s
        full join e
            on s.d = e.d
)
select d.[Date]
    ,c.c
from dbo.Dates as d
    left join c
        on d.[Date] = c.d
where d.[Date] between @s and @e
order by d.[Date]
;
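The idea behind the query above (count starts and ends per day, then take a running total) can be sketched outside SQL. This is a minimal Python illustration, not T-SQL and not the code that was benchmarked; the function name `daily_open_counts` and the `MEMBERSHIPS` list are hypothetical, built from the sample rows in the question. One assumption differs from the SQL: the sketch applies the decrement on the day after the end date, so a membership still counts on its last valid day, matching the inclusive BETWEEN of the original query. It also carries the running total across days that have no start or end events.

```python
from collections import Counter
from datetime import date, timedelta

# Sample rows from the question: (ValidFromDate, ValidToDate)
MEMBERSHIPS = [
    (date(1997, 1, 1), date(2006, 5, 9)),
    (date(1997, 1, 1), date(2017, 5, 12)),
    (date(2005, 6, 2), date(2009, 2, 7)),
]


def daily_open_counts(memberships, range_from, range_to):
    """Return {day: open membership count} for each day in the range."""
    deltas = Counter()
    for valid_from, valid_to in memberships:
        deltas[valid_from] += 1                      # opens on its first day
        deltas[valid_to + timedelta(days=1)] -= 1    # closed from the next day

    # Events before the range form the opening balance.
    running = sum(v for d, v in deltas.items() if d < range_from)

    counts = {}
    day = range_from
    while day <= range_to:
        running += deltas.get(day, 0)  # days without events keep the total
        counts[day] = running
        day += timedelta(days=1)
    return counts


counts_2006 = daily_open_counts(MEMBERSHIPS, date(2006, 1, 1), date(2006, 12, 31))
```

Each membership is touched once and each day once, so no range comparison between the two tables is needed.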

Following on from that, to split this aggregate into constituent groups per day I have the following, which is also performing well:

declare @s date = '20160101';
declare @e date = getdate();

with s as
(
    select d.[Date] as d
        ,s.MembershipGrouping as g
        ,count(s.MembershipID) as s
    from dbo.Dates as d
        join dbo.Memberships as s
            on d.[Date] = s.ValidFromDateKey
    group by d.[Date]
            ,s.MembershipGrouping
)
,e as
(
    select d.[Date] as d
        ,e.MembershipGrouping as g
        ,count(e.MembershipID) as e
    from dbo.Dates as d
        join dbo.Memberships as e
            on d.[Date] = e.ValidToDateKey
    group by d.[Date]
            ,e.MembershipGrouping
),c as
(
    select isnull(s.d,e.d) as d
            ,isnull(s.g,e.g) as g
            ,sum(isnull(s.s,0) - isnull(e.e,0)) over (partition by isnull(s.g,e.g) order by isnull(s.d,e.d)) as c
    from s
        full join e
            on s.d = e.d
                and s.g = e.g
)
select d.[Date]
    ,c.g
    ,c.c
from dbo.Dates as d
    left join c
        on d.[Date] = c.d
where d.[Date] between @s and @e
order by d.[Date]
        ,c.g
;

Can anyone improve on the above?

asked Mar 27 '18 09:03 by iamdave



1 Answer

If most of your membership validity intervals are longer than a few days, have a look at the answer by Martin Smith. That approach is likely to be faster.


When you take the calendar table (DIM.[Date]) and left join it to Memberships, you may end up scanning the Memberships table once for each date in the range. Even if there is an index on (ValidFromDate, ValidToDate), it may not help much.

It is easy to turn this around: scan the Memberships table only once and, for each membership, use CROSS APPLY to find the dates on which it is valid.

Sample data

DECLARE @T TABLE (MembershipId int, ValidFromDate date, ValidToDate date);

INSERT INTO @T VALUES
(1, '1997-01-01', '2006-05-09'),
(2, '1997-01-01', '2017-05-12'),
(3, '2005-06-02', '2009-02-07');

DECLARE @RangeFrom date = '2006-01-01';
DECLARE @RangeTo   date = '2006-12-31';

Query 1

SELECT
    CA.dt
    ,COUNT(*) AS MembershipCount
FROM
    @T AS Memberships
    CROSS APPLY
    (
        SELECT dbo.Calendar.dt
        FROM dbo.Calendar
        WHERE
            dbo.Calendar.dt >= Memberships.ValidFromDate
            AND dbo.Calendar.dt <= Memberships.ValidToDate
            AND dbo.Calendar.dt >= @RangeFrom
            AND dbo.Calendar.dt <= @RangeTo
    ) AS CA
GROUP BY
    CA.dt
ORDER BY
    CA.dt
OPTION(RECOMPILE);

OPTION(RECOMPILE) is not really needed. I include it in all queries when I compare execution plans, to be sure that I'm getting the latest plan while I experiment with the queries.

When I looked at the plan of this query, I saw that the seek on Calendar.dt used only ValidFromDate and ValidToDate; @RangeFrom and @RangeTo were pushed to the residual predicate. This is not ideal. The optimiser is not smart enough to calculate the maximum of the two dates (ValidFromDate and @RangeFrom) and use it as the starting point of the seek.

[screenshot: seek 1]

It is easy to help the optimiser:

Query 2

SELECT
    CA.dt
    ,COUNT(*) AS MembershipCount
FROM
    @T AS Memberships
    CROSS APPLY
    (
        SELECT dbo.Calendar.dt
        FROM dbo.Calendar
        WHERE
            dbo.Calendar.dt >= 
                CASE WHEN Memberships.ValidFromDate > @RangeFrom 
                THEN Memberships.ValidFromDate 
                ELSE @RangeFrom END
            AND dbo.Calendar.dt <= 
                CASE WHEN Memberships.ValidToDate < @RangeTo 
                THEN Memberships.ValidToDate 
                ELSE @RangeTo END
    ) AS CA
GROUP BY
    CA.dt
ORDER BY
    CA.dt
OPTION(RECOMPILE)
;

In this query the seek is optimal and doesn't read dates that may be discarded later.

[screenshot: seek 2]
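The clamping that the CASE expressions perform can be sketched in Python to make the logic easier to check by hand. This is only an illustration under assumptions: `clamped_days` and `membership_counts` are hypothetical names, the loop over dates stands in for the seek on dbo.Calendar, and nothing here says anything about SQL Server performance.

```python
from collections import Counter
from datetime import date, timedelta


def clamped_days(valid_from, valid_to, range_from, range_to):
    """Yield the days a membership is valid within the requested range."""
    start = max(valid_from, range_from)  # CASE WHEN ValidFromDate > @RangeFrom ...
    end = min(valid_to, range_to)        # CASE WHEN ValidToDate < @RangeTo ...
    day = start
    while day <= end:                    # empty when the intervals do not overlap
        yield day
        day += timedelta(days=1)


def membership_counts(memberships, range_from, range_to):
    """Visit each membership once and count it on every day it covers."""
    counts = Counter()
    for valid_from, valid_to in memberships:
        counts.update(clamped_days(valid_from, valid_to, range_from, range_to))
    return counts


sample = [
    (date(1997, 1, 1), date(2006, 5, 9)),
    (date(1997, 1, 1), date(2017, 5, 12)),
    (date(2005, 6, 2), date(2009, 2, 7)),
]
result = membership_counts(sample, date(2006, 1, 1), date(2006, 12, 31))
```

Because the start and end are clamped before any dates are generated, no date outside the requested range is ever produced and then thrown away, which is the same effect Query 2 aims for in the seek.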

Finally, you may not need to scan the whole Memberships table. We need only those rows where the given range of dates intersects with the valid range of the membership.

Query 3

SELECT
    CA.dt
    ,COUNT(*) AS MembershipCount
FROM
    @T AS Memberships
    CROSS APPLY
    (
        SELECT dbo.Calendar.dt
        FROM dbo.Calendar
        WHERE
            dbo.Calendar.dt >= 
                CASE WHEN Memberships.ValidFromDate > @RangeFrom 
                THEN Memberships.ValidFromDate 
                ELSE @RangeFrom END
            AND dbo.Calendar.dt <= 
                CASE WHEN Memberships.ValidToDate < @RangeTo 
                THEN Memberships.ValidToDate 
                ELSE @RangeTo END
    ) AS CA
WHERE
    Memberships.ValidToDate >= @RangeFrom
    AND Memberships.ValidFromDate <= @RangeTo
GROUP BY
    CA.dt
ORDER BY
    CA.dt
OPTION(RECOMPILE)
;

Two intervals [a1;a2] and [b1;b2] intersect when

a2 >= b1 and a1 <= b2
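The intersection test is small enough to check directly. Here is a minimal Python sketch; `intervals_intersect` is a hypothetical helper name, and both intervals are assumed to be closed (end points included), as in the WHERE clause of Query 3.

```python
from datetime import date


def intervals_intersect(a1, a2, b1, b2):
    """True when closed intervals [a1; a2] and [b1; b2] share at least one point."""
    return a2 >= b1 and a1 <= b2


range_from, range_to = date(2006, 1, 1), date(2006, 12, 31)

# Membership 1 from the sample data ends inside the range, so it qualifies.
overlaps = intervals_intersect(date(1997, 1, 1), date(2006, 5, 9), range_from, range_to)

# A membership that ended before the range starts is filtered out.
disjoint = intervals_intersect(date(1990, 1, 1), date(1995, 1, 1), range_from, range_to)
```

Intervals that only touch at a single end point still count as intersecting, because both comparisons are inclusive.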

These queries assume that the Calendar table has an index on dt.

You should test which indexes work better for the Memberships table. For the last query, if the table is rather large, two separate indexes on ValidFromDate and on ValidToDate would most likely be better than one index on (ValidFromDate, ValidToDate).

You should try different queries and measure their performance on the real hardware with real data. Performance may depend on the data distribution: how many memberships there are, what their valid dates are, how wide or narrow the given range is, and so on.

I recommend SQL Sentry Plan Explorer, a great tool for analysing and comparing execution plans. It is free. It shows a lot of useful stats, such as execution time and number of reads for each query. The screenshots above are from this tool.

answered Nov 15 '22 21:11 by Vladimir Baranov