Problem:
I have time series data for vouchers being earned and redeemed. Vouchers only last for a fixed 3-day window including the day they were earned, e.g. a voucher earned on 1st Jan would be active for the 1st, 2nd & 3rd.
I need to assume that redemptions are matched against earned vouchers on a first-come-first-served basis. E.g. if we have the data
Date VouchersEarned VouchersRedeemed
01/01/2020 10 0
02/01/2020 8 9
03/01/2020 4 4
04/01/2020 2 4
05/01/2020 1 4
then on the 2nd, those 9 redemptions are against vouchers from the 1st, i.e. we have remaining vouchers
Date VouchersEarned VouchersRedeemed RemainingVouchers
01/01/2020 10 0 1
02/01/2020 8 9 8
then on the 3rd, those 4 redemptions will be the remaining 1 from the 1st and 3 from the 2nd
Date VouchersEarned VouchersRedeemed RemainingVouchers
01/01/2020 10 0 0
02/01/2020 8 9 5
03/01/2020 4 4 0
the 4th:
Date VouchersEarned VouchersRedeemed RemainingVouchers
01/01/2020 10 0 0
02/01/2020 8 9 1
03/01/2020 4 4 4
04/01/2020 2 4 2
the 5th:
Date VouchersEarned VouchersRedeemed RemainingVouchers
01/01/2020 10 0 0
02/01/2020 8 9 1**
03/01/2020 4 4 0
04/01/2020 2 4 2
05/01/2020 1 4 1
So on the 5th we have our first expired voucher, as the ** one wasn't used up in its 3-day period.
I need to calculate, at any point in time, how many vouchers have expired using this logic. I can work it out in my head, going through line by line as above, but I'm struggling to see how to do it in a vectorised way that would also work in MSSQL. If it were just in R I could brute-force it with a loop. I have also posted this on Cross Validated but have had no feedback yet, so I'm extending it to the programming community.
Data:
I have included an R dput at the end but here is what it looks like
Edit: there is some new data at the end of this post that makes for a trickier example
  Date       VouchersEarned VouchersRedeemed ActualActive ActualExpired CumulativeEarned CumulativeRedeemed CumulativeDiff
1 01/01/2020             10                0           10             0               10                  0             10
2 02/01/2020              8                9            9             0               18                  9              9
3 03/01/2020              4                4            9             0               22                 13              9
4 04/01/2020              2                4            7             0               24                 17              7
5 05/01/2020              1                4            3             1               25                 21              4
6 06/01/2020              0                1            2             0               25                 22              3
7 07/01/2020              0                1            0             1               25                 23              2
8 08/01/2020              0                0            0             0               25                 23              2
ActualActive & ActualExpired are the numbers I get to with pen and paper. Note that I put them against the date they expired, rather than the date they were earned; either would work for me, it just changes the reporting view. I can get the total vouchers in play by taking the difference of CumulativeEarned & CumulativeRedeemed, so I think if I can just get the expired ones, calculating the active is simple.
If anyone has any ideas I'd really appreciate it as I seem to have a mental block on this today! Thank you! :)
Edit: my actual problem uses a 28-day window; this is a simplified view :)
df <- structure(list(Date = c("01/01/2020", "02/01/2020", "03/01/2020",
"04/01/2020", "05/01/2020", "06/01/2020", "07/01/2020", "08/01/2020"
), VouchersEarned = c(10L, 8L, 4L, 2L, 1L, 0L, 0L, 0L), VouchersRedeemed = c(0L,
9L, 4L, 4L, 4L, 1L, 1L, 0L), ActualActive = c(10L, 9L, 9L, 7L,
3L, 2L, 0L, 0L), ActualExpired = c(0L, 0L, 0L, 0L, 1L, 0L, 1L,
0L), CumulativeEarned = c(10L, 18L, 22L, 24L, 25L, 25L, 25L,
25L), CumulativeRedeemed = c(0L, 9L, 13L, 17L, 21L, 22L, 23L,
23L), CumulativeDiff = c(10L, 9L, 9L, 7L, 4L, 3L, 2L, 2L)), class = "data.frame", row.names = c(NA,
-8L))
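For reference, the brute-force loop I mentioned looks like this (a sketch of the pen-and-paper FIFO logic; fifoExpiry is just an illustrative name, and it reports each expiry on the day after the cohort's window closes, matching ActualExpired above):

```r
# Walk day by day, redeeming FIFO from the cohorts still inside the window.
fifoExpiry <- function(earned, redeemed, period = 3) {
  n <- length(earned)
  rem <- earned            # unredeemed vouchers, indexed by earn date
  expired <- numeric(n)
  for (i in seq_len(n)) {
    to_redeem <- redeemed[i]
    for (j in max(1L, i - period + 1L):i) {  # oldest active cohort first
      take <- min(rem[j], to_redeem)
      rem[j] <- rem[j] - take
      to_redeem <- to_redeem - take
    }
    # the cohort earned on day i - period + 1 leaves the window tonight;
    # whatever is left of it is reported as expired tomorrow
    e <- i - period + 1L
    if (e >= 1L && i + 1L <= n) expired[i + 1L] <- rem[e]
  }
  expired
}

fifoExpiry(c(10, 8, 4, 2, 1, 0, 0, 0), c(0, 9, 4, 4, 4, 1, 1, 0))
# matches ActualExpired: 0 0 0 0 1 0 1 0
```

This is the reference I'd like any vectorised answer to agree with.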
Edit 2: my latest attempt in R. Something is off, but I feel it should be possible with the right combination of lagged columns
library(data.table)
dt <- as.data.table(df)
dt[, Date := lubridate::dmy(Date)]
# functional form
findExpiredVouchers <- function(dt, period = 3){
  # generation of cumulative data
  dt[, CumulativeEarned := cumsum(VouchersEarned)]
  dt[, CumulativeRedeemed := cumsum(VouchersRedeemed)]
  dt[, CumulativeDiff := CumulativeEarned - CumulativeRedeemed]
  # I think if we look at the cumulative earned against the cumulative redeemed,
  # plus its lead period from that point, i.e. the cumulative redeemed 2 days later,
  # then we can see, for data prior to the last 3 rows, which have expired
  dt[, LaggedCumulativeRedeemed := shift(CumulativeRedeemed, period - 1, type = "lead")]
  dt[, ExpiredCumulative := CumulativeEarned - LaggedCumulativeRedeemed]
  # Now this creates negative values, e.g. for the first row; I'm not 100% sure
  # if I need to worry about these
  dt[ExpiredCumulative < 0, ExpiredCumulative := 0]
  # so now it should be the difference in this series that captures the origin
  # of an expiration
  dt[, Expired := c(NA, diff(ExpiredCumulative))]
  dt[1, Expired := ExpiredCumulative]
  # and I can shift this by the lag period to get the end result
  dt[, OutputExpired := shift(Expired, period, type = "lag")]
  dt[is.na(OutputExpired), OutputExpired := 0]
  # and active
  dt[, CumulativeExpired := cumsum(OutputExpired)]
  dt[, OutputActive := CumulativeDiff - CumulativeExpired]
}
dt <- findExpiredVouchers(dt, 3)
dt[] # OutputExpired & OutputActive
With some new fake data the flaws are exposed, as negative expirations come out :(
set.seed(1)
p <- 0.2
new_dt <- data.table(
  Date = 1:10,
  VouchersEarned = sample(0:15, 10, replace = TRUE)
)
new_dt[, CumulativeEarned := cumsum(VouchersEarned)]
# fake VouchersRedeemed
new_dt[, VouchersRedeemed := NA_integer_]
new_dt[, CumulativeDiff := CumulativeEarned]
for (i in 1:nrow(new_dt)) {
  new_value <- sum(rbinom(new_dt$CumulativeDiff[i], 1, p))
  new_dt[i, VouchersRedeemed := new_value]
  new_dt[i:.N, CumulativeDiff := CumulativeDiff - new_value]
}
new_dt <- findExpiredVouchers(dt=new_dt, 3)
new_dt[] # OutputExpired < 0
New example:
Date VouchersEarned VouchersRedeemed OutputExpired
1: 1 8 1 0
2: 2 3 1 0
3: 3 6 3 0
4: 4 0 1 3
5: 5 1 1 2
6: 6 12 5 5
7: 7 6 5 -5
8: 8 10 3 -4
9: 9 13 7 9
10: 10 1 8 -1
Running Waldi's loop gives a similar result, except the -5, -4 and 9 cancel out to 0s (as they should!)
Using the tidyverse libraries, in particular dplyr and magrittr, you can write this code:
library(dplyr)
library(magrittr)

df %<>%
mutate(VouchersEarnedCumsum = cumsum(VouchersEarned),
VouchersRedeemedCumsum = cumsum(VouchersRedeemed),
VouchersCumsumDifference = VouchersEarnedCumsum - lead(VouchersRedeemedCumsum, 2, default = max(VouchersRedeemedCumsum)),
VouchersCumsumDifference = as.numeric(VouchersCumsumDifference),
VouchersCumsumDifference = case_when(VouchersCumsumDifference < 0 ~ 0, T ~ VouchersCumsumDifference),
ExpiredVouchers = VouchersCumsumDifference - lag(VouchersCumsumDifference, default = 0),
ExpiredVouchersInDate = lag(ExpiredVouchers, 3, default = 0))
The resulting table, starting from your dataframe df, is
# A tibble: 8 x 8
Date VouchersEarned VouchersRedeemed VouchersEarnedCumsum VouchersRedeemedCumsum VouchersCumsumDifference ExpiredVouchers ExpiredVouchersInDate
<chr> <int> <int> <int> <int> <dbl> <dbl> <dbl>
1 01/01/2020 10 0 10 0 0 0 0
2 02/01/2020 8 9 18 9 1 1 0
3 03/01/2020 4 4 22 13 1 0 0
4 04/01/2020 2 4 24 17 2 1 0
5 05/01/2020 1 4 25 21 2 0 1
6 06/01/2020 0 1 25 22 2 0 0
7 07/01/2020 0 1 25 23 2 0 1
8 08/01/2020 0 0 25 23 2 0 0
Then it is quite easy to translate each function into the corresponding SQL using dbplyr's translate_sql.
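For instance, a quick sketch (assuming dbplyr is installed) to preview how a window verb would render for SQL Server, without touching a database:

```r
library(dbplyr)

# simulate_mssql() gives a connection stub with SQL Server translations,
# so you can inspect the generated SQL for any verb used in the pipeline
translate_sql(lag(x, 3L, 0L), con = simulate_mssql())
```

The same works for cumsum, lead and the other functions above.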
I'm not sure this code works in every situation you're going to face; I recommend testing it extensively.
EDIT: I think this corrected code solves the problem (note the extra accumulate step, which comes from purrr).
library(purrr) # for accumulate()

df %<>%
mutate(VouchersEarnedCumsum = cumsum(VouchersEarned),
VouchersRedeemedCumsum = cumsum(VouchersRedeemed),
VouchersCumsumDifference = VouchersEarnedCumsum - lead(VouchersRedeemedCumsum, 2, default = max(VouchersRedeemedCumsum)),
VouchersCumsumDifference = as.numeric(VouchersCumsumDifference),
VouchersCumsumDifference = case_when(VouchersCumsumDifference < 0 ~ 0, T ~ VouchersCumsumDifference),
VouchersCumsumDifference = accumulate(VouchersCumsumDifference, max),
ExpiredVouchers = VouchersCumsumDifference - lag(VouchersCumsumDifference, default = 0),
ExpiredVouchersInDate = lag(ExpiredVouchers, 3, default = 0))
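As a sanity check, here is the same pipeline in condensed form (with pmax standing in for the case_when clamp), run against the hand-computed ActualExpired column from the question:

```r
library(dplyr)
library(purrr)  # for accumulate()

check <- tibble(
  VouchersEarned   = c(10, 8, 4, 2, 1, 0, 0, 0),
  VouchersRedeemed = c(0, 9, 4, 4, 4, 1, 1, 0),
  ActualExpired    = c(0, 0, 0, 0, 1, 0, 1, 0)
) %>%
  mutate(EarnedCum     = cumsum(VouchersEarned),
         RedeemedCum   = cumsum(VouchersRedeemed),
         # cumulative earned minus what will have been redeemed 2 days on,
         # clamped at zero, then made monotone with a running max
         Diff          = pmax(EarnedCum - lead(RedeemedCum, 2, default = max(RedeemedCum)), 0),
         Diff          = accumulate(Diff, max),
         Expired       = Diff - lag(Diff, default = 0),
         ExpiredInDate = lag(Expired, 3, default = 0))

all(check$ExpiredInDate == check$ActualExpired)
# TRUE
```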
If you just need the total vouchers expired at a point in time, you can use the following query and read the value from the max-date line. If it's not exactly what you need, perhaps it can be a good starting point.
declare @vouchers table (VDate date, VouchersEarned int, VouchersRedeemed int);
insert @vouchers values
('2020-01-01', 10, 0),
('2020-01-02', 8, 9),
('2020-01-03', 4, 4),
('2020-01-04', 2, 4),
('2020-01-05', 1, 4),
('2020-01-06', 0, 1),
('2020-01-07', 0, 1),
('2020-01-08', 0, 0);
--select * from @vouchers;
declare @expiry_days int = 3;
declare @start_date date = '2020-01-01';
declare @end_date date = '2020-01-08';
with tally_cte (N) as ( --just a list of numbers 1-n
--need at least as many rows as there are dates in the date range; other options, including a permanent tally table, here: https://www.sqlservercentral.com/blogs/tally-tables-in-t-sql
-- 1000 rows
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n)
CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(n)
CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n)
), days_cte as ( --a list of all dates in date range to include records for days with no earned or redeemed
select VDate = dateadd(day, t.N - 1, @start_date)
from tally_cte t
where t.N <= datediff(day, @start_date, @end_date) + 1
)
select xed.VDate, xed.VouchersEarned, xed.VouchersRedeemed, xed.CumulativeEarned, xed.CumulativeRedeemed
, Expiring = lag(xed.Expiring, 1, 0) over (order by xed.VDate)
from (--determine expired vouchers with a lag
select xing.VDate, xing.VouchersEarned, xing.VouchersRedeemed, xing.CumulativeEarned, xing.CumulativeRedeemed
, Expiring = case when xing.Expiring < 0 then 0 else xing.Expiring end
from (--determine expiring vouchers
select v.*
, Expiring = lag(v.CumulativeEarned, @expiry_days - 1, 0) over (order by v.VDate) - v.CumulativeRedeemed
from (--basic query with cumulative earned and redeemed
select v.*
, CumulativeEarned = sum(v.VouchersEarned) over (order by v.VDate)
, CumulativeRedeemed = sum(v.VouchersRedeemed) over (order by v.vDate)
from days_cte d
join @vouchers v
on v.VDate = d.VDate
) v
) xing
) xed
order by xed.VDate;
Output:
VDate VouchersEarned VouchersRedeemed CumulativeEarned CumulativeRedeemed Expiring
---------- -------------- ---------------- ---------------- ------------------ -----------
2020-01-01 10 0 10 0 0
2020-01-02 8 9 18 9 0
2020-01-03 4 4 22 13 0
2020-01-04 2 4 24 17 0
2020-01-05 1 4 25 21 1
2020-01-06 0 1 25 22 1
2020-01-07 0 1 25 23 2
2020-01-08 0 0 25 23 2