This is my first post on stackoverflow, so please forgive me if my post is not detailed enough.
I have a data table with two columns (date and group ID). At the current date, I want to count the number of group occurrences that have occurred within the last x days. For my example below, we can say the last 30 days.
date = c("2014-04-01", "2014-04-12", "2014-04-07", "2014-05-03", "2014-04-14", "2014-05-04", "2014-03-31", "2014-04-18", "2014-04-23", "2014-04-01")
group = c("G","G","F","G","E","E","H","H","H","A")
dt = data.table(cbind(group,date))
group date
1: G 2014-04-01
2: G 2014-04-12
3: F 2014-04-07
4: G 2014-05-03
5: E 2014-04-14
6: E 2014-05-04
7: H 2014-03-31
8: H 2014-04-18
9: H 2014-04-23
10: A 2014-04-01
So, my desired new column would look like this:
group date count
1: G 2014-04-01 0
2: G 2014-04-12 1
3: F 2014-04-07 0
4: G 2014-05-03 1 (not including first G since it is outside 30 days)
5: E 2014-04-14 0
6: E 2014-05-04 1
7: H 2014-03-31 0
8: H 2014-04-18 1
9: H 2014-04-23 2
10: A 2014-04-01 0
I was able to use dplyr to perform a non-window count on counting the occurrences of the group at the current date, but I am struggling to find a way to do a 30 day count. For the non-window count, I did the following:
dt = data.table(dt %>%
group_by(group) %>%
mutate(count = row_number() - 1))
group date count
1: G 2014-04-01 0
2: G 2014-04-12 1
3: F 2014-04-07 0
4: G 2014-05-03 2
5: E 2014-04-14 0
6: E 2014-05-04 1
7: H 2014-03-31 0
8: H 2014-04-18 1
9: H 2014-04-23 2
10: A 2014-04-01 0
This is a small sample of the dataset, where the entire dataset contain a few million rows, so I would need something efficient. Any tips or suggestions would be greatly appreciated. Thank you in advance!
A data.table
option
dt[, date := as.Date(date)][, count := cumsum(date <= first(date) + 30) - 1, group]
gives
> dt
group date count
1: G 2014-04-01 0
2: G 2014-04-12 1
3: F 2014-04-07 0
4: G 2014-05-03 1
5: E 2014-04-14 0
6: E 2014-05-04 1
7: H 2014-03-31 0
8: H 2014-04-18 1
9: H 2014-04-23 2
10: A 2014-04-01 0
A dplyr
option following similar idea
dt %>%
mutate(date = as.Date(date)) %>%
group_by(group) %>%
mutate(count = cumsum(date <= first(date) + 30) - 1) %>%
ungroup()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With