I am trying to implement a sum of consecutive values that fall with 365 days of each other, grouped by a unique identifier in R. For example, for date 1 of a particular ID, we would add dates 2,3,4 (fall within 365 days) of the same ID to get a total cost for date 1. Then for date 2, we would add 3 and 4 to get that total cost and so on. I have tried several rolling sums (R dplyr rolling sum) and similar solutions from dplyr that take sum consecutive values (Calculate sum of a column if the difference between consecutive rows meets a condition) with certain constraints but could not get the code to distinguish the amount of days. I have included a sample dataset with a solution dataset for an example that I am looking for.
Starting Data Set
ID <- c(1,1,1,1,1,1,2,2,2,2,3)
admitdt <-c("2014-10-19","2014-10-24","2015-01-31","2016-01-20","2017-06-30","2017-07-17","2015-04-21","2015-04-22","2015-05-04","2015-07-25","2014-11-11")
cost<-c(2000,14077,5000,200,560,5000,888,5959,1819,7508,6406)
cost365<-c(21077,19077,5200,200,5560,5000,16174,15286,9327,7508,6406)
df2<-data.frame(ID,admitdt,cost,cost365)
ID admitdt cost
1 1 2014-10-19 2000
2 1 2014-10-24 14077
3 1 2015-01-31 5000
4 1 2016-01-20 200
5 1 2017-06-30 560
6 1 2017-07-17 5000
7 2 2015-04-21 888
8 2 2015-04-22 5959
9 2 2015-05-04 1819
10 2 2015-07-25 7508
11 3 2014-11-11 6406
Solution:
ID <- c(1,1,1,1,1,1,2,2,2,2,3)
admitdt <-c("2014-10-19","2014-10-24","2015-01-31","2016-01-20","2017-06-30","2017-07-17","2015-04-21","2015-04-22","2015-05-04","2015-07-25","2014-11-11")
cost<-c(2000,14077,5000,200,560,500,888,5959,1819,7508,6406)
cost365<-c(21077,19077,5200,200,5560,5000,16174,15286,9327,7508,6406)
df2<-data.frame(ID,admitdt,cost,cost365)
ID admitdt cost cost365
1 1 2014-10-19 2000 21077
2 1 2014-10-24 14077 19077
3 1 2015-01-31 5000 5200
4 1 2016-01-20 200 200
5 1 2017-06-30 560 5560
6 1 2017-07-17 5000 5000
7 2 2015-04-21 888 16174
8 2 2015-04-22 5959 15286
9 2 2015-05-04 1819 9327
10 2 2015-07-25 7508 7508
11 3 2014-11-11 6406 6406
I am giving 2 methods each in slider and runner. Out of these I like slider because of its clarity of syntax. Neverthess, the strategy in both are nearly same,
date column will act as index in both..before and .after agruments, which in the instant case you need only after = days(365) (that is integrated with lubridate)-364 there.In slider you can do
library(tidyverse)
ID <- c(1,1,1,1,1,1,2,2,2,2,3)
admitdt <-c("2014-10-19","2014-10-24","2015-01-31","2016-01-20","2017-06-30","2017-07-17","2015-04-21","2015-04-22","2015-05-04","2015-07-25","2014-11-11")
cost<-c(2000,14077,5000,200,560,5000,888,5959,1819,7508,6406)
cost365<-c(21077,19077,5200,200,5560,5000,16174,15286,9327,7508,6406)
df<-data.frame(ID,admitdt,cost)
df
#> ID admitdt cost
#> 1 1 2014-10-19 2000
#> 2 1 2014-10-24 14077
#> 3 1 2015-01-31 5000
#> 4 1 2016-01-20 200
#> 5 1 2017-06-30 560
#> 6 1 2017-07-17 5000
#> 7 2 2015-04-21 888
#> 8 2 2015-04-22 5959
#> 9 2 2015-05-04 1819
#> 10 2 2015-07-25 7508
#> 11 3 2014-11-11 6406
library(slider)
library(lubridate)
df %>% group_by(ID) %>%
mutate(admitdt = as.Date(admitdt),
cost365 = slider::slide_index_sum(x = cost,
i = admitdt,
after = days(365)))
#> # A tibble: 11 x 4
#> # Groups: ID [3]
#> ID admitdt cost cost365
#> <dbl> <date> <dbl> <dbl>
#> 1 1 2014-10-19 2000 21077
#> 2 1 2014-10-24 14077 19077
#> 3 1 2015-01-31 5000 5200
#> 4 1 2016-01-20 200 200
#> 5 1 2017-06-30 560 5560
#> 6 1 2017-07-17 5000 5000
#> 7 2 2015-04-21 888 16174
#> 8 2 2015-04-22 5959 15286
#> 9 2 2015-05-04 1819 9327
#> 10 2 2015-07-25 7508 7508
#> 11 3 2014-11-11 6406 6406
Or in runner
library(dplyr, warn.conflicts = F)
ID <- c(1,1,1,1,1,1,2,2,2,2,3)
admitdt <-c("2014-10-19","2014-10-24","2015-01-31","2016-01-20","2017-06-30","2017-07-17","2015-04-21","2015-04-22","2015-05-04","2015-07-25","2014-11-11")
cost<-c(2000,14077,5000,200,560,5000,888,5959,1819,7508,6406)
cost365<-c(21077,19077,5200,200,5560,5000,16174,15286,9327,7508,6406)
df<-data.frame(ID,admitdt,cost)
library(runner)
df %>% group_by(ID) %>%
mutate(admitdt = as.Date(admitdt),
cost365 = runner::sum_run(x = cost,
idx = admitdt,
k = 365,
lag = -364))
#> # A tibble: 11 x 4
#> # Groups: ID [3]
#> ID admitdt cost cost365
#> <dbl> <date> <dbl> <dbl>
#> 1 1 2014-10-19 2000 21077
#> 2 1 2014-10-24 14077 19077
#> 3 1 2015-01-31 5000 5200
#> 4 1 2016-01-20 200 200
#> 5 1 2017-06-30 560 5560
#> 6 1 2017-07-17 5000 5000
#> 7 2 2015-04-21 888 16174
#> 8 2 2015-04-22 5959 15286
#> 9 2 2015-05-04 1819 9327
#> 10 2 2015-07-25 7508 7508
#> 11 3 2014-11-11 6406 6406
Created on 2021-07-19 by the reprex package (v2.0.0)
Here's an approach with purrr::map:
library(dplyr); library(purrr)
df2 %>%
mutate(admitdt = as.Date(admitdt)) %>%
group_by(ID) %>%
mutate(cost365 = map_dbl(admitdt,~sum(cost[(.x - admitdt) <= 0 &
(.x - admitdt) >= -365])))
# A tibble: 11 x 4
# Groups: ID [3]
ID admitdt cost cost365
<dbl> <date> <dbl> <dbl>
1 1 2014-10-19 2000 21077
2 1 2014-10-24 14077 19077
3 1 2015-01-31 5000 5200
4 1 2016-01-20 200 200
5 1 2017-06-30 560 1060
6 1 2017-07-17 500 500
7 2 2015-04-21 888 16174
8 2 2015-04-22 5959 15286
9 2 2015-05-04 1819 9327
10 2 2015-07-25 7508 7508
11 3 2014-11-11 6406 6406
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With