Here is my original data frame:
df <- read.table(text="
Date Index Event
2014-03-31 A x
2014-03-31 A x
2014-03-31 A y
2014-04-01 A y
2014-04-01 A x
2014-04-01 B x
2014-04-02 B x
2014-04-03 A x
2014-09-30 B x", header = T, stringsAsFactors = F)
date_range <- seq(as.Date(min(df$Date)), as.Date(max(df$Date)), 'days')
indices <- unique(df$Index)
events_table <- unique(df$Event)
I want my desired output to summarise my dataframe and have a unique record for each index in indices and each date in date_range while providing a cumulative value of each event in events_table in a new column for all dates prior to the value in the Date column. Sometimes there are no records for each index or every date.
Here is my desired output:
Date Index cumsum(Event = x) cumsum(Event = y)
2014-03-31 A 0 0
2014-03-31 B 0 0
2014-04-01 A 2 1
2014-04-01 B 0 0
2014-04-02 A 3 2
2014-04-02 B 1 0
...
2014-09-29 A 4 2
2014-09-29 B 2 0
2014-09-30 A 4 2
2014-09-30 B 2 0
FYI -- this is a simplified version of the data frame. There are ~200,000 records per year with hundreds of different Index fields for each Date.
I've done this in the past before my hard drive fried using by and maybe aggregate, but the process was very slow and I'm not able to get it worked out this time around. I've also tried ddply, but I'm not able to get the cumsum function to work with it. Using ddply, I tried something like:
ddply(xo1, .(Date,Index), summarise,
sum.x = sum(Event == 'x'),
sum.y = sum(Event == 'y'))
to no avail.
Through searching, I've found Replicating an Excel SUMIFS formula
which gets me the cumulative part of my project, but with this I wasn't able to figure out how to summarize it down to only one record per date/index combo. I also came across sum/aggregate data based on dates, R but here I wasn't able to work out the dynamic date aspect.
Thanks for anyone that can help!
library(dplyr)
library(tidyr)
df$Date <- as.Date(df$Date)
Step 1: Generate a full list of {Date, Index} pairs
full_dat <- expand.grid(
Date = date_range,
Index = indices,
stringsAsFactors = FALSE
) %>%
arrange(Date, Index) %>%
tbl_df
Step 2: Define a cumsum() function that ignores NA
cumsum2 <- function(x){
x[is.na(x)] <- 0
cumsum(x)
}
Step 3: Generate totals per {Date, Index}, join with full {Date, Index} data, and compute the lagged cumulative sum.
df %>%
group_by(Date, Index) %>%
summarise(
totx = sum(Event == "x"),
toty = sum(Event == "y")
) %>%
right_join(full_dat, by = c("Date", "Index")) %>%
group_by(Index) %>%
mutate(
cumx = lag(cumsum2(totx)),
cumy = lag(cumsum2(toty))
) %>%
# some clean up.
select(-starts_with("tot")) %>%
mutate(
cumx = ifelse(is.na(cumx), 0, cumx),
cumy = ifelse(is.na(cumy), 0, cumy)
)
Would something like this using dplyr and tidyr work?
library(dplyr)
library(tidyr)
df %>%
group_by(Date, Index, Event) %>%
summarise(events = n()) %>%
group_by(Index, Event) %>%
mutate(cumsum_events = cumsum(events)) %>%
select(-events) %>%
spread(Event, cumsum_events) %>%
rename(sum.x = x,
sum.y = y)
# Date Index sum.x sum.y
#1 2014-03-31 A 2 1
#2 2014-04-01 A 3 2
#3 2014-04-01 B 1 NA
#4 2014-04-02 B 2 NA
#5 2014-04-03 A 4 NA
#6 2014-09-30 B 3 NA
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With