Here is my original data frame:
df <- read.table(text="
Date Index Event
2014-03-31 A x
2014-03-31 A x
2014-03-31 A y
2014-04-01 A y
2014-04-01 A x
2014-04-01 B x
2014-04-02 B x
2014-04-03 A x
2014-09-30 B x", header = TRUE, stringsAsFactors = FALSE)
date_range <- seq(as.Date(min(df$Date)), as.Date(max(df$Date)), 'days')
indices <- unique(df$Index)
events_table <- unique(df$Event)
I want to summarise my data frame so that there is exactly one record for each index in indices and each date in date_range, with a new column for each event in events_table holding the cumulative count of that event over all dates prior to the value in the Date column. Not every index has records on every date.
Here is my desired output:
Date Index cumsum(Event = x) cumsum(Event = y)
2014-03-31 A 0 0
2014-03-31 B 0 0
2014-04-01 A 2 1
2014-04-01 B 0 0
2014-04-02 A 3 2
2014-04-02 B 1 0
...
2014-09-29 A 4 2
2014-09-29 B 2 0
2014-09-30 A 4 2
2014-09-30 B 2 0
FYI -- this is a simplified version of the data frame. There are ~200,000 records per year with hundreds of different Index fields for each Date.
I've done this in the past (before my hard drive fried) using by() and maybe aggregate(), but the process was very slow and I haven't been able to work it out this time around. I've also tried ddply(), but I can't get the cumsum() function to work with it. Using ddply(), I tried something like:
ddply(df, .(Date, Index), summarise,
sum.x = sum(Event == 'x'),
sum.y = sum(Event == 'y'))
to no avail.
Through searching, I found "Replicating an Excel SUMIFS formula", which gets me the cumulative part of my project, but with it I couldn't figure out how to summarise down to only one record per date/index combination. I also came across "sum/aggregate data based on dates, R", but there I couldn't work out the dynamic date aspect.
Thanks to anyone who can help!
library(dplyr)
library(tidyr)
df$Date <- as.Date(df$Date)
Step 1: Generate a full list of {Date, Index} pairs
full_dat <- expand.grid(
Date = date_range,
Index = indices,
stringsAsFactors = FALSE
) %>%
arrange(Date, Index) %>%
as_tibble()
Step 2: Define a cumsum() function that ignores NA
cumsum2 <- function(x){
x[is.na(x)] <- 0
cumsum(x)
}
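As a quick check of what cumsum2() does, here is a small sketch on a made-up vector (the input values are illustrative and not taken from the data above). Plain cumsum() returns NA from the first missing value onward, whereas cumsum2() treats missing values as zero.

```r
# Same helper as defined above, repeated so the snippet runs on its own.
cumsum2 <- function(x) {
  x[is.na(x)] <- 0
  cumsum(x)
}

v <- c(1, NA, 2)        # illustrative values only

plain  <- cumsum(v)     # 1 NA NA: the NA propagates
filled <- cumsum2(v)    # 1 1 3: the NA is counted as 0
```

This matters in Step 3 because the join introduces NA totals for every {Date, Index} pair that has no events.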
Step 3: Generate totals per {Date, Index}, join with full {Date, Index} data, and compute the lagged cumulative sum.
df %>%
group_by(Date, Index) %>%
summarise(
totx = sum(Event == "x"),
toty = sum(Event == "y")
) %>%
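right_join(full_dat, by = c("Date", "Index")) %>%
# Recent dplyr versions do not return right_join() rows in the order of full_dat,
# so sort by date before taking the cumulative sum.
arrange(Date, Index) %>%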
group_by(Index) %>%
mutate(
cumx = lag(cumsum2(totx)),
cumy = lag(cumsum2(toty))
) %>%
# some clean up.
select(-starts_with("tot")) %>%
mutate(
cumx = ifelse(is.na(cumx), 0, cumx),
cumy = ifelse(is.na(cumy), 0, cumy)
)
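To see why the lag is needed and where the NA that the clean-up step replaces comes from, here is a minimal sketch on a single vector. The counts are the daily totals of event x for index A over the first four dates; the variable names are only for illustration.

```r
library(dplyr)

daily   <- c(2, 1, 0, 1)               # events per day for one index
running <- cumsum(daily)               # 2 3 3 4: includes the current day
prior   <- lag(running)                # NA 2 3 3: only strictly earlier days
prior0  <- lag(running, default = 0)   # 0 2 3 3: first day has no history
```

Passing default = 0 to lag() is one possible way to avoid the separate ifelse() clean-up, assuming a dplyr version whose lag() accepts that argument.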
Would something like this using dplyr and tidyr work?
library(dplyr)
library(tidyr)
df %>%
group_by(Date, Index, Event) %>%
summarise(events = n()) %>%
group_by(Index, Event) %>%
mutate(cumsum_events = cumsum(events)) %>%
select(-events) %>%
spread(Event, cumsum_events) %>%
rename(sum.x = x,
sum.y = y)
# Date Index sum.x sum.y
#1 2014-03-31 A 2 1
#2 2014-04-01 A 3 2
#3 2014-04-01 B 1 NA
#4 2014-04-02 B 2 NA
#5 2014-04-03 A 4 NA
#6 2014-09-30 B 3 NA
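The output above counts events up to and including each date and only lists dates that have events. A possible way to get closer to the desired output (every date for every index, counting only earlier dates) is sketched below. It assumes a tidyr version that provides complete() and pivot_wider(); the column names n, cum, cum_x and cum_y are made up for this sketch.

```r
library(dplyr)
library(tidyr)

df <- read.table(text = "
Date Index Event
2014-03-31 A x
2014-03-31 A x
2014-03-31 A y
2014-04-01 A y
2014-04-01 A x
2014-04-01 B x
2014-04-02 B x
2014-04-03 A x
2014-09-30 B x", header = TRUE, stringsAsFactors = FALSE)
df$Date <- as.Date(df$Date)

result <- df %>%
  # Events per {Date, Index, Event}
  count(Date, Index, Event, name = "n") %>%
  # Add the missing {Date, Index, Event} combinations with a count of 0
  complete(
    Date = seq(min(Date), max(Date), by = "day"),
    Index, Event,
    fill = list(n = 0L)
  ) %>%
  arrange(Index, Event, Date) %>%
  group_by(Index, Event) %>%
  # Running total minus today's count = events on strictly earlier dates
  mutate(cum = cumsum(n) - n) %>%
  ungroup() %>%
  select(-n) %>%
  # One column per event type
  pivot_wider(names_from = Event, values_from = cum, names_prefix = "cum_") %>%
  arrange(Date, Index)
```

Because the event types are taken from the data rather than written out by hand, this should extend to more than two events, though it has not been timed on a data set of the size mentioned in the question.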