Suppose I have two datasets. One contains a list of promotions with start/end dates, and the other contains monthly sales data for each program.
promotions = data.frame(
start.date = as.Date(c("2012-01-01", "2012-06-14", "2012-02-01", "2012-03-31", "2012-07-13")),
end.date = as.Date(c("2014-04-05", "2014-11-13", "2014-02-25", "2014-08-02", "2014-09-30")),
program = c("a", "a", "a", "b", "b"))
sales = data.frame(
year.month.day = as.Date(c("2013-02-01", "2014-09-01", "2013-08-01", "2013-04-01", "2012-11-01")),
program = c("a", "b", "a", "a", "b"),
monthly.sales = c(200, 200, 200, 400, 200))
Note that sales$year.month.day is used to indicate year/month. The day is included so that R can treat the column as a vector of Date objects, but it isn't relevant to the actual sales.
I need to determine the number of promotions that occurred per month for each program. Here's an example of a loop that produces the output I want:
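sales$count = rep(0, nrow(sales))
sub = list()
for (i in seq_len(nrow(sales))) {
  # promotions that belong to the same program as this sales row
  sub[[i]] = promotions[which(promotions$program == sales$program[i]), ]
  # > 0 rather than > 1, so a program with a single promotion is still counted
  if (nrow(sub[[i]]) > 0) {
    for (j in seq_len(nrow(sub[[i]]))) {
      # is the sales date one of the days covered by promotion j?
      if (sales$year.month.day[i] %in% seq(from = as.Date(sub[[i]]$start.date[j]), to = as.Date(sub[[i]]$end.date[j]), by = "day")) {
        sales$count[i] = sales$count[i] + 1
      }
    }
  }
}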
Example output:
sales = data.frame(
year.month.day = as.Date(c("2013-02-01", "2014-09-01", "2013-08-01", "2013-04-01", "2012-11-01")),
program = c("a", "b", "a", "a", "b"),
monthly.sales = c(200, 200, 200, 400, 200),
count = c(3, 1, 3, 3, 2)
)
However, since my actual datasets are very large, this loop crashes when I run it in R.
Is there a more efficient way to achieve the same result? Perhaps something with dplyr?
Using non-equi joins in data.table (newly implemented in the development version, v1.9.7, and on CRAN from v1.9.8 onward):
require(data.table) # v1.9.7+
setDT(promotions) # convert to data.table by reference
setDT(sales)
ans = promotions[sales, .(monthly.sales, .N), by=.EACHI, allow.cartesian=TRUE,
on=.(program, start.date<=year.month.day, end.date>=year.month.day), nomatch=0L]
ans[, end.date := NULL]
setnames(ans, "start.date", "year.month.date")
#    program year.month.date monthly.sales N
# 1:       a      2013-02-01           200 3
# 2:       b      2014-09-01           200 1
# 3:       a      2013-08-01           200 3
# 4:       a      2013-04-01           400 3
# 5:       b      2012-11-01           200 2
See the data.table installation instructions for how to install the development version.
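If installing data.table is not an option, the condition used in the join above (same program, and start.date <= year.month.day <= end.date) can also be checked directly in base R. The following is only a minimal sketch on the example data from the question; count_promotions is a hypothetical helper name, not a function from any package. It compares dates directly instead of building a daily sequence for every promotion, which is where the original loop spends most of its time.

```r
# Example data copied from the question.
promotions <- data.frame(
  start.date = as.Date(c("2012-01-01", "2012-06-14", "2012-02-01", "2012-03-31", "2012-07-13")),
  end.date   = as.Date(c("2014-04-05", "2014-11-13", "2014-02-25", "2014-08-02", "2014-09-30")),
  program    = c("a", "a", "a", "b", "b"),
  stringsAsFactors = FALSE)

sales <- data.frame(
  year.month.day = as.Date(c("2013-02-01", "2014-09-01", "2013-08-01", "2013-04-01", "2012-11-01")),
  program        = c("a", "b", "a", "a", "b"),
  monthly.sales  = c(200, 200, 200, 400, 200),
  stringsAsFactors = FALSE)

# count_promotions() is a hypothetical helper (an assumption for illustration).
# For each sales row it counts the promotions of the same program whose
# [start.date, end.date] interval contains the sales date.
count_promotions <- function(sales, promotions) {
  vapply(seq_len(nrow(sales)), function(i) {
    sum(promotions$program == sales$program[i] &
        promotions$start.date <= sales$year.month.day[i] &
        promotions$end.date >= sales$year.month.day[i])
  }, integer(1))
}

sales$count <- count_promotions(sales, promotions)
sales$count
# [1] 3 1 3 3 2
```

Unlike the join with nomatch=0L, this keeps sales rows that match no promotion (their count is 0). It still scans all promotions once per sales row, so on very large data the non-equi join is likely the better choice.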