
Combine datasets by date range and categorical variable

Suppose I have two datasets. One contains a list of promotions with start/end dates, and the other contains monthly sales data for each program.

promotions = data.frame(
    start.date = as.Date(c("2012-01-01", "2012-06-14", "2012-02-01", "2012-03-31", "2012-07-13")), 
    end.date = as.Date(c("2014-04-05", "2014-11-13", "2014-02-25", "2014-08-02", "2014-09-30")), 
    program = c("a", "a", "a", "b", "b"))

sales = data.frame(
    year.month.day = as.Date(c("2013-02-01", "2014-09-01", "2013-08-01", "2013-04-01", "2012-11-01")), 
    program = c("a", "b", "a", "a", "b"), 
    monthly.sales = c(200, 200, 200, 400, 200))

Note that sales$year.month.day is used to indicate year/month. Day is included so R can more simply treat the column as a vector of date objects, but it isn't relevant to the actual sales.

I need to determine the number of promotions that occurred per month for each program. Here's an example of a loop that produces the output I want:

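sales$count = rep(0, nrow(sales))
sub = list()
for (i in seq_len(nrow(sales))) {
  # promotions that belong to the same program as sales row i
  sub[[i]] = promotions[which(promotions$program == sales$program[i]),]
  # > 0 (not > 1), so programs with exactly one promotion are still counted
  if (nrow(sub[[i]]) > 0) {
    for (j in seq_len(nrow(sub[[i]]))) {
      # count promotion j if the sales date falls within its start/end range
      if (sales$year.month.day[i] %in% seq(from = as.Date(sub[[i]]$start.date[j]), to = as.Date(sub[[i]]$end.date[j]), by = "day")) {
        sales$count[i] = sales$count[i] + 1
      }
    }
  }
}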

Example output:

sales = data.frame(
    year.month.day = as.Date(c("2013-02-01", "2014-09-01", "2013-08-01", "2013-04-01", "2012-11-01")), 
    program = c("a", "b", "a", "a", "b"), 
    monthly.sales = c(200, 200, 200, 400, 200),
    count = c(3, 1, 3, 3, 2)
)

However, because my actual datasets are very large, this loop crashes when I run it in R.

Is there a more efficient way to achieve the same result? Perhaps something with dplyr?

asked Dec 04 '22 00:12 by heo


1 Answer

Using non-equi joins, which were newly implemented in the development version of data.table (v1.9.7) at the time of writing and have been available on CRAN since v1.9.8:

require(data.table) # v1.9.7+
setDT(promotions) # convert to data.table by reference
setDT(sales)

ans = promotions[sales, .(monthly.sales, .N), by=.EACHI, allow.cartesian=TRUE, 
        on=.(program, start.date<=year.month.day, end.date>=year.month.day), nomatch=0L]

ans[, end.date := NULL]
setnames(ans, "start.date", "year.month.day") # restore the column name used in sales
#    program year.month.day monthly.sales N
# 1:       a     2013-02-01           200 3
# 2:       b     2014-09-01           200 1
# 3:       a     2013-08-01           200 3
# 4:       a     2013-04-01           400 3
# 5:       b     2012-11-01           200 2
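
To make it clearer what the non-equi join computes, here is a minimal base R sketch of the same logic: pair each sales row with the promotions of its program, keep the pairs whose date range covers the sales date, and count them. It is an illustration rather than a replacement for the join; `row.id`, `pairs` and `active` are helper names introduced here, and the intermediate `merge()` holds every sales/promotion pair per program, so it may need far more memory than the data.table join on large inputs.

```r
# Sample data from the question
promotions <- data.frame(
  start.date = as.Date(c("2012-01-01", "2012-06-14", "2012-02-01", "2012-03-31", "2012-07-13")),
  end.date = as.Date(c("2014-04-05", "2014-11-13", "2014-02-25", "2014-08-02", "2014-09-30")),
  program = c("a", "a", "a", "b", "b"))

sales <- data.frame(
  year.month.day = as.Date(c("2013-02-01", "2014-09-01", "2013-08-01", "2013-04-01", "2012-11-01")),
  program = c("a", "b", "a", "a", "b"),
  monthly.sales = c(200, 200, 200, 400, 200))

# row.id (hypothetical helper column) remembers the original order of the sales rows
sales$row.id <- seq_len(nrow(sales))

# Pair each sales row with every promotion of the same program
pairs <- merge(sales, promotions, by = "program")

# Keep only the pairs where the sales date lies inside the promotion window
active <- pairs$start.date <= pairs$year.month.day &
  pairs$year.month.day <= pairs$end.date

# Count the surviving pairs per original sales row; rows without a match get 0
sales$count <- tabulate(pairs$row.id[active], nbins = nrow(sales))
sales$row.id <- NULL

sales$count
# 3 1 3 3 2 for the sample data, matching the expected output in the question
```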

See installation instructions for development version here.
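
Since the question asks about dplyr: a comparable result can probably be obtained with dplyr's inequality joins. The sketch below assumes dplyr 1.1.0 or later (for `join_by()` and the `.by` argument); `row.id` and `result` are names introduced here for illustration. It has not been benchmarked against the data.table approach.

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for join_by() and .by

# Sample data from the question
promotions <- data.frame(
  start.date = as.Date(c("2012-01-01", "2012-06-14", "2012-02-01", "2012-03-31", "2012-07-13")),
  end.date = as.Date(c("2014-04-05", "2014-11-13", "2014-02-25", "2014-08-02", "2014-09-30")),
  program = c("a", "a", "a", "b", "b"))

sales <- data.frame(
  year.month.day = as.Date(c("2013-02-01", "2014-09-01", "2013-08-01", "2013-04-01", "2012-11-01")),
  program = c("a", "b", "a", "a", "b"),
  monthly.sales = c(200, 200, 200, 400, 200))

result <- sales %>%
  # row.id (hypothetical helper) keeps identical sales rows apart when counting
  mutate(row.id = row_number()) %>%
  # match on program and on the sales date lying within [start.date, end.date]
  left_join(promotions,
            by = join_by(program,
                         year.month.day >= start.date,
                         year.month.day <= end.date)) %>%
  # unmatched sales rows have NA in start.date and therefore count as 0
  summarise(count = sum(!is.na(start.date)),
            .by = c(row.id, year.month.day, program, monthly.sales)) %>%
  select(-row.id)

result$count
# 3 1 3 3 2 for the sample data
```

Unlike the join above with nomatch=0L, the left join keeps sales rows that have no active promotion and gives them a count of 0.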

answered Dec 14 '22 22:12 by Arun