Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sum within a window that is defined in column

Tags:

I want to implement the sum(x) of N next rows for each data.table row within a group, where N is value from window column.

Code to generate sample data:

set.seed(100)
ids <- 1:100
x <- floor(runif(100, 1, 100))
groups <- floor(runif(100, 1, 10)) * 10
window <- floor(runif(100, 1, 5))

library('data.table')
data <- data.table(ids, x, groups, window)
setkey(data, groups, ids)

Top rows:

 ids  x groups window
 1:   3 55     10      4
 2:   9 55     10      1
 3:  13 28     10      1
 4:  16 67     10      3
 5:  26 17     10      3
 6:  30 28     10      2
 7:  36 89     10      2
 8:  38 63     10      3
 9:  42 86     10      3
10:  48 88     10      1
11:  49 21     10      1
12:  59 60     10      3
13:  65 45     10      4
14:  67 46     10      2
15:  88 25     10      4
16:  19 36     20      2

So for the first row the resulting value will be calculated based on the sum of current and next 4 rows: res = 55 + 55 + 28 + 67 + 17 = 222

For the row 15 where the group ends I just need the value of the current row: res = 25 + 0 (no rows) = 25.

This is a pseudo code for this logic:

res <- data[, .(ids, groups, x, window , 
            result = sum(.SD[.CurrentRow:(.CurrentRow + Window)], na.rm = T)), 
            by = groups, .SDcols = c("x")]

I hope this can be implemented via data.table. I want to avoid for loop implementation for this.

like image 597
Bulat Avatar asked May 07 '16 10:05

Bulat


1 Answers

I'm not sure it's possible to do this without iterating over all rows, so here's one such solution:

data[, end := pmin(.I + window, .I[.N]), by = groups][
     , res := sum(data$x[.I:end]), by = 1:nrow(data)][1:16]
#    ids  x groups window end res
# 1:   3 55     10      4   5 222
# 2:   9 55     10      1   3  83
# 3:  13 28     10      1   4  95
# 4:  16 67     10      3   7 201
# 5:  26 17     10      3   8 197
# 6:  30 28     10      2   8 180
# 7:  36 89     10      2   9 238
# 8:  38 63     10      3  11 258
# 9:  42 86     10      3  12 255
#10:  48 88     10      1  11 109
#11:  49 21     10      1  12  81
#12:  59 60     10      3  15 176
#13:  65 45     10      4  15 116
#14:  67 46     10      2  15  71
#15:  88 25     10      4  15  25
#16:  19 36     20      2  18 173

As alexis_laz points out, you can do better by computing the cumsum once and then subtracting the extra part out, thus avoiding explicitly iterating over rows:

data[, res := { cs <- cumsum(x); 
                cs[pmin(1:.N + window, .N)] - shift(cs, fill = 0)}
     , by = groups]

I will try to explain what happens here:

  • res := {...} adds a column to our data.table with R calculation inside the brackets;
  • cs = cumsum(x) calculates running sum for all rows within the group;
  • cs[pmin(1:.N + window, .N)] takes value of running sum at the end of the window or at last row of the group;
  • shift(cs, fill = 0) gets the running sum from the preceding row;
  • difference of the two gives the sum of items within the window.

As there are several working answers to this question, I think it is worth providing benchmarking here:

library(microbenchmark)
m <- microbenchmark(
               "tapply(rawr)" = tapplyWay(dd),
               "data.table(eddi)" = data[, end := pmin(.I + window, .I[.N]), by = groups][
                   , res := sum(data$x[.I:end]), by = 1:nrow(data)],
               "data.table(alexis_laz)" = data[, res := {cs = cumsum(x); cs[pmin(1:.N + window, .N)] - shift(cs, fill = 0)}
                                               , by = groups],
               times = 10)
print(m)
boxplot(m)

Result for 10^5 rows sample:

Unit: milliseconds
            expr     min     lq      mean    median      uq    max        neval
       tapply(rawr) 2575.12 2761.365 2898.63 2905.77  3041.08  3127.86    10
   data.table(eddi) 1418.92 1570.230 1758.70 1656.14  1977.59  2358.85    10
     dt(alexis_laz) 6.82    7.73     8.78    8.09     10.37    12.37119    10

benchmarking of solutions

like image 79
eddi Avatar answered Sep 28 '22 03:09

eddi