I want to implement the sum(x)
of N next rows for each data.table
row within a group, where N is value from window
column.
Code to generate sample data:
set.seed(100)
ids <- 1:100
x <- floor(runif(100, 1, 100))
groups <- floor(runif(100, 1, 10)) * 10
window <- floor(runif(100, 1, 5))
library('data.table')
data <- data.table(ids, x, groups, window)
setkey(data, groups, ids)
Top rows:
ids x groups window
1: 3 55 10 4
2: 9 55 10 1
3: 13 28 10 1
4: 16 67 10 3
5: 26 17 10 3
6: 30 28 10 2
7: 36 89 10 2
8: 38 63 10 3
9: 42 86 10 3
10: 48 88 10 1
11: 49 21 10 1
12: 59 60 10 3
13: 65 45 10 4
14: 67 46 10 2
15: 88 25 10 4
16: 19 36 20 2
So for the first row the resulting value will be calculated based on the sum of current and next 4 rows: res = 55 + 55 + 28 + 67 + 17 = 222
For the row 15 where the group ends I just need the value of the current row: res = 25 + 0 (no rows) = 25.
This is a pseudo code for this logic:
res <- data[, .(ids, groups, x, window ,
result = sum(.SD[.CurrentRow:(.CurrentRow + Window)], na.rm = T)),
by = groups, .SDcols = c("x")]
I hope this can be implemented via data.table
. I want to avoid for
loop implementation for this.
I'm not sure it's possible to do this without iterating over all rows, so here's one such solution:
data[, end := pmin(.I + window, .I[.N]), by = groups][
, res := sum(data$x[.I:end]), by = 1:nrow(data)][1:16]
# ids x groups window end res
# 1: 3 55 10 4 5 222
# 2: 9 55 10 1 3 83
# 3: 13 28 10 1 4 95
# 4: 16 67 10 3 7 201
# 5: 26 17 10 3 8 197
# 6: 30 28 10 2 8 180
# 7: 36 89 10 2 9 238
# 8: 38 63 10 3 11 258
# 9: 42 86 10 3 12 255
#10: 48 88 10 1 11 109
#11: 49 21 10 1 12 81
#12: 59 60 10 3 15 176
#13: 65 45 10 4 15 116
#14: 67 46 10 2 15 71
#15: 88 25 10 4 15 25
#16: 19 36 20 2 18 173
As alexis_laz points out, you can do better by computing the cumsum
once and then subtracting the extra part out, thus avoiding explicitly iterating over rows:
data[, res := { cs <- cumsum(x);
cs[pmin(1:.N + window, .N)] - shift(cs, fill = 0)}
, by = groups]
I will try to explain what happens here:
res := {...}
adds a column to our data.table with R calculation inside the brackets;cs = cumsum(x)
calculates running sum for all rows within the group;cs[pmin(1:.N + window, .N)]
takes value of running sum at the end of the window or at last row of the group;shift(cs, fill = 0)
gets the running sum from the preceding row;As there are several working answers to this question, I think it is worth providing benchmarking here:
library(microbenchmark)
m <- microbenchmark(
"tapply(rawr)" = tapplyWay(dd),
"data.table(eddi)" = data[, end := pmin(.I + window, .I[.N]), by = groups][
, res := sum(data$x[.I:end]), by = 1:nrow(data)],
"data.table(alexis_laz)" = data[, res := {cs = cumsum(x); cs[pmin(1:.N + window, .N)] - shift(cs, fill = 0)}
, by = groups],
times = 10)
print(m)
boxplot(m)
Result for 10^5 rows sample:
Unit: milliseconds
expr min lq mean median uq max neval
tapply(rawr) 2575.12 2761.365 2898.63 2905.77 3041.08 3127.86 10
data.table(eddi) 1418.92 1570.230 1758.70 1656.14 1977.59 2358.85 10
dt(alexis_laz) 6.82 7.73 8.78 8.09 10.37 12.37119 10
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With