 

R: calculating a rolling average with a window based on value (not number of rows or a date/time variable)

I'm quite new to all the packages for calculating rolling averages in R, and I hope you can point me in the right direction.

I have the following data as an example:

ms <- c(300, 300, 300, 301, 303, 305, 305, 306, 308, 310, 310, 311, 312,
    314, 315, 315, 316, 316, 316, 317, 318, 320, 320, 321, 322, 324,
    328, 329, 330, 330, 330, 332, 332, 334, 334, 335, 335, 336, 336,
    337, 338, 338, 338, 340, 340, 341, 342, 342, 342, 342)
correct <- c(1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,
         1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1,
         1, 0, 0, 1, 0, 0, 1, 1, 0, 0)
df <- data.frame(ms, correct)

ms contains time points in milliseconds, and correct indicates whether a specific action was performed correctly
(1 = correct, 0 = not correct).

My goal is to calculate the percentage correct (or average) over windows of a set number of milliseconds. As you can see, certain time points are missing and certain time points occur multiple times, so I do not want to filter based on row number. I've looked into packages such as "tidyquant", but it seems that these kinds of packages need a time/date variable rather than a plain numeric variable to determine the window over which values are averaged. Is there a way to specify the window based on the numeric value of df$ms?

asked Nov 13 '18 by RmyjuloR
1 Answer

For the sake of completeness, here is an answer which uses data.table to aggregate in a non-equi join.

The OP has clarified in the comments that he is looking for a sliding window of 5 ms, i.e., windows covering 300-304, 301-305, 302-306, etc.

As there is no data point at 302 ms in the OP's data set, the missing start values need to be filled in.

library(data.table)
ws <- 5   # window size in ms
# create all possible [start, end] windows on the fly,
# then aggregate df within each window via a non-equi join
setDT(df)[SJ(start = seq(min(ms), max(ms), 1))[, end := start + ws - 1], 
          on = .(ms >= start, ms <= end),
          .(share_correct = mean(correct)), by = .EACHI]
     ms  ms share_correct
 1: 300 304     0.4000000
 2: 301 305     0.0000000
 3: 302 306     0.2500000
 4: 303 307     0.2500000
 5: 304 308     0.2500000
 6: 305 309     0.2500000
 7: 306 310     0.2500000
 8: 307 311     0.0000000
 9: 308 312     0.2000000
10: 309 313     0.2500000
11: 310 314     0.2000000
12: 311 315     0.4000000
13: 312 316     0.4285714
14: 313 317     0.2857143
15: 314 318     0.3750000
16: 315 319     0.4285714
17: 316 320     0.4285714
18: 317 321     0.4000000
19: 318 322     0.4000000
20: 319 323     0.2500000
21: 320 324     0.4000000
22: 321 325     0.3333333
23: 322 326     0.5000000
24: 323 327     1.0000000
25: 324 328     1.0000000
26: 325 329     0.5000000
27: 326 330     0.2000000
28: 327 331     0.2000000
29: 328 332     0.4285714
30: 329 333     0.3333333
31: 330 334     0.2857143
32: 331 335     0.5000000
33: 332 336     0.3750000
34: 333 337     0.2857143
35: 334 338     0.3000000
36: 335 339     0.3750000
37: 336 340     0.3750000
38: 337 341     0.4285714
39: 338 342     0.4000000
40: 339 343     0.4285714
41: 340 344     0.4285714
42: 341 345     0.4000000
43: 342 346     0.5000000
     ms  ms share_correct
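As a quick sanity check (my addition, not part of the original answer), the first window's value can be reproduced in base R with a simple logical subset on the vectors from the question:

```r
ms <- c(300, 300, 300, 301, 303, 305, 305, 306, 308, 310, 310, 311, 312,
        314, 315, 315, 316, 316, 316, 317, 318, 320, 320, 321, 322, 324,
        328, 329, 330, 330, 330, 332, 332, 334, 334, 335, 335, 336, 336,
        337, 338, 338, 338, 340, 340, 341, 342, 342, 342, 342)
correct <- c(1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,
             1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1,
             1, 0, 0, 1, 0, 0, 1, 1, 0, 0)

# share of correct actions in the closed window [300, 304]:
# ms values 300, 300, 300, 301, 303 -> correct 1, 1, 0, 0, 0
mean(correct[ms >= 300 & ms <= 304])
#> [1] 0.4
```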

If the OP is interested only in windows whose starting point exists in the data set, the code can be simplified:

setDT(df)[SJ(start = unique(ms))[, end := start + ws - 1], 
          on = .(ms >= start, ms <= end),
          .(share_correct = mean(correct)), by = .EACHI]
     ms  ms share_correct
 1: 300 304     0.4000000
 2: 301 305     0.0000000
 3: 303 307     0.2500000
 4: 305 309     0.2500000
 5: 306 310     0.2500000
 6: 308 312     0.2000000
 7: 310 314     0.2000000
 8: 311 315     0.4000000
 9: 312 316     0.4285714
10: 314 318     0.3750000
11: 315 319     0.4285714
12: 316 320     0.4285714
13: 317 321     0.4000000
14: 318 322     0.4000000
15: 320 324     0.4000000
16: 321 325     0.3333333
17: 322 326     0.5000000
18: 324 328     1.0000000
19: 328 332     0.4285714
20: 329 333     0.3333333
21: 330 334     0.2857143
22: 332 336     0.3750000
23: 334 338     0.3000000
24: 335 339     0.3750000
25: 336 340     0.3750000
26: 337 341     0.4285714
27: 338 342     0.4000000
28: 340 344     0.4285714
29: 341 345     0.4000000
30: 342 346     0.5000000
     ms  ms share_correct

In both cases, a data.table containing the intervals [start, end] is created on the fly and right-joined to df. During the non-equi join, the intermediate result is immediately grouped by the join parameters (by = .EACHI) and aggregated. Note that closed intervals are used to be in line with the OP's expectations.
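For completeness, here is a base R equivalent of the first variant (my sketch, not from the original answer; noticeably slower on large data, but dependency-free). It loops over all window start points and averages correct within each closed window:

```r
ms <- c(300, 300, 300, 301, 303, 305, 305, 306, 308, 310, 310, 311, 312,
        314, 315, 315, 316, 316, 316, 317, 318, 320, 320, 321, 322, 324,
        328, 329, 330, 330, 330, 332, 332, 334, 334, 335, 335, 336, 336,
        337, 338, 338, 338, 340, 340, 341, 342, 342, 342, 342)
correct <- c(1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,
             1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1,
             1, 0, 0, 1, 0, 0, 1, 1, 0, 0)

ws <- 5                                   # window size in ms
starts <- seq(min(ms), max(ms))           # one window per possible start
share_correct <- sapply(starts, function(s)
  mean(correct[ms >= s & ms <= s + ws - 1]))  # closed window [s, s + ws - 1]

# first values match the data.table result: 0.40, 0.00, 0.25, 0.25, ...
head(data.frame(start = starts, end = starts + ws - 1, share_correct))
```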

answered Sep 29 '22 by Uwe