I want to find the sum of the previous n rows in a data frame. E.g.:
id = 1:10
vals = c(4,7,2,9,7,0,4,6,1,8)
test = data.frame(id,vals)
So, for n=3, I'd want to calculate the next column as:
test$sum = c(NA, NA, 13,18,18,16,11,10,11,15)
The closest I've come is creating a new column using:
test$valprevious = c(NA, head(test$vals, -1))
Then using a loop to repeat this n times, then summing across the columns. I'm sure this isn't the most efficient method. Are there any functions that access the n previous rows? Or is there another way to do this?
You can use the rollsumr function from the zoo package for this:
library(zoo)
test$sums <- rollsumr(test$vals, k = 3, fill = NA)
which gives:
> test
   id vals sums
1   1    4   NA
2   2    7   NA
3   3    2   13
4   4    9   18
5   5    7   18
6   6    0   16
7   7    4   11
8   8    6   10
9   9    1   11
10 10    8   15
This is the same as using the rollsum function with the align = 'right' parameter:
rollsum(test$vals, k = 3, fill = NA, align = 'right')
As an alternative, you can use Reduce with shift from the data.table package:
library(data.table)
setDT(test)[, sums := Reduce(`+`, shift(vals, 0:2))]
which gives the same result:
> test
    id vals sums
 1:  1    4   NA
 2:  2    7   NA
 3:  3    2   13
 4:  4    9   18
 5:  5    7   18
 6:  6    0   16
 7:  7    4   11
 8:  8    6   10
 9:  9    1   11
10: 10    8   15
Recently, fast rolling functions were added to data.table. Thus, another option would be:
setDT(test)[, sums := frollsum(vals, 3)]
A nice base R alternative as proposed by @alexis_laz in the comments:
n <- 3
cs <- cumsum(test$vals)
test$sums <- c(rep_len(NA, n - 1), tail(cs, -(n - 1)) - c(0, head(cs, -n)))
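The cumulative-sum approach works because the sum of the window ending at row i equals cs[i] - cs[i - n], with cs[0] taken as 0. A minimal sketch on the example data, assuming 2 <= n <= length(x) (the helper name roll_sum_cumsum is made up for illustration and is not part of any package):

```r
# Hypothetical helper wrapping the cumsum approach; assumes 2 <= n <= length(x)
roll_sum_cumsum <- function(x, n) {
  cs <- cumsum(x)
  # cs[i] - cs[i - n] is the sum of x[(i - n + 1):i];
  # the leading 0 stands in for cs[0] in the first complete window
  c(rep_len(NA, n - 1), tail(cs, -(n - 1)) - c(0, head(cs, -n)))
}

vals <- c(4, 7, 2, 9, 7, 0, 4, 6, 1, 8)
cumsum(vals)
# [1]  4 11 13 22 29 29 33 39 40 48
roll_sum_cumsum(vals, 3)
# [1] NA NA 13 18 18 16 11 10 11 15
```

For example, the fourth value is 22 - 4 = 18, i.e. the cumulative sum up to row 4 minus the cumulative sum up to row 1.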
Two more options, as proposed by @Khashaa in the comments:
# with base R
n <- 3
test$sums <- c(rep_len(NA, n - 1), rowSums(embed(test$vals, n)))
# with RcppRoll
library(RcppRoll)
test$sums <- roll_sumr(test$vals, 3)
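To make the base R variant less opaque, it may help to look at what embed returns. The sketch below uses only base R on the example values; the variable names windows and sums_embed are chosen here for illustration:

```r
vals <- c(4, 7, 2, 9, 7, 0, 4, 6, 1, 8)
n <- 3

# embed() returns a matrix with length(vals) - n + 1 rows;
# row i holds vals[i + n - 1], ..., vals[i] (most recent value first)
windows <- embed(vals, n)
head(windows, 3)
#      [,1] [,2] [,3]
# [1,]    2    7    4
# [2,]    9    2    7
# [3,]    7    9    2

# summing each row gives the rolling sum; pad the first n - 1 positions with NA
sums_embed <- c(rep_len(NA, n - 1), rowSums(windows))
```

Since the intermediate matrix has n columns, this variant holds roughly n copies of the input in memory at once, which is worth keeping in mind for large windows.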
As @alexis_laz noted in the comments, some of the solutions might create overhead by recalculating sums and re-creating length-vectors. This may result in differences in computation speed. As a benchmark on such a small dataset isn't really meaningful, I'll benchmark the different solutions on a large dataset that mimics the example dataset:
# window size
n <- 3
# creating functions of the different solutions:
alexis_laz <- function(test) {cs <- cumsum(test$vals); test$sums <- c(rep_len(NA, n - 1), tail(cs, -(n - 1)) - c(0, head(cs, -n)))}
khashaa <- function(test) {test$sums <- c(rep_len(NA, n - 1), rowSums(embed(test$vals, n)))}
rcpp_roll <- function(test) test$sums <- roll_sumr(test$vals, n)
zoo_roll <- function(test) test$sums <- rollsumr(test$vals, k=n, fill=NA)
dt_reduce <- function(test) setDT(test)[, sums := Reduce(`+`, shift(vals, 0:(n-1)))]
dt_froll <- function(test) setDT(test)[, sums := frollsum(vals, n)]
# load the 'bench' package
library(bench)
# create a big test dataset
test <- data.frame(id=rep(1:10,1e7), vals=sample(c(4,7,2,9,7,0,4,6,1,8),1e7,TRUE))
# run the benchmark
big_bm <- mark(alexis_laz(test),
               khashaa(test),
               rcpp_roll(test),
               zoo_roll(test),
               dt_reduce(test),
               dt_froll(test),
               iterations = 1,
               check = FALSE)
# extract some core measures and sort them
# (%>%, select() and arrange() come from the 'dplyr' package)
library(dplyr)
big_bm %>% select(expression, median, mem_alloc) %>% arrange(median)
which gives:
  expression         median mem_alloc
  <bch:expr>       <bch:tm> <bch:byt>
1 dt_froll(test)   776.35ms    1.49GB
2 rcpp_roll(test)     1.23s  762.94MB
3 dt_reduce(test)     2.12s    4.47GB
4 alexis_laz(test)    3.68s    4.47GB
5 khashaa(test)       8.35s    5.21GB
6 zoo_roll(test)     33.32s   22.63GB
As you can see, the new frollsum function from the data.table package is the clear winner with regard to speed. When considering memory allocation, roll_sumr from the RcppRoll package needs the least amount of memory.
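Because the benchmark is run with check = FALSE, bench does not compare the results of the expressions with each other. A quick sanity check on the small example can fill that gap. The sketch below is restricted to the two base R solutions and compares them against a deliberately naive loop (naive_roll_sum is a hypothetical reference function written for this check, not one of the benchmarked solutions):

```r
vals <- c(4, 7, 2, 9, 7, 0, 4, 6, 1, 8)
n <- 3

# naive reference: explicitly sum each window of n consecutive values
naive_roll_sum <- function(x, n) {
  out <- rep(NA_real_, length(x))
  for (i in seq_along(x)) {
    if (i >= n) out[i] <- sum(x[(i - n + 1):i])
  }
  out
}

# the two base R solutions, written as plain expressions
cs <- cumsum(vals)
via_cumsum <- c(rep_len(NA, n - 1), tail(cs, -(n - 1)) - c(0, head(cs, -n)))
via_embed  <- c(rep_len(NA, n - 1), rowSums(embed(vals, n)))

# both comparisons should report TRUE
all.equal(naive_roll_sum(vals, n), via_cumsum)
all.equal(naive_roll_sum(vals, n), via_embed)
```

The same comparison could be extended to the zoo, data.table and RcppRoll variants once those packages are loaded.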