
Find sum of previous n rows in dataframe




I want to find the sum of the previous n rows in a dataframe. For example:

id = 1:10
vals = c(4,7,2,9,7,0,4,6,1,8)
test = data.frame(id,vals)

So, for n=3, I'd want to calculate the next column as:

test$sum = c(NA, NA, 13,18,18,16,11,10,11,15)

The closest I've come is creating a new column using:

test$valprevious = c(NA, head(test$vals, -1))

Then I use a loop to repeat this n times and sum across the columns. I'm sure this isn't the most efficient method. Are there any functions that access the n previous rows, or is there another way to do this?

Esme_ asked Jun 12 '16 10:06


1 Answer

You can use the rollsumr function from the zoo package for this:

library(zoo)
test$sums <- rollsumr(test$vals, k = 3, fill = NA)

which gives:

> test
   id vals sums
1   1    4   NA
2   2    7   NA
3   3    2   13
4   4    9   18
5   5    7   18
6   6    0   16
7   7    4   11
8   8    6   10
9   9    1   11
10 10    8   15

This is the same as using the rollsum function with the align = 'right' parameter:

rollsum(test$vals, k = 3, fill = NA, align = 'right')

As an alternative, you can use Reduce with shift from the data.table package:

library(data.table)
setDT(test)[, sums := Reduce(`+`, shift(vals, 0:2))]

which gives the same result:

> test
    id vals sums
 1:  1    4   NA
 2:  2    7   NA
 3:  3    2   13
 4:  4    9   18
 5:  5    7   18
 6:  6    0   16
 7:  7    4   11
 8:  8    6   10
 9:  9    1   11
10: 10    8   15
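
To see what the one-liner does, it helps to look at the intermediate list: `shift(vals, 0:2)` returns the column lagged by 0, 1 and 2 rows, and `Reduce` adds those vectors elementwise, so any row with an `NA` in one of the lags ends up `NA`. A minimal sketch on a plain vector, assuming data.table is installed (the names `lags` and `sums` are only for illustration):

```r
library(data.table)

vals <- c(4, 7, 2, 9, 7, 0, 4, 6, 1, 8)

# shift() with a vector of lags returns a list, one element per lag
lags <- shift(vals, 0:2)
length(lags)   # 3
lags[[3]]      # lag 2: NA NA 4 7 2 9 7 0 4 6

# add the lagged vectors elementwise; the first two rows stay NA
sums <- Reduce(`+`, lags)
sums
```

For a window of size n, the lags would be `0:(n - 1)`, which is also how the benchmark function further down is written.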

Recently, fast rolling functions were added to data.table. Thus, another option would be:

setDT(test)[, sums := frollsum(vals, 3)]

A nice base R alternative as proposed by @alexis_laz in the comments:

n <- 3
cs <- cumsum(test$vals)
test$sums <- c(rep_len(NA, n - 1), tail(cs, -(n - 1)) - c(0, head(cs, -n)))
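
The idea behind the cumulative-sum version is that the sum of the window ending at row i equals `cs[i] - cs[i - n]`, with the cumulative sum before the first row taken as 0. A minimal sketch that wraps this in a function and checks it on the example values; the name `roll_sum_cumsum` is hypothetical, and the sketch assumes `n` is at least 2 and no larger than the length of the vector:

```r
# Hypothetical helper, not part of the original answer.
roll_sum_cumsum <- function(x, n) {
  # n = 1 would make tail(cs, -(n - 1)) empty, so it is excluded here
  stopifnot(n >= 2, n <= length(x))
  cs <- cumsum(x)
  # cumulative sums from position n onward, minus the cumulative sums
  # n positions earlier (0 for the very first complete window)
  c(rep_len(NA, n - 1), tail(cs, -(n - 1)) - c(0, head(cs, -n)))
}

vals <- c(4, 7, 2, 9, 7, 0, 4, 6, 1, 8)
roll_sum_cumsum(vals, 3)
# NA NA 13 18 18 16 11 10 11 15
```

Each value is touched once by `cumsum`, so the amount of work does not grow with the window size.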

Two more options, as proposed by @Khashaa in the comments:

# with base R
n <- 3
test$sums <- c(rep_len(NA, n - 1), rowSums(embed(test$vals, n)))

# with RcppRoll
library(RcppRoll)
test$sums <- roll_sumr(test$vals, 3)
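
The base R version relies on `embed()`, which lays out each window as one row of a matrix, so `rowSums()` over that matrix gives the rolling sums. A small sketch on the example values that makes the intermediate matrix visible (the names `windows` and `sums` are only for illustration):

```r
vals <- c(4, 7, 2, 9, 7, 0, 4, 6, 1, 8)
n <- 3

# each row holds one window: the current value followed by the n - 1 before it
windows <- embed(vals, n)
dim(windows)    # 8 3, i.e. length(vals) - n + 1 rows and n columns
windows[1, ]    # 2 7 4: the first complete window, most recent value first

# pad with n - 1 NAs so the result lines up with the original rows
sums <- c(rep_len(NA, n - 1), rowSums(windows))
sums
```

Because the matrix has n columns, memory use grows with the window size.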


As @alexis_laz noted in the comments, some of the solutions might create overhead in recalculating sums and re-creating length-vectors. This may result in differences in computation speed. As a benchmark on such a small dataset isn't really meaningful, I'll benchmark the different solutions on a large dataset that mimics the example dataset:

# window size
n <- 3

# creating functions of the different solutions:
alexis_laz <- function(test) {cs <- cumsum(test$vals); test$sums <- c(rep_len(NA, n - 1), tail(cs, -(n - 1)) - c(0, head(cs, -n)))}
khashaa <- function(test) {test$sums <- c(rep_len(NA, n - 1), rowSums(embed(test$vals, n)))}
rcpp_roll <- function(test) test$sums <- roll_sumr(test$vals, n)
zoo_roll <- function(test) test$sums <- rollsumr(test$vals, k=n, fill=NA)
dt_reduce <- function(test) setDT(test)[, sums := Reduce(`+`, shift(vals, 0:(n-1)))]
dt_froll <- function(test) setDT(test)[, sums := frollsum(vals, n)]

# load the 'bench' package, plus 'dplyr' for the pipe, select() and arrange() used below
library(bench)
library(dplyr)

# create a big test dataset (1e8 rows; 'vals' is recycled to the length of 'id')
test <- data.frame(id=rep(1:10,1e7), vals=sample(c(4,7,2,9,7,0,4,6,1,8),1e7,TRUE))

# run the benchmark
big_bm <- mark(alexis_laz(test),
               khashaa(test),
               rcpp_roll(test),
               zoo_roll(test),
               dt_reduce(test),
               dt_froll(test),
               iterations = 1,
               check = FALSE)

# extract some core measures and sort them
big_bm %>% select(expression, median, mem_alloc) %>% arrange(median)

which gives:

  expression         median mem_alloc
  <bch:expr>       <bch:tm> <bch:byt>
1 dt_froll(test)   776.35ms    1.49GB
2 rcpp_roll(test)     1.23s  762.94MB
3 dt_reduce(test)     2.12s    4.47GB
4 alexis_laz(test)    3.68s    4.47GB
5 khashaa(test)       8.35s    5.21GB
6 zoo_roll(test)     33.32s   22.63GB

As you can see, the new frollsum function from the data.table package is the clear winner with regard to speed. When considering memory allocation, roll_sumr from RcppRoll needs the least memory.

Jaap answered Sep 20 '22 23:09
