
Cumulative sums over run lengths. Can this loop be vectorized?

I have a data frame on which I calculate a run length encoding for a specific column. The values of the column, dir, are either -1, 0, or 1.

dir.rle <- rle(df$dir)

I then take the run lengths and compute segmented cumulative sums across another column in the data frame. I'm using a for loop, but I feel like there should be a way to do this more intelligently.

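# tmp is assumed to be a copy of df that already has a cumval column
ndx <- 1
for(i in seq_along(dir.rle$lengths)) {
    l <- dir.rle$lengths[i] - 1   # offset from the start of the run to its end
    s <- ndx                      # first row of the run
    e <- ndx+l                    # last row of the run
    tmp[s:e,]$cumval <- cumsum(df[s:e,]$val)
    ndx <- e + 1                  # the next run starts on the following row
}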

The run lengths of dir define the start, s, and end, e, for each run. The above code works but it does not feel like idiomatic R code. I feel as if there should be another way to do it without the loop.

Louis Marascio asked Dec 02 '22 00:12


2 Answers

This can be broken down into a two-step problem. First, create an indexing column from the rle result; second, group by that column and run cumsum within each group. The grouping step can be done with any number of aggregation tools. I'll show two options, one using data.table and the other using plyr.

library(data.table)
library(plyr)
#A data.table behaves like a data.frame for most purposes
#Fake data
dat <- data.table(dir = sample(-1:1, 20, TRUE), value = rnorm(20))
dir.rle <- rle(dat$dir)
#Compute an indexing column to group by
dat <- transform(dat, indexer = rep(seq_along(dir.rle$lengths), dir.rle$lengths))


#What does the indexer column look like?
> head(dat)
     dir      value indexer
[1,]   1  0.5045807       1
[2,]   0  0.2660617       2
[3,]   1  1.0369641       3
[4,]   1 -0.4514342       3
[5,]  -1 -0.3968631       4
[6,]  -1 -2.1517093       4


#data.table approach
dat[, cumsum(value), by = indexer]

#plyr approach
ddply(dat, "indexer", summarize, V1 = cumsum(value))
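
Both calls above return a grouped result. If the goal is a cumulative-sum column aligned with the original rows, as in the question's loop, one possible base R sketch uses ave(). This is not one of the two options shown above; the data frame and its values are made up for illustration, and the column names val and cumval follow the question.

```r
# Small hand-made example data (hypothetical values)
df <- data.frame(dir = c(1, 1, 0, -1, -1, -1, 1),
                 val = c(2, 3, 5, 1, 1, 4, 7))

# Grouping variable: one id per run of identical dir values
r <- rle(df$dir)
df$indexer <- rep(seq_along(r$lengths), r$lengths)

# ave() applies cumsum within each group and returns the
# results in the original row order
df$cumval <- ave(df$val, df$indexer, FUN = cumsum)

print(df)
```

Because ave() keeps the row order, the result can be assigned straight back into the data frame without a merge step.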
Chase answered Dec 23 '22 13:12


Both Spacedman & Chase make the key point that a grouping variable simplifies everything (and Chase lays out two nice ways to proceed from there).

I'll just throw in an alternative approach to forming that grouping variable. It doesn't use rle and, at least to me, feels more intuitive. Basically, at each point where diff() detects a change in value, the cumsum that will form your grouping variable is incremented by one:

df$group <- c(0, cumsum(!(diff(df$dir)==0)))

# Or, equivalently
df$group <- c(0, cumsum(as.logical(diff(df$dir))))
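
As a quick sanity check, the diff-based grouping can be compared with an rle-based one on a small vector. This is only a sketch with made-up values, and it assumes dir contains no NA entries; the names group_diff and group_rle are introduced here for illustration.

```r
# Hand-made direction values (hypothetical)
dir <- c(1, 1, 0, -1, -1, -1, 1)

# Grouping ids from diff(): increment whenever the value changes
group_diff <- c(0, cumsum(diff(dir) != 0))

# Grouping ids from rle(): repeat each run's index by its length
r <- rle(dir)
group_rle <- rep(seq_along(r$lengths), r$lengths)

# The two labelings differ only in their starting value (0 versus 1)
same_groups <- all(group_diff + 1 == group_rle)
print(same_groups)
```

Either labeling works as a grouping variable, since only the run boundaries matter and not the ids themselves.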
Josh O'Brien answered Dec 23 '22 14:12
