Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Vectorize for loop in R

Tags:

r

I'm struggling to vectorize the repeated application of the following function, which I have currently implemented as a for loop. This small example is indicative of a problem with a larger dataset, for which vectorizing would allow for beneficial runtime improvements:

action = function(x,y,i) {

    firsttask = cumsum(x[which(x<y[i])])
    secondtask = mean(firsttask)
    thirdtask = min(firsttask[which(firsttask>secondtask)])
    fourthtask = length(firsttask)

    output = list(firsttask, data.frame(average=secondtask,
                                        min_over_mean=thirdtask,
                                        size=fourthtask))
    return(output)
}

thingofinterest = c(1:10)
limits = c(5:10)

test = vector("list", length = length(limits))
for(i in 1:length(limits)) {
test[[i]] = action(thingofinterest, limits, i)
}

test

I want to replace the for loop with a vectorized command, not any of the apply family of functions, as they do not always improve performance (I'm not suggesting there is anything wrong with for loops, I just need to optimize for speed in this case. See: Is R's apply family more than syntactic sugar?). How would I do this?

like image 312
Nigel Stackhouse Avatar asked Sep 02 '25 05:09

Nigel Stackhouse


1 Answers

You need to understand where the bottlenecks are in your code before you start trying to change it to make it faster. For example:

timer <- function(action, thingofinterest, limits) {
  st <- system.time({           # for the wall time
    Rprof(interval=0.01)        # Start R's profile timing
    for(j in 1:1000) {          # 1000 function calls
      test = vector("list")
      for(i in 1:length(limits)) {
        test[[i]] = action(thingofinterest, limits, i)
      }
    }
    Rprof(NULL)  # stop the profiler
  })
  # return profiling results
  list(st, head(summaryRprof()$by.total))
}
action = function(x,y,i) {
  firsttask = cumsum(x[which(x<y[i])])
  secondtask = min(firsttask[which(firsttask>mean(firsttask))])
  thirdtask = mean(firsttask)
  fourthtask = length(firsttask)
  output = list(firsttask, data.frame(average=secondtask,
                                      min_over_mean=thirdtask,
                                      size=fourthtask))
  return(output)
}
timer(action, 1:1000, 50:100)
# [[1]]
#    user  system elapsed 
#   9.720   0.012   9.737 
# 
# [[2]]
#                 total.time total.pct self.time self.pct
# "system.time"         9.72    100.00      0.07     0.72
# "timer"               9.72    100.00      0.00     0.00
# "action"              9.65     99.28      0.24     2.47
# "data.frame"          8.53     87.76      0.84     8.64
# "as.data.frame"       5.50     56.58      0.44     4.53
# "force"               4.40     45.27      0.11     1.13

You can see that very little time is spent outside the call to your action function. Now, for is a special primitive and is therefore not captured by the profiler, but the total time reported by the profiler is very similar to the wall time, so there can't be a lot of time missing from the profiler time.

And the thing that takes the most time in your action function is the call to data.frame. Remove that, and you get an enormous speedup.

action1 = function(x,y,i) {
  firsttask = cumsum(x[which(x<y[i])])
  secondtask = mean(firsttask)
  thirdtask = min(firsttask[which(firsttask>mean(firsttask))])
  fourthtask = length(firsttask)
  list(task=firsttask, average=secondtask,
    min_over_mean=thirdtask, size=fourthtask)
}
timer(action1, 1:1000, 50:100)
# [[1]]
#    user  system elapsed 
#   1.020   0.000   1.021 
# 
# [[2]]
#               total.time total.pct self.time self.pct
# "system.time"       1.01    100.00      0.06     5.94
# "timer"             1.01    100.00      0.00     0.00
# "action"            0.95     94.06      0.17    16.83
# "which"             0.57     56.44      0.23    22.77
# "mean"              0.25     24.75      0.13    12.87
# "<"                 0.20     19.80      0.20    19.80

Now you can also get rid of one of the calls to mean and both calls to which.

action2 = function(x,y,i) {
  firsttask = cumsum(x[x < y[i]])
  secondtask = mean(firsttask)
  thirdtask = min(firsttask[firsttask > secondtask])
  fourthtask = length(firsttask)
  list(task=firsttask, average=secondtask,
    min_over_mean=thirdtask, size=fourthtask)
}
timer(action2, 1:1000, 50:100)
# [[1]]
#    user  system elapsed 
#   0.808   0.000   0.808 
# 
# [[2]]
#               total.time total.pct self.time self.pct
# "system.time"       0.80    100.00      0.12    15.00
# "timer"             0.80    100.00      0.00     0.00
# "action"            0.68     85.00      0.24    30.00
# "<"                 0.20     25.00      0.20    25.00
# "mean"              0.13     16.25      0.08    10.00
# ">"                 0.05      6.25      0.05     6.25

Now you can see there's a "significant" amount of time spent doing stuff outside your action function. I put significant in quotes because it's 15% of the runtime, but only 120 milliseconds. If your actual code took ~12 hours to run, this new action function would finish in ~1 hour.

The results would be marginally better if I pre-allocated the test list outside of the for loop in the timer function, but the call to data.frame is the biggest time-consumer.

like image 94
Joshua Ulrich Avatar answered Sep 04 '25 21:09

Joshua Ulrich