Most efficient way of subsetting vectors

Tags: performance, r

I need to calculate the mean and variance of a subset of a vector. Let x be the vector and y be an indicator for whether the observation is in the subset. Which is more efficient:

sub.mean <- mean(x[y])
sub.var  <-  var(x[y])

or

sub      <- x[y]
sub.mean <- mean(sub)
sub.var  <-  var(sub)
sub      <- NULL

The first approach doesn't create a new object explicitly; but do the calls to mean and var do that implicitly? Or do they work on the original vector as stored?

Is the second faster because it doesn't have to do the subsetting twice?

I'm concerned with speed and with memory management for large data sets.

Charlie asked Feb 26 '13 15:02


1 Answer

Benchmarking on a vector of length 10M indicates that (on my machine) the latter approach is faster:

f1 = function(x, y) {
    sub.mean <- mean(x[y])
    sub.var  <-  var(x[y])
}

f2 = function(x, y) {
    sub      <- x[y]
    sub.mean <- mean(sub)
    sub.var  <-  var(sub)
    sub      <- NULL
}

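x = rnorm(10000000)
# y must be logical: a 0/1 numeric vector is read as positions, so x[y]
# would drop the zeros and return x[1] once for every 1
y = rbinom(10000000, 1, .5) == 1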

print(system.time(f1(x, y)))
#   user  system elapsed 
#  0.403   0.037   0.440 
print(system.time(f2(x, y)))
#   user  system elapsed 
#  0.233   0.002   0.235 

This isn't surprising: mean(x[y]) has to create a new object for mean to act on, even though that object is never bound to a name in the calling environment. The same is true of var(x[y]), so f1 is slower because it does the subsetting twice (as you surmised).
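If both statistics are needed in several places, the subset-once pattern can be wrapped in a small function. The following is only a sketch: `subset_stats` is a hypothetical name, not something from base R, and it assumes `y` is a logical vector of the same length as `x`. Because `sub` is local to the function, it becomes eligible for garbage collection when the function returns, so no explicit `sub <- NULL` is needed.

```r
# Hypothetical helper (not part of base R): subset once, reuse for both statistics.
subset_stats <- function(x, y) {
  # Assumption: y is a logical indicator with one entry per element of x
  stopifnot(is.logical(y), length(y) == length(x))
  sub <- x[y]  # the only copy of the subset that is created
  c(mean = mean(sub), var = var(sub))
}

# Small example: selects 1, 3 and 5
x <- c(1, 2, 3, 4, 5, 6)
y <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
stats <- subset_stats(x, y)
print(stats)
```

The results can then be read by name, for example stats["mean"] and stats["var"].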

David Robinson answered Oct 25 '22 16:10