In R, why is v[length(v)+1] = x better than c(v, x)?

Tags:

r

I wrote the following two functions, one that grows a vector by repeated concatenation with c(), and one that grows it with bracket assignment:

c_test <- function(n) {
  cv = c()
  for(i in 1:n) cv = c(cv, i)
  cv
}

b_test <- function(n) {
  bv = c()
  for (i in 1:n) bv[i] = i
  bv
}

library(microbenchmark)
microbenchmark(c_test(1e+4), b_test(1e+4), times = 100)

#Unit: milliseconds
#          expr       min        lq      mean    median        uq      max neval
# c_test(10000) 140.27923 145.73282 156.82319 148.16175 151.74713 267.2393   100
# b_test(10000)  49.58033  54.42992  56.24268  54.86033  56.30862 132.8394   100

This is a big time difference, and I don't understand why using brackets is so much better than using concatenate. It seems like both cases should be spending time allocating new memory, but that doesn't appear to be what's happening. I also thought c(v, x) might be coercing x to the same type as v before merging, but assigning v[i] = as.vector(x) doesn't add a significant time cost.
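For reference, the coercion check looked roughly like this (a sketch with stand-in values, not the exact code I ran):

# A sketch of the kind of timing used for the coercion check
# (stand-in values; not the exact code from the testing above)
library(microbenchmark)

plain_test <- function(n) {
  v = c()
  for (i in 1:n) v[i] = i            # let any coercion happen implicitly
  v
}

coerce_test <- function(n) {
  v = c()
  for (i in 1:n) v[i] = as.vector(i) # explicit as.vector() before assignment
  v
}

microbenchmark(plain_test(1e+4), coerce_test(1e+4), times = 100)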

asked Jan 13 '17 by Bill

1 Answer

This should probably be a comment as I don't know the actual answer, but it was getting too long.

"c" and "[" are both primitive, internal and generic. This means method dispatch is done by C functions, and that's as far as I can get in answering your actual question. Something mysterious is going on there, and "[" is more efficient than "c" in this particular respect.

However, I did want to point out that, as some (but not all) of the comments say, both of these methods are inefficient, and not only because they forgo vectorisation. Preallocating the vector at the size you expect helps very materially, much more so than the difference between c and [: preallocation gives you a 70% to 90% speed-up over the [ version:

# very poor - repeated calls to c() to extend
c_test <- function(n) {
  cv = c()
  for(i in 1:n) cv = c(cv, i)
  cv
}

# slightly better - just use []
b_test <- function(n) {
  bv = c()
  for (i in 1:n) bv[i] = i
  bv
}

# much better practice - preallocate length of the vector
d_test <- function(n) {
  bv = numeric(n)
  for (i in 1:n) bv[i] = i
  bv
}

# good practice if possible - vectorisation
e_test <- function(n) {
  bv = 1:n
  bv
}


library(microbenchmark)
microbenchmark(c_test(1e+4), b_test(1e+4), d_test(1e+4), e_test(1e+4), times = 100)

This gives:

Unit: microseconds
          expr        min         lq         mean     median         uq        max neval cld
 c_test(10000) 102355.753 111202.568 129250.53638 114237.234 132468.938 220005.926   100   c
 b_test(10000)  47337.481  52820.938  77029.01728  59450.864 116529.185 192643.555   100  b 
 d_test(10000)   6761.877   7492.741   7965.37288   7814.519   8353.778  11007.605   100 a  
 e_test(10000)      3.555      6.321      9.32347      8.692     10.272     27.259   100 a  

Also, as @Roland said, "growing an object gets more expensive with increasing size". As the vector gets bigger, there are fewer convenient spots available in memory to put it.
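A rough way to see this (timings will vary by machine and R version) is to benchmark the c()-growing version at a few sizes; if the cost per element were constant, doubling n would only roughly double the time:

# Using c_test() from above: because of repeated reallocation, c_test(2e4)
# should take noticeably more than twice as long as c_test(1e4) on most systems
library(microbenchmark)
microbenchmark(c_test(5e+3), c_test(1e+4), c_test(2e+4), times = 20)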

I appreciate that e_test (the vectorised version) isn't applicable in your Fibonacci use case, but I left it in for comparison anyway, to give an idea of the scale of the speed-up when vectorisation is possible.
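For completeness, a preallocated Fibonacci-style loop might look something like the sketch below (hypothetical, since your actual Fibonacci code isn't in the question; it assumes n is at least 3):

# Preallocate the full result vector once, then fill it in place
fib_test <- function(n) {
  fv = numeric(n)
  fv[1] = 1
  fv[2] = 1
  for (i in 3:n) fv[i] = fv[i - 1] + fv[i - 2]
  fv
}

fib_test(10)
# 1 1 2 3 5 8 13 21 34 55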

answered Nov 19 '22 by Peter Ellis