
Copy-on-modify semantics on a vector do not apply in a loop. Why?

This question seems to be partially answered here, but the answer is not specific enough for me. I would like to understand better when an object is updated by reference and when it is copied.

The simplest example is growing a vector. The following code is very inefficient in R because the memory is not allocated before the loop, so a copy is made at each iteration.

  x = runif(10)
  y = c() 

  for(i in 2:length(x))
    y = c(y, x[i] - x[i-1])

Preallocating the vector reserves the memory once, so nothing has to be reallocated at each iteration. This code is therefore drastically faster, especially with long vectors.

  x = runif(10)
  y = numeric(length(x))

  for(i in 2:length(x))
    y[i] = x[i] - x[i-1]
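
To make the difference concrete, here is a minimal sketch that times both approaches on a longer vector. The function names grow and prealloc and the length n are illustrative choices, not part of the original code, and the timings will vary with the machine and the R version.

```r
# Sketch: growing a vector with c() versus preallocating it.
n <- 1e4
x <- runif(n)

# Grows y by one element per iteration (a new vector is built each time)
grow <- function(x) {
  y <- c()
  for (i in 2:length(x)) y <- c(y, x[i] - x[i - 1])
  y
}

# Allocates y once, then fills it in place
prealloc <- function(x) {
  y <- numeric(length(x))
  for (i in 2:length(x)) y[i] <- x[i] - x[i - 1]
  y
}

t_grow <- system.time(y_grow <- grow(x))
t_pre  <- system.time(y_pre <- prealloc(x))
print(rbind(grow = t_grow, prealloc = t_pre)[, "elapsed"])

# Both versions compute the same differences; the preallocated one
# simply keeps a leading 0 in position 1.
print(all.equal(y_grow, y_pre[-1]))
print(all.equal(y_grow, diff(x)))
```

Note that diff(x) computes the same differences in a single vectorized call, which is usually the idiomatic choice in R.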

And here comes my question. When a vector is updated, it actually does move: a copy is made, as shown below.

a = 1:10
tracemem(a)  # tracemem() is a base R function
[1] "<0xf34a268>"
a[1] <- 0L
tracemem[0xf34a268 -> 0x4ab0c3f8]:
a[3] <- 0L
tracemem[0x4ab0c3f8 -> 0xf2b0a48]:

But in a loop this copy does not occur:

library(pryr)  # provides address()
y = numeric(length(x))
for(i in 2:length(x))
{
   y[i] = x[i] - x[i-1]
   print(address(y))
}

This gives:

[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0" 

I understand why code is slow or fast depending on its memory allocations, but I don't understand R's logic. Why and how, for the same statement, is the update made by reference in one case and by copy in the other? In the general case, how can we know what will happen?

JRR asked Jan 12 '18 16:01



2 Answers

This is covered in Hadley's Advanced R book. In it he says (paraphrasing here) that whenever 2 or more variables point to the same object, modifying it through one of them makes R copy the object and then modify that copy. Before going into examples, one important note, which is also mentioned in Hadley's book, is that when you're using RStudio

the environment browser makes a reference to every object you create on the command line.

Given the behavior you observed, I'm assuming you're using RStudio. As we will see, this explains why there are actually 2 variables pointing to a instead of 1 as you might expect.

The function we'll use to check how many variables are pointing to an object is refs(). In the first example you posted you can see:

library(pryr)
a = 1:10
refs(a)
#[1] 2

This suggests that 2 variables are pointing to a, which matches what you found: any modification to a will result in R copying it and then modifying that copy.
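
As a rough illustration of that rule outside RStudio, the sketch below binds a second name, b (introduced here only for the example), to the same vector and then modifies a. tracemem() only works when R was built with memory profiling, hence the capabilities() guard. The addresses it reports, and the exact number of copies, may differ between R versions.

```r
# Sketch: two names bound to one vector force a copy on modification.
a <- c(1, 2, 3)
b <- a                      # b now refers to the same vector as a

# Ask R to report copies of this vector, if the build supports it
if (capabilities("profmem")) tracemem(a)

a[1] <- 0                   # the vector is shared, so R copies it first

if (capabilities("profmem")) untracemem(a)

print(a)                    # the modified copy
print(b)                    # the original values, left untouched
```

Whatever the memory addresses turn out to be, the observable guarantee is that b keeps its original values after a is modified.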

Checking the for loop, we can see that y always has the same address and that refs(y) is 1 inside it. y is not copied because nothing else references y in your statement y[i] = x[i] - x[i-1]:

for(i in 2:length(x))
{
  y[i] = x[i] - x[i-1]
  print(c(address(y), refs(y)))
}

#[1] "0x19c3a230" "1"         
#[1] "0x19c3a230" "1"         
#[1] "0x19c3a230" "1"         
#[1] "0x19c3a230" "1"         
#[1] "0x19c3a230" "1"         
#[1] "0x19c3a230" "1"         
#[1] "0x19c3a230" "1"         
#[1] "0x19c3a230" "1"         
#[1] "0x19c3a230" "1" 

On the other hand, if you introduce a non-primitive function of y in your for loop, you will see that the address of y changes each time, which is more in line with what we would expect:

is.primitive(lag)
#[1] FALSE

for(i in 2:length(x))
{
  y[i] = lag(y)[i]
  print(c(address(y), refs(y)))
}

#[1] "0x19b31600" "1"         
#[1] "0x19b31948" "1"         
#[1] "0x19b2f4a8" "1"         
#[1] "0x19b2d2f8" "1"         
#[1] "0x19b299d0" "1"         
#[1] "0x19b1bf58" "1"         
#[1] "0x19ae2370" "1"         
#[1] "0x19a649e8" "1"         
#[1] "0x198cccf0" "1"  

Note the emphasis on non-primitive. If your function of y is primitive, such as - in y[i] = y[i] - y[i-1], R can optimize this to avoid copying.
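
A quick way to see which of the functions involved are primitive is is.primitive(). The sketch below is only a sanity check of that distinction and does not by itself prove when R copies. The name prims is illustrative, and stats::lag is spelled out on the assumption that lag above refers to the stats version.

```r
# Functions used by the in-place update y[i] = x[i] - x[i-1]
prims <- sapply(
  list("[" = `[`, "[<-" = `[<-`, "-" = `-`, length = length),
  is.primitive
)
print(prims)

# lag() is an ordinary R function (a closure), not a primitive
print(is.primitive(stats::lag))
```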

Credit to @duckmayr for helping explain the behavior inside the for loop.

Mike H. answered Oct 06 '22 11:10



I will complete @MikeH.'s answer with this code:

library(pryr)

x = runif(10)
y = numeric(length(x))
print(c(address(y), refs(y)))

for(i in 2:length(x))
{
  y[i] = x[i] - x[i-1]
  print(c(address(y), refs(y)))
}

print(c(address(y), refs(y)))

The output clearly shows what happened:

[1] "0x7872180" "2"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1" 
[1] "0x765b860" "2"  

There is a copy at the first iteration: because of RStudio, there are 2 refs. After this first copy, y belongs to the loop and is not visible in the global environment, so RStudio does not create any additional reference and no copy is made during the following updates: y is updated by reference. On loop exit, y becomes available in the global environment again. RStudio creates an extra reference, but this obviously does not change the address.
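
One way to take the environment browser out of the picture is to run the loop inside a function, where y is a local variable. The sketch below is an assumption-laden check rather than a definitive test: diff_in_place is a hypothetical helper name, and the addresses are only recorded if the pryr package happens to be installed.

```r
# Sketch: run the update loop on a local vector and record its addresses.
diff_in_place <- function(x) {
  y <- numeric(length(x))
  has_pryr <- requireNamespace("pryr", quietly = TRUE)
  addrs <- character(0)
  for (i in 2:length(x)) {
    y[i] <- x[i] - x[i - 1]
    # Record where y lives after each update (skipped without pryr)
    if (has_pryr) addrs <- c(addrs, pryr::address(y))
  }
  # n_addresses is 0 when pryr is not installed
  list(y = y, n_addresses = length(unique(addrs)))
}

set.seed(1)
x <- runif(10)
res <- diff_in_place(x)
print(res$n_addresses)
```

If y is updated by reference throughout, n_addresses should stay small. A value close to length(x) - 1 would instead suggest a copy at each iteration.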

JRR answered Oct 06 '22 11:10
