Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

First bracketed assignment is as time-consuming as full assignment?

Regarding this answer in: What exactly is copy-on-modify semantics in R, and where is the canonical source?

We can see that, at the first time a vector is altered with '[<-', R copies the entire vector even if only a single entry is to be modifed. At the second time, however, the vector is altered "in place". This is noticeable without inspecting the address of the objects if we measure the time to create and modify a large vector:

> system.time(a <- rep(1L, 10^8))
   user  system elapsed 
   0.15    0.17    0.31 
> system.time(a[222L] <- 111L)
   user  system elapsed 
   0.26    0.08    0.34 
> system.time(a[333L] <- 111L)
   user  system elapsed 
      0       0       0

Note that there is no change of type/storage.mode.

So the question is: why is it not possible to optimize the first bracket assignment as well? In what situation this kind of behaviour (full copy at first modification) is actually needed?

EDIT: (spoiler!) As explained in the accepted answer below, this is nothing but an artifact of enclosing the first assignment in a system.time function call. This causes R to mark the memory space bound to a as possibly referring to more than one symbol, thus requiring duplication when changed. If we remove the enclosing calls, the vector is modified in place from the very first bracket assignment.

Thanks Martin for in-depth solution!

like image 811
Ferdinand.kraft Avatar asked May 07 '13 16:05

Ferdinand.kraft


2 Answers

Compare the "NAM()" part of

> a <- rep(1L, 10)
> .Internal(inspect(a))
@457b840 13 INTSXP g0c4 [NAM(1)] (len=10, tl=0) 1,1,1,1,1,...

versus

> system.time(a <- rep(1L, 10))
[...]
> .Internal(inspect(a))
@4626f88 13 INTSXP g0c4 [NAM(2)] (len=10, tl=0) 1,1,1,1,1,...

The "1" in the first example means that R thinks there is one reference to a, hence can be updated in place. The "2" means that R thinks there have been at least two references to a, hence duplication required if modified. Roughly, I rationalize this as the representation of the return value of rep() inside system.time, and its value outside system.time; the moral equivalent of f = function() { x <- rep(1L, 10); x }; a = f() rather than g = function() rep(1L, 10); a = g().

The real-world code a <- rep(1L, 10^8); a[123L] <- 231L would not involve a copy. We can time the assignment without artificially incrementing the NAMED count with

> a <- rep(1L, 10^8)
> .Internal(inspect(a))
@7f972b571010 13 INTSXP g0c7 [NAM(1)] (len=100000000, tl=0) 1,1,1,1,1,...
> system.time(a[123L] <- a[321L])
   user  system elapsed 
      0       0       0 
like image 108
Martin Morgan Avatar answered Oct 21 '22 22:10

Martin Morgan


Edit following Joshua's comment: The behaviour shown below is limited to R-studio!!

To answer OP's question, the underlying reason for the copy (as @MartinMorgan explains) is due to the NAM(2) SEXP object for a. If the first command were to not include a system.time(.), then a <- rep(1, 10^8) returns a NAM(1) type which then results in no copy on both the assignments.

Observation in R-studio:

However, to point out another interesting observation/difference, if you were to run in R-studio, there's an additional behaviour difference (from that of R64/R32 session) you may not be aware of.

The difference (in R studio) seems, to stem from how you run your code. That is, if you copy and paste all at once (as shown below, including the output):

system.time(a <- rep(1L, 10^8))
#    user  system elapsed 
#   0.256   0.263   0.526 
.Internal(inspect(a))
# @10745d000 13 INTSXP g0c7 [NAM(2)] (len=100000000, tl=0) 1,1,1,1,1,...
system.time(a[222L] <- 111L)
#    user  system elapsed 
#   0.299   0.199   0.498 
.Internal(inspect(a))
# @11f1d6000 13 INTSXP g0c7 [NAM(1)] (len=100000000, tl=0) 1,1,1,1,1,...
system.time(a[333L] <- 111L)
#    user  system elapsed 
#       0       0       0 
.Internal(inspect(a))
# @11f1d6000 13 INTSXP g1c7 [MARK,NAM(1)] (len=100000000, tl=0) 1,1,1,1,1,...

You see that the 2nd assignment doesn't involve a memory copy and time required is 0 seconds. Now, copy/paste/execute the same set of commands, but now one-by-one (hit enter after every line before typing the next). Here are the results:

system.time(a <- rep(1L, 10^8))
#    user  system elapsed 
#   0.256   0.265   0.588 
> 
.Internal(inspect(a))
# @10745d000 13 INTSXP g0c7 [NAM(2)] (len=100000000, tl=0) 1,1,1,1,1,...

system.time(a[222L] <- 111L)
#    user  system elapsed 
#   0.302   0.204   0.559 

.Internal(inspect(a))
# @11f1d6000 13 INTSXP g0c7 [NAM(2)] (len=100000000, tl=0) 1,1,1,1,1,...

system.time(a[333L] <- 111L)
#    user  system elapsed 
#   0.296   0.208   0.504 
> 
.Internal(inspect(a))
# @10745d000 13 INTSXP g0c7 [NAM(2)] (len=100000000, tl=0) 1,1,1,1,1,...

For the same syntax, here a copy is being made and the runtime is 0.5 seconds.

Now to explain the difference (as @MartinMorgan explained in his answer):

For the first case, being a NAM(2) SEXP object, it gets duplicated during assignment. However, this happens only once in the first case when you run all the lines at once. Also to note is that the second assignment has a MARK (unsigned int) which denotes "mark object as in use" (from the R-internals).

In the second case, in R-studio, hitting enter for every line results in each of these assignments returning back a NAM(2) SEXP object. And so, a copy is being made every single time.

like image 3
Arun Avatar answered Oct 21 '22 21:10

Arun