
Is data really copied four times in R's replacement functions?

Consider this variable

a = data.frame(x=1:5,y=2:6)

When I use a replacement function to change the first element of a, how many times is a block of memory the size of a copied?

tracemem(a)
"change_first_element<-" = function(x, value) {
  x[1,1] = value
  return(x)
}
change_first_element(a) = 3
# tracemem[0x7f86028f12d8 -> 0x7f86028f1498]: 
# tracemem[0x7f86028f1498 -> 0x7f86028f1508]: change_first_element<- 
# tracemem[0x7f86028f1508 -> 0x7f8605762678]: [<-.data.frame [<- change_first_element<- 
# tracemem[0x7f8605762678 -> 0x7f8605762720]: [<-.data.frame [<- change_first_element<- 

There are four copy operations. I know that R doesn't mutate objects or pass by reference (yes, there are exceptions), but why are there four copies? Shouldn't one copy be enough?

Part 2:

If I call the replacement function directly instead, there are only three copy operations. Why?

tracemem(a)
a = `change_first_element<-`(a,3)
# tracemem[0x7f8611f1d9f0 -> 0x7f8607327640]: change_first_element<- 
# tracemem[0x7f8607327640 -> 0x7f8607327758]: [<-.data.frame [<- change_first_element<- 
# tracemem[0x7f8607327758 -> 0x7f8607327800]: [<-.data.frame [<- change_first_element<-
asked May 27 '14 at 21:05 by kdauria



1 Answer

NOTE: Unless otherwise specified, all explanations below are valid for R versions < 3.1.0. R v3.1.0 brought great improvements, which are briefly touched upon further down.

To answer your first question, "why four copies and shouldn't one be enough?", we'll begin by quoting the relevant part from R-internals first:

A 'named' value of 2, NAM(2), means that the object must be duplicated before being changed. (Note that this does not say that it is necessary to duplicate, only that it should be duplicated whether necessary or not.) A value of 0 means that it is known that no other SEXP shares data with this object, and so it may safely be altered.

A value of 1 is used for situations like dim(a) <- c(7, 2) where in principle two copies of a exist for the duration of the computation as (in principle) a <- `dim<-`(a, c(7, 2)) but for no longer, and so some primitive functions can be optimized to avoid a copy in this case.

NAM(1):

Let's start with NAM(1) objects. Here's an example:

x <- 1:5 # (1)
.Internal(inspect(x))
# @10374ecc8 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 1,2,3,4,5
tracemem(x)
# [1] "<0x10374ecc8>"

x[2L] <- 10L # (2)
.Internal(inspect(x))
# @10374ecc8 13 INTSXP g0c3 [MARK,NAM(1),TR] (len=5, tl=0) 1,10,3,4,5

What's happening here? We created an integer vector using :, which is a primitive, and that resulted in a NAM(1) object. When we then used [<- on that object, the value was changed in place (note that the pointers at (1) and (2) are identical). This is because [<-, being a primitive, knows how to handle its inputs and is optimised to avoid a copy in this scenario.

y = x # (3)
.Internal(inspect(x))
# @10374ecc8 13 INTSXP g0c3 [MARK,NAM(2),TR] (len=5, tl=0) 1,10,3,4,5

x[2L] <- 20L # (4)
.Internal(inspect(x))
# tracemem[0x10374ecc8 -> 0x10372f328]:
# @10372f328 13 INTSXP g0c3 [NAM(1),TR] (len=5, tl=0) 1,20,3,4,5

Now the same assignment results in a copy. Why? At (3), the 'named' field gets incremented to NAM(2) because more than one object now points to the same data. Even though [<- is optimised, the fact that the object is NAM(2) means it must be duplicated before being changed. After the assignment it is a NAM(1) object again, because the call to duplicate sets 'named' to 0 and the new assignment bumps it back to 1.
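The same copy-on-modify behaviour can also be checked programmatically rather than by reading inspect() output. The following is only a minimal sketch: it assumes the data.table package is installed and uses its address() helper, and the variable names are made up for illustration.

```r
library(data.table)  # assumed available; used only for address()

x <- c(1L, 2L, 3L, 4L, 5L)
y <- x                          # y is now a second name for the same vector

shared <- address(x)            # address of the shared data
same_before <- address(y) == shared

x[2L] <- 20L                    # the vector is shared, so R must copy before modifying

y_kept_old <- address(y) == shared   # y still points at the original data
x_moved    <- address(x) != shared   # x now points at a fresh copy

print(c(same_before = same_before, y_kept_old = y_kept_old, x_moved = x_moved))
print(y)                        # y keeps the original values
```

Whether an assignment to an unshared vector happens in place depends on the R version and on the front end in use, so the sketch deliberately checks only the shared case.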

Note: Peter Dalgaard explains nicely in this link why x = 2L results in a NAM(2) object.


NAM(2):

Now let's return to your question on calling *<- on a data.frame which is a NAM(2) object.

The first question then is, why is data.frame() a NAM(2) object? Why not a NAM(1) like the earlier case x <- 1:5? Duncan Murdoch answers this very nicely on the same post:

data.frame() is a plain R function, so it is treated no differently than any user-written function. On the other hand, the internal function that implements the : operator is a primitive, so it has complete control over its return value, and it can set NAMED in the most efficient way.

This means any attempt to change the value would result in triggering a duplicate (a deep copy). From ?tracemem:

... any copying of the object by the C function duplicate produces a message to standard output.

So a message from tracemem helps understand the number of copies. To understand the first line of your tracemem output, let's construct the function f<-, which does no actual replacement. Also, let's construct a data.frame big enough so that we can measure the time taken for a single copy of that data.frame.

## R v 3.0.3
`f<-` = function(x, value) {
    return(x) ## no actual replacement
}

df <- data.frame(x=1:1e8, y=1:1e8) # 762.9 Mb
tracemem(df) # [1] "<0x7fbccd2f4ae8>"

require(data.table)
system.time(copy(df)) 
# tracemem[0x7fbccd2f4ae8 -> 0x7fbccd2f4ff0]: copy system.time 
#   user  system elapsed 
#  0.609   0.484   1.106 

system.time(f(df) <- 3)
# tracemem[0x7fbccd2f4ae8 -> 0x7fbccd2f4f10]: system.time 
#   user  system elapsed 
#  0.608   0.480   1.101 

I've used the function copy() from data.table (which basically calls the C duplicate function). The times for copying are more or less identical. So, the first step is clearly a deep copy, even if it did nothing.

This explains the first two messages from tracemem in your post:

(1) From the global environment we called f(df) <- 3. That's one copy.
(2) Inside the function f<-, there is another assignment, x[1,1] <- 3, which calls [<- (and hence [<-.data.frame). That immediately makes the second copy.

Finding the rest of the copies is easy with a debugonce() on [<-.data.frame. That is, doing:

debugonce(`[<-.data.frame`)
df <- data.frame(x=1:1e8, y=1:1e8)
`f<-` = function(x, value) {
    x[1,1] = value
    return(x)
}
tracemem(df)
f(df) = 3

# first three lines:

# tracemem[0x7f8ba33d8a08 -> 0x7f8ba33d8d50]:      (1)
# tracemem[0x7f8ba33d8d50 -> 0x7f8ba33d8a78]: f<-  (2)
# debugging in: `[<-.data.frame`(`*tmp*`, 1L, 1L, value = 3L)

By hitting enter, you'll find the other two copies to be inside this function:

# debug: class(x) <- NULL
# tracemem[0x7f8ba33d8a78 -> 0x7f8ba3cd6078]: [<-.data.frame [<- f<-     (3)

# debug: x[[jj]][iseq] <- vjj
# tracemem[0x7f8ba3cd6078 -> 0x7f882c35ed40]: [<-.data.frame [<- f<-     (4)

Note that class<- is a primitive, but it's being called on a NAM(2) object. I suspect that's the reason for the copy there. The last copy is inevitable, as it modifies the column.

So, there you go.


Now a small note on R v3.1.0:

I also tested the same in R v3.1.0. tracemem still prints all four lines. However, the only time-consuming step is (4). If I understand correctly, the remaining cases, all due to [<- / class<-, now trigger a shallow copy instead of a deep copy. What's awesome is that, even in (4), only the column that's being modified seems to be deep copied. R 3.1.0 has great improvements!

This means tracemem reports shallow copies too, which is a bit confusing: the documentation doesn't explicitly say so, and it makes it hard to tell a shallow copy from a deep one except by measuring time. Perhaps my understanding is incorrect. Feel free to correct me.
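One way to tell the two kinds of copy apart without timing anything is to compare the addresses of the individual columns before and after the replacement. This is only a sketch under a few assumptions: R >= 3.1.0, the data.table package installed for its address() helper, and a made-up replacement function `set_y1<-` that modifies the second column.

```r
library(data.table)  # assumed available; used only for address()

# Hypothetical replacement function: changes the first element of column y
`set_y1<-` <- function(x, value) {
    x[1, 2] <- value
    x
}

df <- data.frame(x = 1:5, y = 2:6)
addr_x_before <- address(df$x)
addr_y_before <- address(df$y)

set_y1(df) <- 10L

# TRUE means the untouched column is still the same vector (shallow copy only)
x_shared <- address(df$x) == addr_x_before
# TRUE means the modified column was really duplicated
y_copied <- address(df$y) != addr_y_before

print(c(x_shared = x_shared, y_copied = y_copied))
```

If the untouched column keeps its address while the modified one moves, the data.frame itself was only shallow copied and just one column was duplicated.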


On your part 2, I'll quote Luke Tierney from here:

Calling a foo<- function directly is not a good idea unless you really understand what is going on in the assignment mechanism in general and in the particular foo<- function. It is definitely not something to be done in routine programming unless you like unpleasant surprises.

But I am unable to tell whether these unpleasant surprises extend to an object that's already NAM(2). Matt was calling it on an object created by list(), which is a primitive and therefore returns a NAM(1) object, and calling foo<- directly wouldn't increment its 'named' value.

But, the fact that R v3.1.0 has great improvements should already convince you that such a function call is not necessary anymore.

HTH.

PS: Feel free to correct me (and help me shorten this answer if possible) :).


Edit: I seem to have missed the point about one copy being saved when calling f<- directly, as pointed out in the comments. It's pretty easy to see by using the function Simon Urbanek used in the post (that's linked multiple times now):

# rm(list=ls()) # to make sure there's no other object in your workspace
`f<-` <- function(x, value) {
    print(ls(envir = parent.frame())) # list the objects visible in the caller
}

df <- data.frame(x=1, y=2)
tracemem(df) # [1] "<0x7fce01a65358>"

f(df) = 3
# tracemem[0x7fce0359b2a0 -> 0x7fce0359ae08]: 
# [1] "*tmp*" "df"    "f<-"  

df <- data.frame(x=1, y=2)
tracemem(df) # [1] "<0x7fce03c505c0>"
df <- `f<-`(df, 3)
# [1] "df"  "f<-"

As you can see, in the first method an object *tmp* is created, which is not the case in the second. It seems that creating this *tmp* object for a NAM(2) input triggers a copy of the input before *tmp* gets passed as the function argument. But that's as far as my understanding goes.
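To make the role of *tmp* concrete, here is a rough sketch of what the replacement syntax expands to, following the description of replacement calls in the R Language Definition. The function body and the data are only illustrative, and the sketch says nothing about how many copies a given R version makes.

```r
`f<-` <- function(x, value) {
    x[1, 1] <- value
    x
}

df1 <- data.frame(x = 1:5, y = 2:6)
df2 <- data.frame(x = 1:5, y = 2:6)

# Replacement-call syntax
f(df1) <- 3L

# Roughly the equivalent spelled out by hand
`*tmp*` <- df2
df2 <- `f<-`(`*tmp*`, value = 3L)
rm(`*tmp*`)

print(identical(df1, df2))  # both routes give the same result
```

The direct call df <- `f<-`(df, 3) skips the intermediate *tmp* binding, which is consistent with one fewer tracemem line in the question.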

answered Nov 27 '22 at 15:11 by Arun
