
"update by reference" vs shallow copy

Tags: r, data.table

The function set and the operator := inside [.data.table allow users to update data.tables by reference. How does this behavior differ from reassigning the result of an operation to the original data.frame?

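# Keep only the columns in cols, reassigning the result to DF in the caller's frame
keepcols <- function(DF, cols) {
  eval.parent(substitute(DF <- DF[, cols, with = FALSE]))
}
# Keep only the rows selected by i, reassigning the result to DF in the caller's frame
keeprows <- function(DF, i) {
  eval.parent(substitute(DF <- DF[i, ]))
}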

Because the RHS of <- is a shallow copy of the initial data frame in recent versions of R, these functions seem pretty efficient. How is this base R method different from the data.table equivalent? Is the difference related only to speed or also to memory use? When is the difference most sizable?

Some (speed) benchmarks. The speed difference seems negligible when the dataset has only two variables, and gets bigger with more variables.

library(data.table)

# Long dataset
N = 1e7; K = 100
DT <- data.table(
  id1 = sample(sprintf("id%03d", 1:K), N, TRUE),
  v1  = sample(5, N, TRUE)
)
system.time(DT[, a_inplace := mean(v1)])
   user  system elapsed 
  0.060   0.013   0.077 
system.time(DT[, a_inplace := NULL])
   user  system elapsed 
  0.044   0.010   0.060 

system.time(DT <- DT[, c(.SD, a_usual = mean(v1)), .SDcols = names(DT)])
   user  system elapsed 
  0.132   0.025   0.161 
system.time(DT <- DT[, list(id1, v1)])
   user  system elapsed 
  0.124   0.026   0.153 


# Wide dataset
N = 1e7; K = 100
DT <- data.table(
  id1 = sample(sprintf("id%03d", 1:K), N, TRUE),
  id2 = sample(sprintf("id%03d", 1:K), N, TRUE),
  id3 = sample(sprintf("id%010d", 1:(N/K)), N, TRUE),
  v1  = sample(5, N, TRUE),
  v2  = sample(1e6, N, TRUE),
  v3  = sample(round(runif(100, max = 100), 4), N, TRUE)
)
system.time(DT[, a_inplace := mean(v1)])
   user  system elapsed 
  0.057   0.014   0.089 
system.time(DT[, a_inplace := NULL])
   user  system elapsed 
  0.038   0.009   0.061 

system.time(DT <- DT[, c(.SD, a_usual = mean(v1)), .SDcols = names(DT)])
   user  system elapsed 
  2.483   0.146   2.602 
system.time(DT <- DT[, list(id1, id2, id3, v1, v2, v3)])
   user  system elapsed 
  1.143   0.088   1.220 
Matthew asked Sep 20 '14 04:09



1 Answer

In data.table, := and all set* functions update objects by reference. This was introduced sometime around 2012, IIRC. At that time, base R did not shallow copy; it deep copied. Shallow copying was introduced in R 3.1.0.


It's a wordy/lengthy answer, but I think this answers your first two questions:

How is this base R method different from the data.table equivalent? Is the difference related only to speed or also memory use?

In base R v3.1.0+ when we do:

DF1 = data.frame(x=1:5, y=6:10, z=11:15)
DF2 = DF1[, c("x", "y")]
DF3 = transform(DF2, y = ifelse(y>=8L, 1L, y))
DF4 = transform(DF2, y = 2L)
  1. From DF1 to DF2, both columns are only shallow copied.
  2. From DF2 to DF3 the column y alone had to be copied/re-allocated, but x gets shallow copied again.
  3. From DF2 to DF4, same as (2).

That is, columns are shallow copied as long as they remain unchanged; in a way, the copy is delayed until it is absolutely necessary.
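A minimal sketch of how one might check this sharing, assuming R 3.1.0 or later and an installed data.table package. Its address() helper is used here only to compare memory addresses; the variable names same_x_12 and same_y_24 are made up for illustration.

```r
library(data.table)  # used only for address()

DF1 <- data.frame(x = 1:5, y = 6:10, z = 11:15)
DF2 <- DF1[, c("x", "y")]
DF4 <- transform(DF2, y = 2L)

# TRUE means both data frames point at the same column vector (shallow copy)
same_x_12 <- address(DF1$x) == address(DF2$x)
# y was replaced by transform(), so DF4 holds a newly allocated y
same_y_24 <- address(DF2$y) == address(DF4$y)

print(c(x_shared = same_x_12, y_shared = same_y_24))
```

Comparing addresses only shows whether two objects are the same vector at that moment; it says nothing about how much memory was used along the way.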

In data.table, we modify in place. That means column y doesn't get copied even for the operations that produced DF3 and DF4 (here DT2 is the data.table counterpart of DF2):

DT2[y >= 8L, y := 1L] ## (a)
DT2[, y := 2L]

Here, since y is already an integer column and we are assigning integer values to it by reference, no new memory is allocated at all.

This is also particularly useful when you'd like to sub-assign by reference (marked as (a) above). This is a handy feature we really like in data.table.
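As a rough illustration of sub-assignment by reference, the sketch below builds a small DT2 (made up here, mirroring DF2 above) and compares the address of column y before and after the update. It assumes data.table is installed; addr_before and addr_after are illustrative names.

```r
library(data.table)

DT2 <- data.table(x = 1:5, y = 6:10)
addr_before <- address(DT2$y)

# Sub-assign by reference, as in (a): only rows with y >= 8 are touched
DT2[y >= 8L, y := 1L]
addr_after <- address(DT2$y)

print(DT2$y)
# An unchanged address indicates the column was modified in place
print(identical(addr_before, addr_after))
```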

Another advantage that comes for free (which I came to know from our interactions) shows up when we have to convert all columns of a data.table from, say, character type to a numeric type:

DT[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]

Here, since we're updating by reference, each character column gets replaced by reference with its numeric counterpart. After that replacement, the earlier character column isn't required anymore and is up for grabs for garbage collection. But if you were to do this using base R:

DF[] = lapply(DF, as.numeric)

All the columns have to be converted to numeric and held in a temporary variable, and only then are they assigned back to DF. That means that if you have 10 columns with 100 million rows, each of character type, then your DF takes up:

10 * 100e6 * 4 / 1024^3 = ~ 3.7GB

And since the numeric type is twice that size, we'll need a total of 7.4GB + 3.7GB (about 11GB) of space to make the conversion using base R.
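A small sketch of the same idea on toy data, assuming data.table is installed. It uses a loop over set() as a variation on the lapply(.SD, ...) call above, so that one column is converted and replaced at a time. The table, the column names and size_gb are made up for illustration, and the size arithmetic simply repeats the 4-bytes-per-element assumption from the text.

```r
library(data.table)

DT <- data.table(a = c("1", "2", "3"), b = c("4.5", "5.5", "6.5"))
cols <- c("a", "b")

# Size arithmetic from the text: 10 columns, 100 million rows, 4 bytes each
size_gb <- 10 * 100e6 * 4 / 1024^3
print(round(size_gb, 1))

# Convert one column at a time; once a character column has been replaced,
# nothing in DT refers to it any more and it can be garbage collected
for (col in cols) {
  set(DT, j = col, value = as.numeric(DT[[col]]))
}

print(sapply(DT, class))
```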

But note that data.table does copy in the step corresponding to DF1 to DF2. That is:

DT2 = DT1[, c("x", "y")]

results in a copy, because we can't sub-assign by reference on a shallow copy: doing so would update all the clones.
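A minimal sketch of why that copy matters, assuming data.table is installed and using a made-up DT1: updating the subset by reference should leave the original table untouched.

```r
library(data.table)

DT1 <- data.table(x = 1:5, y = 6:10, z = 11:15)
DT2 <- DT1[, c("x", "y")]

# Update DT2 by reference; DT1 is a separate object and keeps its values
DT2[y >= 8L, y := 1L]

print(DT1$y)
print(DT2$y)
# Compare the address of column x in the two tables
print(address(DT1$x) == address(DT2$x))
```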

What would be great is if we could seamlessly integrate the shallow copy feature, but keep track of whether a particular object's columns have multiple references, and update by reference wherever possible. R's upgraded reference counting feature might be very useful in this regard. In any case, we're working towards it.


For your last question:

"When is the difference most sizeable?"

  1. There are still people who have to use older versions of R, where deep copies can't be avoided.

  2. It depends on how many columns are being copied because of the operations you perform on the object. The worst case, of course, is that all the columns are copied.

  3. There are cases like this where shallow copying won't benefit.

  4. When you'd like to update columns of a data.frame for each group, and there are too many groups.

  5. When you'd like to update a column of say, data.table DT1 based on a join with another data.table DT2 - this can be done as:

    DT1[DT2, col := i.val]
    

    where the i. prefix refers to the val column of DT2 (the i argument) for the matching rows. This syntax performs the operation very efficiently, instead of first having to materialize the entire join result and then update the required column.
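Points 4 and 5 can be sketched on toy data as follows, assuming a data.table version that supports the on= argument (1.9.6 or later). The tables and the names id, col, val and grp_mean are made up for illustration.

```r
library(data.table)

DT1 <- data.table(id = c("a", "b", "c", "a"), col = c(1, 2, 3, 4))
DT2 <- data.table(id = c("b", "c"), val = c(20, 30))

# Point 4: add a per-group column by reference
DT1[, grp_mean := mean(col), by = id]

# Point 5: update col only for the rows of DT1 that match DT2 on id;
# i.val picks the val column from DT2, the table passed as i
DT1[DT2, col := i.val, on = "id"]

print(DT1)
```

Rows of DT1 with no match in DT2 (here the two "a" rows) keep their existing col values.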

All in all, there are strong arguments that update by reference saves a lot of time and memory. But people sometimes prefer not to update objects in place, and are willing to sacrifice speed/memory for that. We're trying to figure out how best to provide this functionality as well, in addition to the already existing update by reference.

Hope this helps. This is already quite a lengthy answer. I'll leave any questions you might have left to others or for you to figure out (other than any obvious misconceptions in this answer).

Arun answered Oct 19 '22 00:10
