
When should I use the := operator in data.table?

data.table objects now have a := operator. What makes this operator different from all other assignment operators? Also, what are its uses, how much faster is it, and when should it be avoided?

Ari B. Friedman asked Aug 11 '11 17:08




1 Answer

Here is an example showing 10 minutes reduced to 1 second (from NEWS on homepage). It's like subassigning to a data.frame but doesn't copy the entire table each time.

    library(data.table)

    m = matrix(1,nrow=100000,ncol=100)
    DF = as.data.frame(m)
    DT = as.data.table(m)

    # sub-assigning to a data.frame, one row at a time
    system.time(for (i in 1:1000) DF[i,1] <- i)
       user  system elapsed
    287.062 302.627 591.984

    # the same assignment with := on a data.table
    system.time(for (i in 1:1000) DT[i,V1:=i])
       user  system elapsed
      1.148   0.000   1.158

    ( 511 times faster )

Putting the := in j like that allows more idioms:

    DT["a",done:=TRUE]   # binary search for group 'a' and set a flag
    DT[,newcol:=42]      # add a new column by reference (no copy of existing data)
    DT[,col:=NULL]       # remove a column by reference

and:

    DT[,newcol:=sum(v),by=group]  # like a fast transform() by group
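To see those idioms end to end, here is a minimal sketch on a toy table. The table and its column names (grp, v, done, total, newcol) are made up for illustration and are not from the benchmark above. It also shows what "by reference" implies in practice: a plain assignment such as `alias <- DT` does not copy the table, whereas `copy()` does.

```r
library(data.table)

# Toy data; all names here are illustrative assumptions.
DT <- data.table(grp = c("a", "a", "b", "b", "c"), v = 1:5)
setkey(DT, grp)                    # key on grp so DT["a", ...] can use binary search

DT["a", done := TRUE]              # flag the rows of group "a"; other rows get NA
DT[, total := sum(v), by = grp]    # per-group sum, repeated on every row of the group
DT[, done := NULL]                 # remove the flag column again, by reference

alias <- DT                        # plain assignment: same object, no copy
snap  <- copy(DT)                  # copy(): an independent copy

DT[, newcol := 42]                 # add a column by reference

print("newcol" %in% names(alias))  # alias points at the same table as DT
print("newcol" %in% names(snap))   # snap was copied before the change
print(DT)
```

If an untouched version of the table is needed later, taking a `copy()` before using := is the usual way to get one.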

I can't think of any reason to avoid :=, other than inside a for loop. Since := appears inside DT[...], it comes with the small overhead of the [.data.table method, e.g. S3 dispatch and checking for the presence and type of arguments such as i, by, nomatch, etc. So for use inside for loops there is a low-overhead, direct version of := called set. See ?set for more details and examples. The disadvantages of set are that i must be row numbers (no binary search) and that you can't combine it with by. By imposing those restrictions, set can reduce the overhead dramatically.

    system.time(for (i in 1:1000) set(DT,i,"V1",i))
       user  system elapsed
      0.016   0.000   0.018
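As a rough sketch of how set is typically used in loops, the example below runs it row-wise (as in the timing above) and column-wise on a small toy table. The table and its contents are assumptions chosen only to keep the example short. The call follows the form set(x, i, j, value), where leaving i out assigns to the whole column.

```r
library(data.table)

# Small toy table; the contents are illustrative only.
DT <- data.table(V1 = rep(0L, 5), V2 = rep(0L, 5))

# Row-wise loop: i must be row numbers, j names the column.
for (i in 1:5) set(DT, i, "V1", i)

# Column-wise loop: with i omitted, the whole column is replaced.
for (col in names(DT)) set(DT, j = col, value = DT[[col]] * 2L)

print(DT)
```

Because set skips the [.data.table machinery, there is no i lookup by key and no by here; anything that needs either still goes through DT[...] with :=.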
Matt Dowle answered Oct 21 '22 22:10
