Fastest method to replace data values conditionally in data.table (speed comparison)

Tags: r, data.table

Why does the second method become slower as the data.table size increases?

library(data.table)
DF = data.table(x=rep(c("a","b","c"),each=40000000), y=sample(c(1,3,6),40000000,T), v=1:9)

1:

DF1=DF2=DF

system.time(DF[y==6,"y"]<-10)
user  system elapsed 
2.793   0.699   3.497 

2:

system.time(DF1$y[DF1$y==6]<-10)
user  system elapsed 
6.525   1.555   8.107 

3:

system.time(DF2[y==6, y := 10]) # slowest!
user  system elapsed 
7.925   0.626   8.569 

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

Is there any faster way to do this?

asked Aug 13 '15 by user1436187

1 Answer

Your last case is a consequence of the auto-indexing feature in data.table, present since v1.9.4. Read on for the full picture :-).

When you do DT[col == .] or DT[col %in% .], an index is generated automatically on the first run. The index is simply the order of the column you specify, and computing it is quite fast (it uses counting sort / true radix sort).
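You can watch the index appear by peeking at the "index" attribute, where data.table stores the computed order. A minimal sketch; the "__y"-style attribute name is an internal detail and may differ between versions:

require(data.table)
DT = data.table(y = sample(c(1, 3, 6), 100, TRUE))
names(attributes(attr(DT, "index")))  # NULL -- no index yet
invisible(DT[y == 6])                 # first == subset builds the index on 'y'
names(attributes(attr(DT, "index")))  # "__y" -- the stored order of column y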

The table has 120 million rows, and computing that order takes roughly:

# clean session
require(data.table)
set.seed(1L)
DF = data.table(x=rep(c("a","b","c"),each=40000000), y=sample(c(1,3,6),40000000,T), v=1:9)

system.time(data.table:::forderv(DF, "y"))
#    user  system elapsed 
#   3.923   0.736   4.712 

Side note: column y doesn't really need to be double (ordering doubles takes longer). If we convert it to integer type:

DF[, y := as.integer(y)]
system.time(data.table:::forderv(DF, "y"))
#    user  system elapsed 
#   0.569   0.140   0.717 

The advantage is that any subsequent subsets on that column using == or %in% will be blazing fast (Slides, R script, video of Matt's presentation). For example:

# clean session, copy/paste code from above to create DF
system.time(DF[y==6, y := 10])
#    user  system elapsed 
#   4.750   1.121   5.932 

system.time(DF[y==6, y := 10])
#    user  system elapsed 
#   4.002   0.907   4.969 

Oh wait a minute.. it isn't fast. But.. indexing..?!? We're replacing the same column every time with a new value, which changes the order of that column (and thereby removes the index). Let's subset on y, but modify v instead:

# clean session
require(data.table)
set.seed(1L)
DF = data.table(x=rep(c("a","b","c"),each=40000000), y=sample(c(1,3,6),40000000,T), v=1:9)

system.time(DF[y==6, v := 10L])
#    user  system elapsed 
#   4.653   1.071   5.765 
system.time(DF[y==6, v := 10L])
#    user  system elapsed 
#   0.685   0.213   0.910 

options(datatable.verbose=TRUE)
system.time(DF[y==6, v := 10L])
# Using existing index 'y'
# Starting bmerge ...done in 0 secs
# Detected that j uses these columns: v 
# Assigning to 40000059 row subset of 120000000 rows
#    user  system elapsed 
#   0.683   0.221   0.914 

You can see that the subset itself (a binary search using the existing index) takes 0 seconds. Also check ?set2key().
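The retention and invalidation of the index can be observed directly through the same attribute as before. Again a sketch, assuming the internal "__y" naming holds on your version:

require(data.table)
DT = data.table(y = sample(c(1, 3, 6), 100, TRUE), v = 1L)
invisible(DT[y == 6])                 # builds the index on 'y'
names(attributes(attr(DT, "index")))  # "__y"
DT[y == 6, v := 10L]                  # modifying v: the index on y survives
names(attributes(attr(DT, "index")))  # still "__y"
DT[y == 6, y := 10]                   # modifying y itself: its order changes
names(attributes(attr(DT, "index")))  # NULL -- index dropped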

If you're not going to do repeated subsetting, or if, as in your case, you're subsetting and modifying the same column, then it makes sense to disable the feature with options(datatable.auto.index = FALSE) (filed as #1264):

# clean session
require(data.table)
options(datatable.auto.index = FALSE) # disable auto indexing
set.seed(1L)
DF = data.table(x=rep(c("a","b","c"),each=40000000), y=sample(c(1,3,6),40000000,T), v=1:9)

system.time(DF[y==6, v := 10L])
#    user  system elapsed 
#   1.067   0.274   1.367 
system.time(DF[y==6, v := 10L])
#    user  system elapsed 
#   1.100   0.314   1.443 

The difference isn't much here; a plain vector scan, system.time(DF$y == 6), takes 0.448s.

To sum up: in your case, a vector scan makes more sense. But in general, the idea is that it's better to pay the penalty once and get fast results on all future subsets of that column, rather than vector scanning every single time.
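A hedged illustration of that trade-off (the data here is made up for the sketch; timings will depend on your machine):

# pay the ordering cost once on the first subset...
require(data.table)
set.seed(1L)
DT = data.table(y = sample(c(1L, 3L, 6L), 1e8, TRUE))
system.time(DT[y == 6L])  # first run: vector scan + index creation
# ...then reuse the index on every later subset of the same column
system.time(DT[y == 6L])  # subsequent runs: binary search, much faster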

The auto-indexing feature is relatively new; it will be extended over time and probably optimised further (perhaps there are places we've not looked at yet). While answering this question, I realised that we don't show the time spent computing the sort order (with fsort()), and I guess the time spent there might be the reason the timings above are quite close (filed as #1265).


As to why your second case is slow, I'm not quite sure. I suspect it might be due to unnecessary copies on R's part. Which version of R are you using? For the future, please always post your sessionInfo() output.
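If you want to chase the suspected copies yourself, base R's tracemem() prints a message each time the traced object gets duplicated (it needs an R build with memory profiling enabled, which the standard binaries have). A small sketch, not part of the original answer:

require(data.table)
DF1 = data.table(y = sample(c(1, 3, 6), 1e6, TRUE))
tracemem(DF1$y)            # mark the y vector
DF1$y[DF1$y == 6] <- 10    # any "tracemem[...]" line printed here is a copy
untracemem(DF1$y)          # stop tracing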

answered by Arun