data.table speed is slower when assigning a column

Tags: r, data.table

For some reason, data.table seems to assign a new column only about half as fast as base R in this operation. Is there a reason for this?

require(microbenchmark)
require(data.table)
DT = data.table(a = runif(1000000), b = rnorm(1000000))
DF = data.frame(a = runif(1000000), b = rnorm(1000000))

microbenchmark(
  DT[, keycol := seq(1, nrow(DT))],
  DF$keycol <- seq(1, nrow(DF)),
  times = 2
)

Unit: microseconds
expr                                   min      lq      mean    median    uq     max     neval
 DT[, `:=`(keycol, seq(1, nrow(DT)))] 901.109 901.109 921.1220 921.1220 941.135 941.135     2
 DF$keycol <- seq(1, nrow(DF))        487.844 487.844 527.1865 527.1865 566.529 566.529     2

Here is my R version, using data.table version 1.10.4:

> version
           _                           
platform       x86_64-w64-mingw32          
arch           x86_64                      
os             mingw32                     
system         x86_64, mingw32             
status                                     
major          3                           
minor          3.3                         
year           2017                        
month          03                          
day            06                            
svn rev        72310                       
language       R                           
version.string R version 3.3.3 (2017-03-06)
nickname       Another Canoe    
Allen Wang asked Apr 19 '17


3 Answers

It's usually hard to benchmark tasks that run in the millisecond/microsecond range, since it's difficult to measure the actual time accurately and minor perturbations can influence the results greatly.

I don't generally use benchmarking packages, and I'd advise against using them, particularly for update-by-reference operations. Also, when running your code (with times = 100L), I saw the same timing difference when the two expressions were benchmarked together, but the timings were more or less identical when I benchmarked the DT and DF code separately. No idea why.
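Part of the issue with repeating an update-by-reference expression is that the repetitions are not independent: the first evaluation adds the column and every later one merely overwrites it. A minimal sketch of that (the column name is only for illustration):

library(data.table)
DT <- data.table(a = runif(10), b = rnorm(10))
"keycol" %in% names(DT)            # FALSE: the first := will add the column
DT[, keycol := seq_len(nrow(DT))]  # the expression that gets repeated
"keycol" %in% names(DT)            # TRUE: all later repetitions only overwrite it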

Therefore, I'd suggest running something like this:

require(data.table)
set.seed(1L)
N <- 1e6L
DT = data.table(a = runif(N), b = rnorm(N))
DF = data.frame(a = runif(N), b = rnorm(N))

runs <- 100:51
t_dt <- sapply(runs, function(k) {
  # cat("k=",k,"\n",sep="")
  DTlist <- lapply(1:k, function(x) copy(DT))

  t0 = proc.time()
  for (i in 1:k) DTlist[[i]][, keycol := seq(1, nrow(DT))]
  (proc.time()-t0)[["elapsed"]]
})

t_df <- sapply(runs, function(k) {
  # cat("k=",k,"\n",sep="")
  DFlist <- lapply(1:k, function(x) copy(DF))
  t0 = proc.time()
  for (i in 1:k) DFlist[[i]]$keycol <- seq(1, nrow(DF))
  (proc.time()-t0)[["elapsed"]]
})

I've also kept the for-loop size variable so as to average out variations that might arise from differences in the number of runs. I've set the lower limit to 51 since it seemed like a reasonably large number for the benchmark results to be meaningful (and, more importantly, for me not to run out of patience :-)).

We can call the C function assign directly, which is the function that actually performs the update by reference, to avoid any other effects that contribute to runtime, including the [.data.table call overhead. This will serve as the baseline.

t_dt_base <- sapply(runs, function(k) {
  # cat("k=",k,"\n",sep="")
  DTlist <- lapply(1:k, function(x) copy(DT))

  t0 = proc.time()
  for (i in 1:k) .Call("Cassign", DTlist[[i]], NULL, 3L, "keycol", list(seq(1, nrow(DT))), FALSE)
  (proc.time()-t0)[["elapsed"]]
})

ans <- data.table(dt=t_dt/(runs), df=t_df/(runs), dt_base=t_dt_base/(runs)) # average within runs
# fwrite(ans, "timings.csv", sep=",")
(t_mean <- sapply(ans, mean)) # average across runs
#          dt          df     dt_base 
# 0.003250907 0.002789930 0.002735729

The baseline runtime (from the direct call to assign) is more or less the same as df. However, there's a difference of 0.000515178 seconds between dt and the baseline, which we could chalk up to [.data.table overhead (and probably the [[ access of the list).
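As a rough cross-check, under the assumption that it goes through essentially the same code path as the Cassign call above: data.table's set() assigns by reference from R while skipping the [.data.table dispatch, so its timing should land close to the baseline (exact numbers will vary by machine):

library(data.table)
DT <- data.table(a = runif(1e6), b = rnorm(1e6))
# set() updates by reference without going through [.data.table
system.time(set(DT, j = "keycol", value = seq_len(nrow(DT))))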

Running with N <- 1e7L and runs <- 10:5 returns:

        dt         df    dt_base
0.01697659 0.01468419 0.01479067

which gives a difference of 0.00218592 (>> 0.0005). It seems to me that there are other factors, depending on the size of the data.table (?), that also contribute to runtime... I don't have time to investigate that at the moment. But I hope this helps a bit.


PS: While working on this question, I found out that there's a (deep) copy that could be avoided in scenarios like this:

x <- 1:5
.Internal(inspect(x))
# @7fdde2a4ed20 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 1,2,3,4,5
tracemem(x)
dt <- data.table(a=1:5, b=6:10)
dt[, c := x] # 'x' is deep copied here, which could be avoided

This is because this line results in NAM(1) incrementing to NAM(2) (i.e., two symbols are now bound to the value), and data.table internally checks this and makes a deep copy if it sees NAM(2). This could probably be avoided. I'll file an issue ASAP.

NB: this was run in the R console (from iTerm). RStudio seems to create NAM(2) by default even for vectors, which is strange, and I'm not sure why. But that does mean that even if we fix this case, RStudio will still deep copy.
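One rough way to check whether that deep copy actually happened is to compare addresses before and after the assignment, using data.table's address() (an illustrative check; the exact behaviour can depend on the R version and frontend, as noted above):

library(data.table)
x <- 1:5
dt <- data.table(a = 1:5, b = 6:10)
addr_x <- address(x)
dt[, c := x]
identical(addr_x, address(dt$c))  # FALSE would mean 'x' was deep copied into dt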

Answered by Arun


I, too, am pretty impressed by how large the difference is... I guess it's down to the overhead of [.data.table.

By the way, you're not benchmarking this properly -- a more even-footed comparison would not overwrite the column some of the time, but would start from scratch each time, like so:

library(data.table)
library(microbenchmark)  # provides get_nanotime()
set.seed(102340)

times = matrix(nrow = 500, ncol = 2)
colnames(times) = c('DT', 'DF')
for (ii in seq_len(nrow(times))) {
  DT = data.table(a = runif(1000000), b = rnorm(1000000))
  DF = data.frame(a = runif(1000000), b = rnorm(1000000))

  TT0 = get_nanotime()
  DT[ , keycol := seq(1, nrow(DT))]
  TT1 = get_nanotime()
  delDT = TT1 - TT0

  TT0 = get_nanotime()
  DF$keycol <- seq(1,nrow(DF))
  TT1 = get_nanotime()
  delDF = TT1 - TT0
  times[ii, ] = c(delDT, delDF)
}
summary(times)
 #       DT                 DF         
 # Min.   : 1617687   Min.   : 420502  
 # 1st Qu.: 2205314   1st Qu.: 447691  
 # Median : 3297872   Median : 464019  
 # Mean   : 5277059   Mean   : 594214  
 # 3rd Qu.: 4291291   3rd Qu.: 578034  
 # Max.   :75731819   Max.   :2224713 

In either approach, using seq_len(nrow(DT)) instead of seq(1, nrow(DT)) is faster.
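A quick illustration of that (timings are machine-dependent; only the relative order matters):

library(microbenchmark)
n <- 1e6L
# seq_len() handles only the 1..n case, whereas seq() has to dispatch and
# cover the general case, so it's typically slower for this pattern
microbenchmark(
  seq(1, n),
  seq_len(n),
  times = 100L
)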

A decent part of the difference seems attributable to the overhead of [.data.table:

set.seed(102340)

ns = as.integer(10^(1:7))
ratios = numeric(length(ns))
for (nn in seq_along(ns)) {
  times = matrix(nrow = 500L, ncol = 2L)
  for (ii in seq_len(nrow(times))) {
    DT = data.table(a = runif(ns[nn]), 
                    b = rnorm(ns[nn]))
    DF = data.frame(a = runif(ns[nn]), 
                    b = rnorm(ns[nn]))

    TT0.1 = get_nanotime()
    DT[ , keycol := seq_len(nrow(DT))]
    TT1.1 = get_nanotime()
    delDT = TT1.1 - TT0.1

    TT0.2 = get_nanotime()
    DF$keycol <- seq(1,nrow(DF))
    TT1.2 = get_nanotime()
    delDF = TT1.2 - TT0.2

    times[ii, ] = c(delDT, delDF)
  }
  ratios[nn] = median(times[ , 1L])/median(times[ , 2L])
  print(ratios)
}

plot(log10(ns), ratios, type = 'b', lwd = 3L, xaxt = 'n',
     xlab = '# Rows', ylab = 'DT time / DF time',
     main = 'Ratio of DT assignment time\nvs. DF Assignment Time')
axis(side = 1L, at = 1:7, labels = ns)
abline(h = 1, lty = 2L, col = 'red')

[Plot: ratio of DT assignment time to DF assignment time, against the number of rows]

Answered by MichaelChirico


Timings become comparable when N gets larger.

require(microbenchmark)
require(data.table)
N <- 1e7
DT = data.table(a = runif(N), b = rnorm(N))
DF = data.frame(a = runif(N), b = rnorm(N))
#force(DT)

ans <- capture.output(microbenchmark(
    DT[,keycol := seq_len(.N)],
    DT$keycol <- seq_len(nrow(DT)),     #as mentioned in vignette, this is slow
    DT[["keycol"]] <- seq_len(nrow(DT)),
    DT[,"keycol"] <- seq_len(nrow(DT)),

    DF$keycol <- seq_len(nrow(DF)),
    DF[["keycol"]] <- seq_len(nrow(DF)),
    DF[,"keycol"] <- seq_len(nrow(DF)),

    times = 20L))
message(paste0("#",ans,"\n"))

#Unit: milliseconds
#                                expr      min        lq      mean    median        uq      max neval
#     DT[, `:=`(keycol, seq_len(.N))]  16.1415  16.96355  29.26518  17.35340  21.91285 232.7037    20
#      DT$keycol <- seq_len(nrow(DT)) 233.7527 291.84105 385.04133 419.14105 451.05655 469.3172    20
# DT[["keycol"]] <- seq_len(nrow(DT))  15.5652  16.41960  18.81244  16.99350  20.12640  35.2602    20
# DT[, "keycol"] <- seq_len(nrow(DT)) 134.1463 136.92965 197.58160 166.53125 206.34465 394.7461    20
#      DF$keycol <- seq_len(nrow(DF))  14.5780  16.33775  19.65723  17.04340  22.78940  39.9137    20
# DF[["keycol"]] <- seq_len(nrow(DF))  14.4700  16.11845  38.83084  16.49010  22.83845 220.2109    20
# DF[, "keycol"] <- seq_len(nrow(DF))  15.1030  16.45990  26.03781  16.97035  21.90650 137.9879    20

R specs:

sessionInfo()

#R version 3.3.2 (2016-10-31)
#Platform: x86_64-w64-mingw32/x64 (64-bit)
#Running under: Windows 7 x64 (build 7601) Service Pack 1
#
#locale:
#[1] LC_COLLATE=English_Singapore.1252  LC_CTYPE=English_Singapore.1252    LC_MONETARY=English_Singapore.1252 LC_NUMERIC=C                       LC_TIME=English_Singapore.1252    
#
#attached base packages:
#[1] stats     graphics  grDevices utils     datasets  methods   base     
#
#other attached packages:
#[1] data.table_1.10.0      microbenchmark_1.4-2.1
#
#loaded via a namespace (and not attached):
# [1] Rcpp_0.12.8            assertthat_0.1         grid_3.3.2             R6_2.2.0               plyr_1.8.4             gtable_0.2.0           magrittr_1.5           scales_0.4.1          
# [9] ggplot2_2.2.1          httr_1.2.1             lazyeval_0.2.0         rstudioapi_0.6         tools_3.3.2            munsell_0.4.3          RStudioShortKeys_0.1.0 colorspace_1.3-2      
#[17] tibble_1.2            
Answered by chinsoon12