For some reason, this operation seems to show data.table assigning a new column about half as fast as base R. Is there a reason for this?
require(microbenchmark)
require(data.table)
DT = data.table(a = runif(1000000), b = rnorm(1000000))
DF = data.frame(a = runif(1000000), b = rnorm(1000000))
microbenchmark(
DT[,keycol := seq(1,nrow(DT))],
DF$keycol <- seq(1,nrow(DF)),
times = 2)
Unit: microseconds
                                 expr     min      lq     mean   median      uq     max neval
 DT[, `:=`(keycol, seq(1, nrow(DT)))] 901.109 901.109 921.1220 921.1220 941.135 941.135     2
        DF$keycol <- seq(1, nrow(DF)) 487.844 487.844 527.1865 527.1865 566.529 566.529     2
Here is my R version, using data.table version 1.10.4:
> version
_
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 3.3
year 2017
month 03
day 06
svn rev 72310
language R
version.string R version 3.3.3 (2017-03-06)
nickname Another Canoe
Tasks that run in the millisecond or microsecond range are usually hard to benchmark, because minor perturbations can greatly influence the measured time.
I don't generally use any benchmarking packages, and I'd advise against using them for update-by-reference operations in particular. Also, when running your code (with times=100L), I got the same timing differences as you when running the two expressions together, but the timings were more or less identical when I benchmarked the DT and DF code separately. No idea why.
Therefore, I'd suggest running something like this:
require(data.table)
set.seed(1L)
N <- 1e6L
DT = data.table(a = runif(N), b = rnorm(N))
DF = data.frame(a = runif(N), b = rnorm(N))
runs <- 100:51
t_dt <- sapply(runs, function(k) {
# cat("k=",k,"\n",sep="")
DTlist <- lapply(1:k, function(x) copy(DT))
t0 = proc.time()
for (i in 1:k) DTlist[[i]][, keycol := seq(1, nrow(DT))]
(proc.time()-t0)[["elapsed"]]
})
t_df <- sapply(runs, function(k) {
# cat("k=",k,"\n",sep="")
DFlist <- lapply(1:k, function(x) copy(DF))
t0 = proc.time()
for (i in 1:k) DFlist[[i]]$keycol <- seq(1, nrow(DF))
(proc.time()-t0)[["elapsed"]]
})
I've also kept the for loop size variable so as to average out variations that might arise from differences in the number of runs. I've set the lower limit to 51 since it seemed like a reasonably large number at which the benchmark results would still be meaningful (and, more importantly, so that I don't run out of patience :-)).
We can directly call the C function assign, which is the actual function that updates by reference, to avoid any other effects that contribute to runtime, including the [.data.table call overhead. This will serve as the baseline.
t_dt_base <- sapply(runs, function(k) {
# cat("k=",k,"\n",sep="")
DTlist <- lapply(1:k, function(x) copy(DT))
t0 = proc.time()
for (i in 1:k) .Call("Cassign", DTlist[[i]], NULL, 3L, "keycol", list(seq(1, nrow(DT))), FALSE)
(proc.time()-t0)[["elapsed"]]
})
ans <- data.table(dt=t_dt/(runs), df=t_df/(runs), dt_base=t_dt_base/(runs)) # average within runs
# fwrite(ans, "timings.csv", sep=",")
(t_mean <- sapply(ans, mean)) # average across runs
# dt df dt_base
# 0.003250907 0.002789930 0.002735729
The baseline runtime (from the direct call to assign) is more or less the same as df. However, there's a difference of 0.000515178 seconds between dt and the baseline, which we could chalk up to [.data.table overhead (and probably the [[ access of the list).
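When the [.data.table overhead matters (for example, inside a loop), it can be avoided without calling the internal C routine: data.table exports set(), which is documented as a low-overhead way to assign by reference. A minimal sketch, assuming a small made-up table (the name DT_small is illustrative only, and this does not reproduce the timings above):

```r
library(data.table)

# Hypothetical small table, for illustration only
DT_small <- data.table(a = runif(10), b = rnorm(10))

# set() adds the column by reference without going through `[.data.table`
set(DT_small, j = "keycol", value = seq_len(nrow(DT_small)))

names(DT_small)
```

Whether set() closes the whole gap measured here is not something this sketch shows; it only illustrates the call.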
Running with N <- 1e7L and runs <- 10:5 returns:
        dt         df    dt_base
0.01697659 0.01468419 0.01479067
which results in a difference of 0.00218592 (>> 0.0005). It seems to me that there are other factors, depending on the size of the data.table (?), that contribute to runtime... I don't have time to investigate that ATM. But I hope this helps a bit.
PS: While working on this Q, I found out that there's a (deep) copy that could be avoided in scenarios like this:
x <- 1:5
.Internal(inspect(x))
# @7fdde2a4ed20 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 1,2,3,4,5
tracemem(x)
dt <- data.table(a=1:5, b=6:10)
dt[, c := x] # 'x' is deep copied here but should be avoided
This is because the dt[, c := x] line results in NAM(1) incrementing to NAM(2) (i.e., two symbols are bound to the value now). data.table internally checks this and makes a deep copy if the value is NAM(2). This could probably be avoided. I'll file an issue ASAP.
NB: this was run in the R console (from iTerm). RStudio seems to create NAM(2) by default even for vectors, which is strange, and I am not sure why. But that does mean that even if we fix this case, RStudio will still deep copy.
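One way to check for the copy without reading the inspect() output is to compare memory addresses with data.table's exported address() function. This is only a sketch of the check, not of the fix; the result may depend on the R version, the data.table version, and the front end (console vs. RStudio), so no expected output is shown:

```r
library(data.table)

x  <- 1:5
dt <- data.table(a = 1:5, b = 6:10)
dt[, c := x]

# If the two addresses differ, the new column holds its own copy of x
address(x)
address(dt$c)
address(x) == address(dt$c)
```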
I, too, am pretty impressed with how large the difference is... I guess it's the fault of the overhead of [.data.table.
By the way, you're not doing your benchmarking properly -- a more even-footed comparison would not be overwriting the column some of the time, but starting from scratch each time like so:
library(data.table)
library(microbenchmark) # provides get_nanotime()
set.seed(102340)
times = matrix(nrow = 500, ncol = 2)
colnames(times) = c('DT', 'DF')
for (ii in seq_len(nrow(times))) {
DT = data.table(a = runif(1000000), b = rnorm(1000000))
DF = data.frame(a = runif(1000000), b = rnorm(1000000))
TT0 = get_nanotime()
DT[ , keycol := seq(1, nrow(DT))]
TT1 = get_nanotime()
delDT = TT1 - TT0
TT0 = get_nanotime()
DF$keycol <- seq(1,nrow(DF))
TT1 = get_nanotime()
delDF = TT1 - TT0
times[ii, ] = c(delDT, delDF)
}
summary(times)
# DT DF
# Min. : 1617687 Min. : 420502
# 1st Qu.: 2205314 1st Qu.: 447691
# Median : 3297872 Median : 464019
# Mean : 5277059 Mean : 594214
# 3rd Qu.: 4291291 3rd Qu.: 578034
# Max. :75731819 Max. :2224713
Either approach is faster with seq_len(nrow(DT)) instead of seq(1, nrow(DT)).
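Apart from speed, the two calls agree for tables with at least one row but differ for an empty one, which is another reason to prefer seq_len(). A small base-R illustration (n stands in for nrow(DT)):

```r
n <- 5L
all(seq_len(n) == seq(1, n))  # TRUE: same values for n >= 1

# For zero rows, seq(1, 0) counts down instead of returning nothing
seq_len(0L)  # integer(0)
seq(1, 0)    # 1 0
```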
A decent part of the difference seems to be attributable to the overhead of [.data.table:
set.seed(102340)
ns = as.integer(10^(1:7))
ratios = numeric(length(ns))
for (nn in seq_along(ns)) {
times = matrix(nrow = 500L, ncol = 2L)
for (ii in seq_len(nrow(times))) {
DT = data.table(a = runif(ns[nn]),
b = rnorm(ns[nn]))
DF = data.frame(a = runif(ns[nn]),
b = rnorm(ns[nn]))
TT0.1 = get_nanotime()
DT[ , keycol := seq_len(nrow(DT))]
TT1.1 = get_nanotime()
delDT = TT1.1 - TT0.1
TT0.2 = get_nanotime()
DF$keycol <- seq_len(nrow(DF)) # seq_len() on both sides for a like-for-like comparison
TT1.2 = get_nanotime()
delDF = TT1.2 - TT0.2
times[ii, ] = c(delDT, delDF)
}
ratios[nn] = median(times[ , 1L])/median(times[ , 2L])
print(ratios)
}
plot(log10(ns), ratios, type = 'b', lwd = 3L, xaxt = 'n',
xlab = '# Rows', ylab = 'DT time / DF time',
main = 'Ratio of DT assignment time\nvs. DF Assignment Time')
axis(side = 1L, at = 1:7, labels = ns)
abline(h = 1, lty = 2L, col = 'red')
Timings become comparable when N gets larger.
require(microbenchmark)
require(data.table)
N <- 1e7
DT = data.table(a = runif(N), b = rnorm(N))
DF = data.frame(a = runif(N), b = rnorm(N))
#force(DT)
ans <- capture.output(microbenchmark(
DT[,keycol := seq_len(.N)],
DT$keycol <- seq_len(nrow(DT)), #as mentioned in vignette, this is slow
DT[["keycol"]] <- seq_len(nrow(DT)),
DT[,"keycol"] <- seq_len(nrow(DT)),
DF$keycol <- seq_len(nrow(DF)),
DF[["keycol"]] <- seq_len(nrow(DF)),
DF[,"keycol"] <- seq_len(nrow(DF)),
times = 20L))
message(paste0("#",ans,"\n"))
#Unit: milliseconds
# expr min lq mean median uq max neval
# DT[, `:=`(keycol, seq_len(.N))] 16.1415 16.96355 29.26518 17.35340 21.91285 232.7037 20
# DT$keycol <- seq_len(nrow(DT)) 233.7527 291.84105 385.04133 419.14105 451.05655 469.3172 20
# DT[["keycol"]] <- seq_len(nrow(DT)) 15.5652 16.41960 18.81244 16.99350 20.12640 35.2602 20
# DT[, "keycol"] <- seq_len(nrow(DT)) 134.1463 136.92965 197.58160 166.53125 206.34465 394.7461 20
# DF$keycol <- seq_len(nrow(DF)) 14.5780 16.33775 19.65723 17.04340 22.78940 39.9137 20
# DF[["keycol"]] <- seq_len(nrow(DF)) 14.4700 16.11845 38.83084 16.49010 22.83845 220.2109 20
# DF[, "keycol"] <- seq_len(nrow(DF)) 15.1030 16.45990 26.03781 16.97035 21.90650 137.9879 20
R specs:
sessionInfo()
#R version 3.3.2 (2016-10-31)
#Platform: x86_64-w64-mingw32/x64 (64-bit)
#Running under: Windows 7 x64 (build 7601) Service Pack 1
#
#locale:
#[1] LC_COLLATE=English_Singapore.1252 LC_CTYPE=English_Singapore.1252 LC_MONETARY=English_Singapore.1252 LC_NUMERIC=C LC_TIME=English_Singapore.1252
#
#attached base packages:
#[1] stats graphics grDevices utils datasets methods base
#
#other attached packages:
#[1] data.table_1.10.0 microbenchmark_1.4-2.1
#
#loaded via a namespace (and not attached):
# [1] Rcpp_0.12.8 assertthat_0.1 grid_3.3.2 R6_2.2.0 plyr_1.8.4 gtable_0.2.0 magrittr_1.5 scales_0.4.1
# [9] ggplot2_2.2.1 httr_1.2.1 lazyeval_0.2.0 rstudioapi_0.6 tools_3.3.2 munsell_0.4.3 RStudioShortKeys_0.1.0 colorspace_1.3-2
#[17] tibble_1.2