Usually I use the functional form `:=`() to compute multiple columns in a data.table, thinking that this is the most efficient method. But I've recently discovered that it is slower than simply calling := repeatedly, at least on my machine. I'm guessing that there is some overhead in the functional form of :=, but is that the entire reason why it's slower? I'm asking purely out of curiosity, to understand the internals of data.table better.
library(data.table)

n <- 5000000
dt <- data.table(a = rnorm(n),
                 b = rnorm(n),
                 c = rnorm(n))

dt_a <- copy(dt)
system.time({
  dt_a[, d := a + b]
  dt_a[, e := b + c]
  dt_a[, f := a + c]
})
#> user system elapsed
#> 0.076 0.060 0.136
dt_b <- copy(dt)
system.time({
  dt_b[, `:=`(d = a + b,
              e = b + c,
              f = a + c)]
})
#> user system elapsed
#> 0.096 0.116 0.211
One interesting property is that the time difference between := and `:=`() is relative, at roughly a factor of 1.5 to 2, rather than a constant offset. If it were simply due to function-call overhead, as some suggest, I would suspect the difference to be a fixed amount of time, not one that grows with the data size? (A small-table sketch that tries to isolate the fixed overhead follows the larger benchmark below.)
library(data.table)

n <- 20000000
dt <- data.table(a = rnorm(n),
                 b = rnorm(n),
                 c = rnorm(n))

dt_a <- copy(dt)
system.time({
  dt_a[, d := a + b]
  dt_a[, e := b + c]
  dt_a[, f := a + c]
})
#> user system elapsed
#> 0.163 0.208 0.371
dt_b <- copy(dt)
system.time({
  dt_b[, `:=`(d = a + b,
              e = b + c,
              f = a + c)]
})
#> user system elapsed
#> 0.284 0.404 0.688
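One way to probe the fixed-overhead idea (a sketch of my own, assuming the microbenchmark package is available) is to rerun the comparison on a tiny table, where the per-element arithmetic is negligible: any gap that remains there is fixed call overhead, while a gap that only shows up at large n must scale with the data.

library(data.table)
library(microbenchmark)

dt_small <- data.table(a = rnorm(10), b = rnorm(10), c = rnorm(10))

# With 10 rows the arithmetic is negligible, so the timings mostly
# reflect per-call overhead; copy() appears in both branches, so its
# cost cancels out. Note that the repeated form makes three
# `[.data.table` calls per iteration, the functional form only one.
microbenchmark(
  repeated = {
    x <- copy(dt_small)
    x[, d := a + b]
    x[, e := b + c]
    x[, f := a + c]
  },
  functional = {
    y <- copy(dt_small)
    y[, `:=`(d = a + b, e = b + c, f = a + c)]
  },
  times = 100L
)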
Some timings:
bench::mark(
  chaining = DT0[, d := a + b][, e := b + c][, f := a + c],
  assign = DT1[, c("d", "e", "f") := .(a + b, b + c, a + c)],
  assign2 = DT1.1[, `:=`(d, a + b)][, `:=`(e, b + c)][, `:=`(f, a + c)],
  use_set = {
    set(DT2, NULL, "d", DT2[["a"]] + DT2[["b"]])
    set(DT2, NULL, "e", DT2[["b"]] + DT2[["c"]])
    set(DT2, NULL, "f", DT2[["a"]] + DT2[["c"]])
  },
  functional = DT3[, `:=`(d = a + b, e = b + c, f = a + c)]
)
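For reference: chaining and assign2 create one column per call (three `[.data.table` calls each), assign and functional create all three columns in a single call, and use_set makes three calls to set(), which the data.table documentation describes as the low-overhead, loopable counterpart of :=.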
Timings and memory usage (the truncated list columns result, memory, time and gc are omitted):

  expression    min   mean median    max `itr/sec` mem_alloc  n_gc n_itr total_time
1 chaining    180ms  180ms  180ms  180ms      5.54     458MB     1     1      180ms
2 assign      320ms  320ms  320ms  320ms      3.12     916MB     1     1      320ms
3 assign2     188ms  188ms  188ms  188ms      5.33     458MB     1     1      188ms
4 use_set     322ms  323ms  323ms  323ms      3.10     916MB     0     2      645ms
5 functional  331ms  331ms  331ms  331ms      3.02     916MB     1     1      331ms
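The mem_alloc column lines up with simple arithmetic: one new double column at n = 2e7 is roughly 153MiB, so the ~458MB allocated by chaining and assign2 is exactly the three result columns, while the remaining variants allocate twice that. A quick back-of-the-envelope check (my addition):

# One new double column: 2e7 values at 8 bytes each, in MiB
2e7 * 8 / 2^20      #> 152.5879
# Three columns, matching the 458MB rows above
3 * 2e7 * 8 / 2^20  #> 457.7637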
Data:
library(data.table) #data.table_1.12.2
set.seed(0L)
n <- 2e7
DT <- data.table(a=rnorm(n), b=rnorm(n), c=rnorm(n))
DT0 <- copy(DT)
DT1 <- copy(DT)
DT1.1 <- copy(DT)
DT2 <- copy(DT)
DT3 <- copy(DT)
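As a sanity check (my addition, not part of the original answer), one can verify that all five idioms leave the tables with identical columns after a benchmark run; bench::mark() also checks by default that all expressions return equivalent results:

# After the benchmark, every table should carry the same d, e, f columns
all.equal(DT0, DT1)    # TRUE expected
all.equal(DT0, DT1.1)  # TRUE expected
all.equal(DT0, DT2)    # TRUE expected
all.equal(DT0, DT3)    # TRUE expected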