
Why is := faster than `:=`()?

Tags:

r

data.table

Usually, I use the functional form `:=`() to compute multiple columns in a data.table, assuming it is the most efficient method. But I recently discovered that it is slower than simply using := repeatedly, at least on my computer.

I'm guessing that the functional form of := carries some overhead, but is that the entire reason it's slower? I'm asking out of curiosity, to better understand the internals of data.table.


library(data.table)


n <- 5000000
dt <- data.table(a = rnorm(n),
                 b = rnorm(n),
                 c = rnorm(n))

dt_a <- copy(dt)

system.time({
  dt_a[, d := a + b]
  dt_a[, e := b + c]
  dt_a[, f := a + c]
})
#>    user  system elapsed 
#>   0.076   0.060   0.136

dt_b <- copy(dt)

system.time({
  dt_b[, `:=`(d = a + b,
              e = b + c,
              f = a + c)]
})
#>    user  system elapsed 
#>   0.096   0.116   0.211

Update:

One interesting property is that the time difference between := and `:=`() is relative: the functional form is slower by a factor of about 1.5 to 2. If this were simply due to function-call overhead, as some suggest, shouldn't the time difference be a fixed value?


library(data.table)


n <- 20000000
dt <- data.table(a = rnorm(n),
                 b = rnorm(n),
                 c = rnorm(n))

dt_a <- copy(dt)

system.time({
  dt_a[, d := a + b]
  dt_a[, e := b + c]
  dt_a[, f := a + c]
})
#>    user  system elapsed 
#>   0.163   0.208   0.371

dt_b <- copy(dt)

system.time({
  dt_b[, `:=`(d = a + b,
              e = b + c,
              f = a + c)]
})
#>    user  system elapsed 
#>   0.284   0.404   0.688
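
To check whether the gap really scales with the data rather than staying constant, both forms can be timed at several sizes. This is only a rough sketch: `time_both` is a hypothetical helper, the row counts are arbitrary, and single `system.time()` runs are noisy, so the ratios will vary between machines and runs.

```r
library(data.table)

# Hypothetical helper: time both forms on a fresh table with n rows.
time_both <- function(n) {
  dt <- data.table(a = rnorm(n), b = rnorm(n), c = rnorm(n))
  dt_a <- copy(dt)
  dt_b <- copy(dt)

  # Three separate := calls
  repeated <- system.time({
    dt_a[, d := a + b]
    dt_a[, e := b + c]
    dt_a[, f := a + c]
  })[["elapsed"]]

  # One functional `:=`() call
  functional <- system.time(
    dt_b[, `:=`(d = a + b, e = b + c, f = a + c)]
  )[["elapsed"]]

  data.frame(n = n, repeated = repeated, functional = functional,
             ratio = functional / repeated)
}

# Arbitrary sizes; a constant overhead would make the ratio shrink as n grows.
res <- do.call(rbind, lapply(c(1e5, 1e6, 5e6), time_both))
print(res)
```

If the ratio stays roughly the same as n grows, the extra cost is proportional to the amount of data and not a fixed per-call overhead.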


asked Jun 19 '19 11:06

petrovski

1 Answer

Some timings:


bench::mark(
    chaining = DT0[, d := a + b][, e := b + c][, f := a + c],
    assign = DT1[, c("d", "e", "f") := .(a+b, b+c, a+c)],
    assign2 = DT1.1[, `:=` (d, a + b)][, `:=` (e, b + c)][, `:=` (f, a + c)],
    use_set = {
        set(DT2, NULL, "d", DT2[["a"]]+DT2[["b"]])
        set(DT2, NULL, "e", DT2[["b"]]+DT2[["c"]])
        set(DT2, NULL, "f", DT2[["a"]]+DT2[["c"]])
    },
    functional = DT3[, `:=`(d = a + b, e = b + c, f = a + c)]
)

Timings and memory usage:

  expression     min    mean  median     max `itr/sec` mem_alloc  n_gc n_itr total_time result           memory      time   gc       
  <chr>      <bch:t> <bch:t> <bch:t> <bch:t>     <dbl> <bch:byt> <dbl> <int>   <bch:tm> <list>           <list>      <list> <list>   
1 chaining     180ms   180ms   180ms   180ms      5.54     458MB     1     1      180ms <data.table [20~ <Rprofmem ~ <bch:~ <tibble ~
2 assign       320ms   320ms   320ms   320ms      3.12     916MB     1     1      320ms <data.table [20~ <Rprofmem ~ <bch:~ <tibble ~
3 assign2      188ms   188ms   188ms   188ms      5.33     458MB     1     1      188ms <data.table [20~ <Rprofmem ~ <bch:~ <tibble ~
4 use_set      322ms   323ms   323ms   323ms      3.10     916MB     0     2      645ms <data.table [20~ <Rprofmem ~ <bch:~ <tibble ~
5 functional   331ms   331ms   331ms   331ms      3.02     916MB     1     1      331ms <data.table [20~ <Rprofmem ~ <bch:~ <tibble ~
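
The mem_alloc column can be sanity-checked with a little arithmetic. The sketch below assumes each new column is a plain double vector (8 bytes per element) and that bench reports sizes in units of 1024^2 bytes; it only shows that the numbers are consistent with one versus two allocations of the three new columns, not where in data.table an extra copy would happen.

```r
n <- 2e7                 # rows, as in the benchmark data
bytes_per_double <- 8    # assumption: columns are plain double vectors

# Size of the three new columns d, e, f, in MiB
three_cols_mib <- 3 * n * bytes_per_double / 2^20
three_cols_mib           # about 457.8, matching the 458MB rows

# Twice that size
2 * three_cols_mib       # about 915.5, matching the 916MB rows
```

So the faster variants (chaining, assign2) allocate roughly one set of the new columns, while the slower ones (assign, use_set, functional) allocate roughly two.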

Data (run this before the benchmark):

library(data.table) #data.table_1.12.2  
set.seed(0L)
n <- 2e7
DT <- data.table(a=rnorm(n), b=rnorm(n), c=rnorm(n))
DT0 <- copy(DT)
DT1 <- copy(DT)
DT1.1 <- copy(DT)
DT2 <- copy(DT)
DT3 <- copy(DT)

answered Oct 01 '22 01:10
