This is related to this question (Can I access repeated column names in `j` in a data.table join?), which I asked because I assumed that the opposite of what is shown here was true.
Suppose you wish to join two data.tables and then perform a simple operation on two joined columns. This can be done either in one or two calls to `[.data.table`:
library(data.table)
N = 1000000
DT1 = data.table(name = 1:N, value = rnorm(N))
DT2 = data.table(name = 1:N, value1 = rnorm(N))
setkey(DT1, name)
system.time({x = DT1[DT2, value1 - value]}) # One Step
system.time({x = DT1[DT2][, value1 - value]}) # Two Step
It turns out that making two calls - doing the join first and then the subtraction - is noticeably quicker than doing it all in one go.
> system.time({x = DT1[DT2, value1 - value]})
user system elapsed
0.67 0.00 0.67
> system.time({x = DT1[DT2][, value1 - value]})
user system elapsed
0.14 0.01 0.16
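For a more stable comparison than a single system.time() call, the two forms can also be benchmarked repeatedly - a minimal sketch, assuming the microbenchmark package is available (timings will of course vary by machine):

library(data.table)
library(microbenchmark)   # assumption: microbenchmark is installed

N = 1000000
DT1 = data.table(name = 1:N, value = rnorm(N))
DT2 = data.table(name = 1:N, value1 = rnorm(N))
setkey(DT1, name)

# run each expression several times to smooth out run-to-run noise
microbenchmark(
  one_step = DT1[DT2, value1 - value],
  two_step = DT1[DT2][, value1 - value],
  times = 10
)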
Why is this?
If you put a LOT of columns into the data.table, then you do eventually find that the one-step approach is quicker - presumably because data.table only uses the columns you reference in j.
N = 1000000
DT1 = data.table(name = 1:N, value = rnorm(N))[, (letters) := pi][, (LETTERS) := pi][, (month.abb) := pi]
DT2 = data.table(name = 1:N, value1 = rnorm(N))[, (letters) := pi][, (LETTERS) := pi][, (month.abb) := pi]
setkey(DT1, name)
system.time({x = DT1[DT2, value1 - value]})
system.time({x = DT1[DT2][, value1 - value]})
> system.time({x = DT1[DT2, value1 - value]})
user system elapsed
0.89 0.02 0.90
> system.time({x = DT1[DT2][, value1 - value]})
user system elapsed
1.64 0.16 1.81
I think this is due to the repeated subsetting that DT1[DT2, value1-value] performs for every name in DT2. That is, you have to perform a j operation for each i here, as opposed to just one j operation after the join. This becomes quite costly with 1e6 unique entries - the overhead of [.data.table becomes significant and noticeable.
DT1[DT2][, value1-value]  # (1) two-step: join once, then subtract
DT1[DT2, value1-value]    # (2) one-step: j evaluated for each i
In the first case, DT1[DT2], you perform the join first, and it is really fast. Of course, with more columns, as you show, you'll see a difference; but the point is that the join is performed only once. In the second case, you're grouping DT1 by DT2's name, and for every one of them you're computing the difference. That is, you're subsetting DT1 for each value of name in DT2 - one j operation per subset! You can see this better by just running this:
Rprof()
t1 <- DT1[DT2, value1-value]     # one-step
Rprof(NULL)
summaryRprof()
# $by.self
# self.time self.pct total.time total.pct
# "[.data.table" 0.96 97.96 0.98 100.00
# "-" 0.02 2.04 0.02 2.04
Rprof()
t2 <- DT1[DT2][, value1-value]   # two-step
Rprof(NULL)
summaryRprof()
# $by.self
# self.time self.pct total.time total.pct
# "[.data.table" 0.22 84.62 0.26 100.00
# "-" 0.02 7.69 0.02 7.69
# "is.unsorted" 0.02 7.69 0.02 7.69
This overhead from repeated subsetting seems to be overcome when you have very many columns, and the join on many columns takes over as the time-consuming operation. You can check this yourself by profiling the wide-column code in the same way - a sketch of that follows below.
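For completeness, a minimal sketch of that profiling run (assuming the wide DT1 and DT2 from the second example above have already been built and keyed; the variable names t3 and t4 are illustrative only):

Rprof()
t3 <- DT1[DT2, value1 - value]        # one-step on the wide tables
Rprof(NULL)
summaryRprof()

Rprof()
t4 <- DT1[DT2][, value1 - value]      # two-step on the wide tables
Rprof(NULL)
summaryRprof()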