This is related to this question (Can I access repeated column names in `j` in a data.table join?), which I asked because I assumed that the opposite of what is shown here was true.
Suppose you wish to join two data.tables and then perform a simple operation on two joined columns. This can be done either in one or two calls to `[.data.table`:
library(data.table)
N = 1000000
DT1 = data.table(name = 1:N, value = rnorm(N))
DT2 = data.table(name = 1:N, value1 = rnorm(N))
setkey(DT1, name)
system.time({x = DT1[DT2, value1 - value]}) # One Step
system.time({x = DT1[DT2][, value1 - value]}) # Two Step
It turns out that making two calls - doing the join first and then the subtraction - is noticeably quicker than doing it all in one go.
> system.time({x = DT1[DT2, value1 - value]})
user system elapsed
0.67 0.00 0.67
> system.time({x = DT1[DT2][, value1 - value]})
user system elapsed
0.14 0.01 0.16
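For a more stable comparison than a single system.time() call, the two forms can also be benchmarked repeatedly - a minimal sketch, assuming the microbenchmark package is available (timings will of course vary by machine):

library(data.table)
library(microbenchmark)   # assumption: microbenchmark is installed

N = 1000000
DT1 = data.table(name = 1:N, value = rnorm(N))
DT2 = data.table(name = 1:N, value1 = rnorm(N))
setkey(DT1, name)

# run each expression several times to smooth out run-to-run noise
microbenchmark(
  one_step = DT1[DT2, value1 - value],
  two_step = DT1[DT2][, value1 - value],
  times = 10
)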
Why is this?
If you put a LOT of columns into the data.table, then you do eventually find that the one-step approach is quicker - presumably because data.table only uses the columns you reference in j.
N = 1000000
DT1 = data.table(name = 1:N, value = rnorm(N))[, (letters) := pi][, (LETTERS) := pi][, (month.abb) := pi]
DT2 = data.table(name = 1:N, value1 = rnorm(N))[, (letters) := pi][, (LETTERS) := pi][, (month.abb) := pi]
setkey(DT1, name)
system.time({x = DT1[DT2, value1 - value]})
system.time({x = DT1[DT2][, value1 - value]})
> system.time({x = DT1[DT2, value1 - value]})
user system elapsed
0.89 0.02 0.90
> system.time({x = DT1[DT2][, value1 - value]})
user system elapsed
1.64 0.16 1.81
I think this is due to the repeated subsetting that DT1[DT2, value1-value] performs for every name in DT2. That is, you have to perform a j operation for each i here, as opposed to just one j operation after the join. This becomes quite costly with 1e6 unique entries - the overhead of [.data.table becomes significant and noticeable.
DT1[DT2][, value1-value]  # (1) two-step: join once, then subtract
DT1[DT2, value1-value]    # (2) one-step: j evaluated for each i
In the first case, DT1[DT2], you perform the join first, and it is really fast. Of course, with more columns, as you show, you'll see a difference; but the point is that the join is performed only once. In the second case, you're grouping DT1 by DT2's name, and for every one of them you're computing the difference. That is, you're subsetting DT1 for each value of name in DT2 - one j operation per subset! You can see this better by just running this:
Rprof()
t1 <- DT1[DT2, value1-value]     # one-step
Rprof(NULL)
summaryRprof()
# $by.self
# self.time self.pct total.time total.pct
# "[.data.table" 0.96 97.96 0.98 100.00
# "-" 0.02 2.04 0.02 2.04
Rprof()
t2 <- DT1[DT2][, value1-value]   # two-step
Rprof(NULL)
summaryRprof()
# $by.self
# self.time self.pct total.time total.pct
# "[.data.table" 0.22 84.62 0.26 100.00
# "-" 0.02 7.69 0.02 7.69
# "is.unsorted" 0.02 7.69 0.02 7.69
This overhead from repeated subsetting seems to be overcome when you have very many columns, and the join on many columns takes over as the time-consuming operation. You can check this yourself by profiling the wide-column code in the same way - a sketch of that follows below.
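For completeness, a minimal sketch of that profiling run (assuming the wide DT1 and DT2 from the second example above have already been built and keyed; the variable names t3 and t4 are illustrative only):

Rprof()
t3 <- DT1[DT2, value1 - value]        # one-step on the wide tables
Rprof(NULL)
summaryRprof()

Rprof()
t4 <- DT1[DT2][, value1 - value]      # two-step on the wide tables
Rprof(NULL)
summaryRprof()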