Why is DT1[DT2][, value1-value] faster than DT1[DT2, value1-value] on a data.table with fewer columns?

Tags: r, data.table

This is related to this question (Can I access repeated column names in `j` in a data.table join?), which I asked because I assumed that the opposite of this was true.

data.table with just 2 columns:

Suppose you wish to join two data.tables and then perform a simple operation on two of the joined columns. This can be done in either one or two calls to `[`:

library(data.table)

N = 1000000
DT1 = data.table(name = 1:N, value = rnorm(N))
DT2 = data.table(name = 1:N, value1 = rnorm(N))
setkey(DT1, name)

system.time({x = DT1[DT2, value1 - value]})     # One Step

system.time({x = DT1[DT2][, value1 - value]})   # Two Step

It turns out that making two calls - doing the join first, and then the subtraction - is noticeably quicker than doing it all in one go.

> system.time({x = DT1[DT2, value1 - value]})
   user  system elapsed 
   0.67    0.00    0.67 
> system.time({x = DT1[DT2][, value1 - value]})
   user  system elapsed 
   0.14    0.01    0.16 

Why is this?

data.table with many columns:

If you put a lot of columns into the data.table, then you eventually find that the one-step approach is quicker - presumably because data.table only uses the columns you reference in j.

N = 1000000
DT1 = data.table(name = 1:N, value = rnorm(N))[, (letters) := pi][, (LETTERS) := pi][, (month.abb) := pi]
DT2 = data.table(name = 1:N, value1 = rnorm(N))[, (letters) := pi][, (LETTERS) := pi][, (month.abb) := pi]
setkey(DT1, name)
system.time({x = DT1[DT2, value1 - value]})
system.time({x = DT1[DT2][, value1 - value]})

> system.time({x = DT1[DT2, value1 - value]})
   user  system elapsed 
   0.89    0.02    0.90 
> system.time({x = DT1[DT2][, value1 - value]})
   user  system elapsed 
   1.64    0.16    1.81 
Asked Jul 18 '13 by Corvus



1 Answer

I think this is due to the repeated subsetting that DT1[DT2, value1-value] performs for every name in DT2. That is, you have to perform one j operation for each i here, as opposed to a single j operation after the join. With 1e6 unique entries this becomes quite costly - the per-call overhead of [.data.table becomes significant and noticeable.

DT1[DT2][, value1-value] # join once, then a single vectorised subtraction
DT1[DT2, value1-value]   # j evaluated once per matching row of DT2

In the first case, DT1[DT2][, value1-value], you perform the join first, which is really fast, and then compute the difference once over the whole result. Of course, with more columns, as you show, the join itself becomes expensive; but the point is that it is performed just once. In the second case, you're grouping DT1 by DT2's name and computing the difference for every one of those groups. That is, you're subsetting DT1 for each value of DT2's name - one j operation per subset! You can see this by profiling both forms:

Rprof()
t1 <- DT1[DT2, value1-value]
Rprof(NULL)
summaryRprof()

# $by.self
#                self.time self.pct total.time total.pct
# "[.data.table"      0.96    97.96       0.98    100.00
# "-"                 0.02     2.04       0.02      2.04

Rprof()
t2 <- DT1[DT2][, value1-value]
Rprof(NULL)
summaryRprof()

# $by.self
#                self.time self.pct total.time total.pct
# "[.data.table"      0.22    84.62       0.26    100.00
# "-"                 0.02     7.69       0.02      7.69
# "is.unsorted"       0.02     7.69       0.02      7.69

This overhead from repeated subsetting is eventually outweighed when you have very many columns, and the join itself takes over as the time-consuming operation. You can check this yourself by profiling the many-column version, as in the sketch below.
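
For example, a sketch of that check, assuming DT1 and DT2 have been rebuilt with the extra letters/LETTERS/month.abb columns from the question - you should see the [.data.table time shift toward the join in the two-step form, since DT1[DT2] now has to materialise every column of both tables before j runs:

Rprof()
t3 = DT1[DT2, value1 - value]      # one step: only name, value and value1 are needed
Rprof(NULL)
summaryRprof()$by.self

Rprof()
t4 = DT1[DT2][, value1 - value]    # two step: the full-width join is built first
Rprof(NULL)
summaryRprof()$by.self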

Answered by Arun