What I really like about data.table is the := idiom for changing the table by reference, without the need for costly copies. From what I understand, this is one of the aspects that makes data.table so ultrafast compared to other methods.
Now, I started playing around with the dplyr package, which seems to be equally performant. But since results still have to be assigned using the <- operator, I was expecting a performance drain at this level. However, there seems to be none.
As an example:
library(dplyr)
library(data.table)
library(Lahman)
library(microbenchmark)
library(ggplot2)
df <- Batting[ c("yearID", "teamID", "G_batting") ]
mb <- microbenchmark(
  dplyr = {
    tb <- tbl_df( df )
    tb <- tb %>%
      group_by( yearID, teamID ) %>%
      mutate( G_batting = max(G_batting) )
  },
  data.table = {
    dt <- as.data.table( df )
    dt[ , G_batting := max(G_batting), by = list( yearID, teamID ) ]
  },
  times = 500
)
qplot( data = mb, x = expr, y = time * 1E-6, geom = "boxplot", ylab="time [ms]", xlab = "approach" )
I am just wondering how this is possible. Or is there a conceptual mistake in the way I benchmark? Is my understanding of <- wrong?
Nice question. In general, I'd benchmark on a data size that's big enough not to fit (almost) entirely in the cache. Have a look here under "initial setup". It really isn't meaningful to compare tools developed for (in-memory) big data on tasks that run in milliseconds. We are planning to benchmark on relatively bigger data in the future.
Additionally, if your intent is to find out whether mutate performs a copy, then all you have to do is check the addresses before and after (this can be done using .Internal(inspect(.)) in base R, or using the function changes() in dplyr).
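For example, a minimal base R sketch of that address check (the printed addresses are illustrative and will differ on your machine):
df <- data.frame(x = 1:5, y = 6:10)
.Internal(inspect(df))   # note the addresses of the x and y vectors
df$y <- df$y + 1L        # replaces y with a newly allocated vector
.Internal(inspect(df))   # y now lives at a new address; x is unchanged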
There are two different things to be checked here. A) creating a new column, and B) modifying an existing column.
A) Creating a new column:
require(dplyr)
require(data.table)
df <- tbl_df(data.frame(x=1:5, y=6:10))
df2 <- mutate(df, z=1L)
changes(df, df2)
# Changed variables:
# old new
# z 0x105ec36d0
It tells you that there are no changes in the addresses of x and y, and points out z, which we just added. What's happening here? dplyr shallow copies the data.frame and then adds the new column. A shallow copy, as opposed to a deep copy, just copies the vector of column pointers, not the data itself. Therefore it should be fast. Basically, df2 is created with 3 columns, where the first two columns point to the same address locations as those of df and the 3rd column was just created.
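You can check this directly (a small sketch using address() from data.table, which is already loaded above):
address(df$x) == address(df2$x)   # TRUE: both point at the same vector
address(df$y) == address(df2$y)   # TRUE: y wasn't copied either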
On the other hand, data.table doesn't have to shallow copy at all, as it modifies the column by reference (in place). data.table also (cleverly) over-allocates a list of column vectors, which allows fast adding of (new) columns by reference.
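You can see that over-allocation with truelength() (a sketch; the exact number depends on the datatable.alloccol option, which defaults to 1024 extra column pointer slots):
dt <- data.table(x = 1:5, y = 6:10)
length(dt)       # 2 columns in use
truelength(dt)   # e.g. 1026: pre-allocated slots for adding columns by reference
dt[, z := 1L]    # the new column fills one of those slots; x and y aren't copied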
There shouldn't be a huge difference in the time it takes to shallow copy unless you have a great many columns. Here's a small benchmark on 5000 columns with 1e4 rows:
require(data.table) # 1.8.11
require(dplyr)      # latest commit from github
dt <- as.data.table(lapply(1:5e3, function(x) sample(1e4)))
ans1 <- sapply(1:1e2, function(x) {
  dd <- copy(dt) # so as to create the new column each time
  system.time(set(dd, i = NULL, j = "V1001", value = 1L))['elapsed']
  # or equivalently: dd[, V1001 := 1L]
})
df <- tbl_df(as.data.frame(dt))
ans2 <- sapply(1:1e2, function(x) {
  system.time(mutate(df, V1001 = 1L))['elapsed']
})
> summary(ans1) # data.table
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00000 0.00000 0.00100 0.00061 0.00100 0.00100
> summary(ans2) # dplyr
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.03800 0.03900 0.03900 0.04178 0.04100 0.07900
You can see the difference in the mean time here (0.00061 s vs 0.04178 s).
B) Modifying an existing column:
df2 <- mutate(df, y=1L)
changes(df, df2)
# Changed variables:
# old new
# y 0x105e5a850 0x105e590e0
It tells you that y has been changed - a copy of column y has been made. It had to create a new memory location to change the values of y, because it was pointing to the same location as df's y before.
However, since data.table modifies in place, there'll be no copy made in case of (B): it modifies the column in place in the existing table. So you should see a performance difference if you are modifying existing columns.
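A minimal data.table sketch of case (B) (the printed address is illustrative):
dt <- data.table(x = 1:5, y = 6:10)
address(dt$y)   # e.g. "0x105e5a850"
dt[, y := 1L]   # modified by reference
address(dt$y)   # same address as before: no copy was made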
This is one of the fundamental differences in the philosophies between the two packages. dplyr doesn't like modifying in place and therefore trades off by copying when modifying existing columns.
And because of this, it wouldn't be possible to change the values of certain rows of a particular column of a data.frame without a deep copy. That is:
DT[x >= 5L, y := 1L] # y is an existing column
This can't be done without an entire copy of the data.frame using base R or dplyr, to my knowledge.
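For comparison, a rough sketch of how you'd express the same update without data.table (both routes end up with a modified copy of y rather than editing it in place):
df$y[df$x >= 5L] <- 1L                           # base R: y is copied on modification
df2 <- mutate(df, y = ifelse(x >= 5L, 1L, y))    # dplyr: a brand new y column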
Also, consider a 2-column dataset of size 20GB (two columns of 10GB each) on a machine with 32GB RAM. The data.table philosophy is to provide a way to change a subset of those 10GB columns by reference, without copying even a single column once. A copy of one column alone would need an extra 10GB and may fail with an out-of-memory error, let alone be fast or not. This concept (:=) is analogous to UPDATE in SQL.
To understand what's going on, you need to understand what exactly is being copied. A data frame is actually quite cheap to copy because it's basically just an array of pointers to the columns. Doing a shallow copy of a data frame is very cheap, because you just have to copy those pointers.
However, most base R functions do a deep copy. So when you do:
df <- data.frame(x = 1:10, y = 1:10)
transform(df, z = x + y)
not only does R copy the data frame, it actually copies each individual column. dplyr provides the changes() function to make this easier to see. For each column in the data frame, it displays the memory location where that column lives. If it's changed, then the complete column has been copied:
df2 <- transform(df, z = x + y)
changes(df, df2)
#> Changed variables:
#> old new
#> x 0x7fb19adcd378 0x7fb19ab9bcb8
#> y 0x7fb19adcd3d0 0x7fb19ab9bd10
#> z 0x7fb19ab9bd68
#>
#> Changed attributes:
#> old new
#> names 0x7fb19adcce98 0x7fb1944e4558
#> row.names 0x7fb19ab2bd10 0x7fb19ab2bf20
#> class 0x7fb19ad5d208 0x7fb19ab51b28
If we do the same thing in dplyr, the original columns aren't copied:
df3 <- dplyr::mutate(df, z = x + y)
changes(df, df3)
#> Changed variables:
#> old new
#> z 0x7fb19adcd060
#>
#> Changed attributes:
#> old new
#> names 0x7fb19adcce98 0x7fb1944e8b18
#> row.names 0x7fb19ab9c0d8 0x7fb19ab9c340
#> class 0x7fb19ad5d208 0x7fb19ad69408
This makes dplyr much faster than base R.
Data.table is a little bit faster again, because it allows you to modify the data table in place - it doesn't even have to copy the pointers to the columns. I think not modifying in place makes dplyr a little easier to understand (because it preserves the usual R semantics), at the cost of being a little slower (but the cost grows with the number of columns, not the number of rows).
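For completeness, a sketch of the same operation done by reference with data.table (so neither the columns nor the column pointer vector are copied):
library(data.table)
dt <- as.data.table(df)   # one-off conversion (this step does copy)
dt[, z := x + y]          # adds z in place; x and y are untouched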