What I really like about data.table is the := idiom for changing the table by reference, without the need for costly copies. From what I understand, this is one of the aspects that makes data.table so ultrafast compared to other methods.
Now, I started playing around with the dplyr package, which seems to be equally performant. But since results still have to be assigned using the <- operator, I was expecting a performance drain at this level. However, there seems to be none.
As an example:
library(dplyr)
library(data.table)
library(Lahman)
library(microbenchmark)
library(ggplot2)
df <- Batting[ c("yearID", "teamID", "G_batting") ]
mb <- microbenchmark(
  dplyr = {
    tb <- tbl_df( df )
    tb <- tb %>%
      group_by( yearID, teamID ) %>%
      mutate( G_batting = max(G_batting) )
  },
  data.table = {
    dt <- as.data.table( df )
    dt[ , G_batting := max(G_batting), by = list( yearID, teamID ) ]
  },
  times = 500
)
qplot( data = mb, x = expr, y = time * 1E-6, geom = "boxplot", ylab="time [ms]", xlab = "approach" )
I am just wondering how this is possible. Or is there a conceptual mistake in the way I benchmark? Is my understanding of <- wrong?
Nice question. In general, I'd benchmark on a data size that's big enough not to fit (almost) entirely in the cache. Have a look here under "initial setup". It really isn't meaningful to compare tools developed for (in-memory) big data on tasks that run in milliseconds. We are planning to benchmark on relatively bigger data in the future.
Additionally, if your intent is to find out whether mutate performs a copy, then all you have to do is check the addresses before and after (this can be done using .Internal(inspect(.)) in base R, or using the function changes() in dplyr).
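For example, a minimal base R sketch of that address check (the printed addresses are illustrative and will differ on your machine):
df <- data.frame(x = 1:5, y = 6:10)
.Internal(inspect(df))   # note the addresses of the x and y vectors
df$y <- df$y + 1L        # replaces y with a newly allocated vector
.Internal(inspect(df))   # y now lives at a new address; x is unchanged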
There are two different things to be checked here. A) creating a new column, and B) modifying an existing column.
A) Creating a new column:
require(dplyr)
require(data.table)
df <- tbl_df(data.frame(x=1:5, y=6:10))
df2 <- mutate(df, z=1L)
changes(df, df2)
# Changed variables:
# old new
# z 0x105ec36d0
It tells you that there are no changes in the addresses of x and y, and points out z, which we just added. What's happening here? dplyr shallow copies the data.frame and then adds the new column. A shallow copy, as opposed to a deep copy, just copies the vector of column pointers, not the data itself. Therefore it should be fast. Basically, df2 is created with 3 columns, where the first two columns point to the same address locations as those of df and the 3rd column was just created.
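You can check this directly (a small sketch using address() from data.table, which is already loaded above):
address(df$x) == address(df2$x)   # TRUE: both point at the same vector
address(df$y) == address(df2$y)   # TRUE: y wasn't copied either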
On the other hand, data.table doesn't have to shallow copy at all, as it modifies the column by reference (in place). data.table also (cleverly) over-allocates a list of column vectors, which allows fast adding of (new) columns by reference.
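You can see that over-allocation with truelength() (a sketch; the exact number depends on the datatable.alloccol option, which defaults to 1024 extra column pointer slots):
dt <- data.table(x = 1:5, y = 6:10)
length(dt)       # 2 columns in use
truelength(dt)   # e.g. 1026: pre-allocated slots for adding columns by reference
dt[, z := 1L]    # the new column fills one of those slots; x and y aren't copied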
There shouldn't be a huge difference in the time it takes to shallow copy unless you have a great many columns. Here's a small benchmark on 5000 columns with 1e4 rows:
require(data.table) # 1.8.11
require(dplyr)      # latest commit from github
dt <- as.data.table(lapply(1:5e3, function(x) sample(1e4)))
ans1 <- sapply(1:1e2, function(x) {
  dd <- copy(dt) # so as to create the new column each time
  system.time(set(dd, i = NULL, j = "V1001", value = 1L))['elapsed']
  # or equivalently: dd[, V1001 := 1L]
})
df <- tbl_df(as.data.frame(dt))
ans2 <- sapply(1:1e2, function(x) {
  system.time(mutate(df, V1001 = 1L))['elapsed']
})
> summary(ans1) # data.table
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00000 0.00000 0.00100 0.00061 0.00100 0.00100
> summary(ans2) # dplyr
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.03800 0.03900 0.03900 0.04178 0.04100 0.07900
You can see the difference in the mean time here (0.00061 s vs 0.04178 s).
B) Modifying an existing column:
df2 <- mutate(df, y=1L)
changes(df, df2)
# Changed variables:
# old new
# y 0x105e5a850 0x105e590e0
It tells you that y has been changed - a copy of column y has been made. It had to create a new memory location to change the values of y, because it was pointing to the same location as df's y before.
However, since data.table modifies in place, there'll be no copy made in case of (B): it modifies the column in place in the existing table. So you should see a performance difference if you are modifying existing columns.
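A minimal data.table sketch of case (B) (the printed address is illustrative):
dt <- data.table(x = 1:5, y = 6:10)
address(dt$y)   # e.g. "0x105e5a850"
dt[, y := 1L]   # modified by reference
address(dt$y)   # same address as before: no copy was made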
This is one of the fundamental differences in the philosophies between the two packages. dplyr doesn't like modifying in place and therefore trades off by copying when modifying existing columns.
And because of this, it wouldn't be possible to change the values of certain rows of a particular column of a data.frame without a deep copy. That is:
DT[x >= 5L, y := 1L] # y is an existing column
This can't be done without an entire copy of the data.frame using base R or dplyr, to my knowledge.
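For comparison, a rough sketch of how you'd express the same update without data.table (both routes end up with a modified copy of y rather than editing it in place):
df$y[df$x >= 5L] <- 1L                           # base R: y is copied on modification
df2 <- mutate(df, y = ifelse(x >= 5L, 1L, y))    # dplyr: a brand new y column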
Also, consider a 2-column dataset of size 20GB (two columns of 10GB each) on a machine with 32GB RAM. The data.table philosophy is to provide a way to change a subset of those 10GB columns by reference, without copying even a single column once. A copy of one column alone would need an extra 10GB and may fail with an out-of-memory error, let alone be fast or not. This concept (:=) is analogous to UPDATE in SQL.
To understand what's going on, you need to understand what exactly is being copied. A data frame is actually quite cheap to copy because it's basically just an array of pointers to the columns. Doing a shallow copy of a data frame is very cheap, because you just have to copy those pointers.
However, most base R functions do a deep copy. So when you do:
df <- data.frame(x = 1:10, y = 1:10)
transform(df, z = x + y)
not only does R copy the data frame, it actually copies each individual column. dplyr provides the changes() function to make this easier to see. For each column in the data frame, it displays the memory location where that column lives. If it's changed, then the complete column has been copied:
df2 <- transform(df, z = x + y)
changes(df, df2)
#> Changed variables:
#> old new
#> x 0x7fb19adcd378 0x7fb19ab9bcb8
#> y 0x7fb19adcd3d0 0x7fb19ab9bd10
#> z 0x7fb19ab9bd68
#>
#> Changed attributes:
#> old new
#> names 0x7fb19adcce98 0x7fb1944e4558
#> row.names 0x7fb19ab2bd10 0x7fb19ab2bf20
#> class 0x7fb19ad5d208 0x7fb19ab51b28
If we do the same thing in dplyr, the original columns aren't copied:
df3 <- dplyr::mutate(df, z = x + y)
changes(df, df3)
#> Changed variables:
#> old new
#> z 0x7fb19adcd060
#>
#> Changed attributes:
#> old new
#> names 0x7fb19adcce98 0x7fb1944e8b18
#> row.names 0x7fb19ab9c0d8 0x7fb19ab9c340
#> class 0x7fb19ad5d208 0x7fb19ad69408
This makes dplyr much faster than base R.
Data.table is a little bit faster again, because it allows you to modify the data table in place - it doesn't even have to copy the pointers to the columns. I think not modifying in place makes dplyr a little easier to understand (because it preserves the usual R semantics), at the cost of being a little slower (but the cost grows with the number of columns, not the number of rows).
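For completeness, a sketch of the same operation done by reference with data.table (so neither the columns nor the column pointer vector are copied):
library(data.table)
dt <- as.data.table(df)   # one-off conversion (this step does copy)
dt[, z := x + y]          # adds z in place; x and y are untouched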