If I use dplyr syntax on top of a data.table, do I get all the speed benefits of data.table while still using the syntax of dplyr? In other words, do I misuse the data.table if I query it with dplyr syntax? Or do I need to use pure data.table syntax to harness all of its power?
Thanks in advance for any advice. Code Example:
library(data.table)
library(dplyr)

diamondsDT <- data.table(ggplot2::diamonds)
setkey(diamondsDT, cut)

diamondsDT %>%
  filter(cut != "Fair") %>%
  group_by(cut) %>%
  summarize(AvgPrice = mean(price),
            MedianPrice = as.numeric(median(price)),
            Count = n()) %>%
  arrange(desc(Count))
Results:
#         cut AvgPrice MedianPrice Count
# 1     Ideal 3457.542      1810.0 21551
# 2   Premium 4584.258      3185.0 13791
# 3 Very Good 3981.760      2648.0 12082
# 4      Good 3928.864      3050.5  4906
Here is the data.table equivalent I came up with. I am not sure whether it complies with good data.table practice. But I wonder whether the code is really more efficient than the dplyr syntax behind the scenes:
diamondsDT[cut != "Fair"
           ][, .(AvgPrice = mean(price),
                 MedianPrice = as.numeric(median(price)),
                 Count = .N),
             by = cut
           ][order(-Count)]
Each dplyr verb must do some work to convert dplyr syntax to data.table syntax. This takes time proportional to the complexity of the input code, not the input data, so it should be a negligible overhead for large datasets.
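As an aside, you can actually inspect that translation yourself. This is a minimal sketch assuming the dtplyr package (version >= 1.0, the current home of dplyr's data.table backend, which postdates parts of this thread); show_query() prints the single data.table call the pipeline is converted into:

# Sketch, assuming dtplyr >= 1.0 is installed
library(data.table)
library(dplyr)
library(dtplyr)

diamondsDT <- data.table(ggplot2::diamonds)

# lazy_dt() records the verbs; show_query() prints the translated data.table call
diamondsDT %>%
  lazy_dt() %>%
  filter(cut != "Fair") %>%
  group_by(cut) %>%
  summarise(AvgPrice = mean(price)) %>%
  show_query()
# Roughly: `_DT1`[cut != "Fair", .(AvgPrice = mean(price)), keyby = .(cut)]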
data.table uses shorter syntax than dplyr, but it is often more nuanced and complex. dplyr uses a pipe operator, which is more intuitive for beginners to read and debug. Moreover, many other libraries use pipe operators, such as ggplot2 and tidyr.
Benchmarks have shown data.table roughly 6x faster on some tasks. Another (unverified) benchmark has data.table 75% faster on larger versions of a group/apply/sort, while dplyr was 40% faster on the smaller ones (another SO question from comments, thanks danas).
The tidyverse, for example, emphasizes readability and flexibility, which is great when I need to write scalable code that others can easily read. data.table, on the other hand, is lightning fast and very concise, so you can develop quickly and run super fast code, even when datasets get fairly large.
There is no straightforward/simple answer because the philosophies of both these packages differ in certain aspects. So some compromises are unavoidable. Here are some of the concerns you may need to address/consider.
Operations involving i (== filter() and slice() in dplyr)

Assume DT with, say, 10 columns. Consider these data.table expressions:
DT[a > 1, .N]                    ## --- (1)
DT[a > 1, mean(b), by = .(c, d)] ## --- (2)
(1) gives the number of rows in DT where column a > 1. (2) returns mean(b) grouped by c,d for the same expression in i as (1).
Commonly used dplyr expressions would be:
DT %>% filter(a > 1) %>% summarise(n())                        ## --- (3)
DT %>% filter(a > 1) %>% group_by(c, d) %>% summarise(mean(b)) ## --- (4)
Clearly, the data.table code is shorter. In addition, it is also more memory efficient (1). Why? Because in both (3) and (4), filter() returns rows for all 10 columns first, when in (3) we just need the number of rows, and in (4) we just need columns b, c, d for the successive operations. To overcome this, we have to select() columns a priori:
DT %>% select(a) %>% filter(a > 1) %>% summarise(n())                                 ## --- (5)
DT %>% select(a, b, c, d) %>% filter(a > 1) %>% group_by(c, d) %>% summarise(mean(b)) ## --- (6)
It is essential to highlight a major philosophical difference between the two packages:
In data.table, we like to keep these related operations together, and that allows us to look at the j-expression (from the same function call) and realise there's no need for any columns in (1). The expression in i gets computed, and .N is just the sum of that logical vector, which gives the number of rows; the entire subset is never realised. In (2), just the columns b, c, d are materialised in the subset; the other columns are ignored.

But in dplyr, the philosophy is to have a function do precisely one thing well. There is (at least currently) no way to tell whether the operation after filter() needs all those columns we filtered. You'll need to think ahead if you want to perform such tasks efficiently. I personally find it counter-intuitive in this case.
Note that in (5) and (6), we still subset column a, which we don't require. But I'm not sure how to avoid that. If the filter() function had an argument to select the columns to return, we could avoid this issue, but then the function would not do just one task (which is also a dplyr design choice).
dplyr will never update by reference. This is another huge (philosophical) difference between the two packages.
For example, in data.table you can do:
DT[a %in% some_vals, a := NA]
which updates column a by reference on just those rows that satisfy the condition. At the moment, dplyr deep copies the entire data.table internally to add a new column. @BrodieG already mentioned this in his answer.
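For comparison, here is a dplyr sketch of the same update on a small made-up table (DT and some_vals here are hypothetical stand-ins, not the objects above); unlike :=, mutate() builds a modified copy rather than changing the data in place:

# Sketch: the dplyr way produces a new, modified copy of the data
library(dplyr)

DT <- data.frame(a = 1:5, b = letters[1:5])  # made-up example data
some_vals <- c(2L, 4L)

DT <- DT %>%
  mutate(a = replace(a, a %in% some_vals, NA))  # base::replace() sets NA at the matching positions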
But the deep copy can be replaced by a shallow copy when FR #617 is implemented. Also relevant: dplyr FR #614. Note that, even then, the column you modify will always be copied (and therefore be a tad slower / less memory efficient). There will be no way to update columns by reference.
In data.table, you can aggregate while joining, which is more straightforward to understand and is memory efficient, since the intermediate join result is never materialised. Check this post for an example. You can't (at the moment?) do that using dplyr's data.table/data.frame syntax.
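For illustration (not the post linked above), here is a minimal sketch with made-up tables, using by = .EACHI to aggregate per row of i during the join (the on= argument assumes data.table >= 1.9.6):

library(data.table)

# Hypothetical tables: sales per (id, value), and a lookup of ids of interest
sales  <- data.table(id = c(1L, 1L, 2L, 3L), value = c(10, 20, 5, 7))
lookup <- data.table(id = c(1L, 2L))

# Join on id and aggregate for each row of `lookup` in the same call;
# the full join result is never materialised.
sales[lookup, .(total = sum(value)), on = "id", by = .EACHI]
#    id total
# 1:  1    30
# 2:  2     5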
data.table's rolling joins feature is not supported in dplyr's syntax either.
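A quick sketch of what a rolling join looks like, with hypothetical quotes/trades tables; roll = TRUE carries the last observation forward:

library(data.table)

# Hypothetical tables: quotes known at certain times, trades at other times
quotes <- data.table(time = c(1, 5, 10), price = c(100, 101, 103))
trades <- data.table(time = c(2, 7, 12))

# For each trade time, pick the most recent quote at or before it
quotes[trades, on = "time", roll = TRUE]
#    time price
# 1:    2   100
# 2:    7   101
# 3:   12   103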
We recently implemented overlap joins in data.table to join over interval ranges (here's an example). It is a separate function, foverlaps(), at the moment, and therefore it could be used with the pipe operators (magrittr / pipeR? - never tried it myself).
But ultimately, our goal is to integrate it into [.data.table so that we can harvest the other features like grouping, aggregating while joining, etc., which will have the same limitations outlined above.
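Here is a minimal sketch of the standalone foverlaps() usage mentioned above, on made-up interval tables; the second table must be keyed on its interval columns:

library(data.table)

# Hypothetical interval tables: which query ranges overlap which subjects?
query   <- data.table(start = c(1L, 6L),  end = c(4L, 9L))
subject <- data.table(start = c(3L, 10L), end = c(7L, 12L))

# foverlaps() requires the second table to be keyed on (start, end)
setkey(subject, start, end)

# each row of `query` paired with the overlapping rows of `subject`
foverlaps(query, subject, type = "any")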
Since 1.9.4, data.table implements automatic indexing using secondary keys for fast binary-search-based subsets on regular R syntax. For example, DT[x == 1] and DT[x %in% some_vals] will automatically create an index on the first run, which will then be used on successive subsets of the same column to subset quickly using binary search. This feature will continue to evolve. Check this gist for a short overview of this feature.
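A small sketch of what that looks like in practice; indices() reports the secondary indices attached to the table, and the exact verbose output may vary by data.table version:

library(data.table)

DT <- data.table(x = sample(1e6), y = runif(1e6))

indices(DT)    # NULL: no index yet
DT[x == 500L]  # first run creates an index on x automatically
indices(DT)    # "x": later subsets on x reuse it via binary search

# verbose = TRUE reports whether the index was used for the subset
DT[x == 500L, verbose = TRUE]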
From the way filter() is implemented for data.tables, it doesn't take advantage of this feature.
A dplyr feature is that it also provides an interface to databases using the same syntax, which data.table doesn't at the moment.
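A minimal sketch of that database interface, assuming the DBI, RSQLite, and dbplyr packages are available; the same verbs are translated to SQL and only executed when the result is collected:

library(dplyr)
library(DBI)

# In-memory SQLite database with a copy of the diamonds data
con <- dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, ggplot2::diamonds, "diamonds")

tbl(con, "diamonds") %>%
  filter(cut != "Fair") %>%
  group_by(cut) %>%
  summarise(AvgPrice = mean(price, na.rm = TRUE)) %>%
  show_query()   # prints the generated SQL
# collect() would pull the result back into R as a data frame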
So, you will have to weigh these (and probably other) points and decide based on whether these trade-offs are acceptable to you.
HTH
(1) Note that being memory efficient directly impacts speed (especially as data gets larger), as the bottleneck in most cases is moving the data from main memory onto cache (and making use of data in cache as much as possible - reduce cache misses - so as to reduce accessing main memory). Not going into details here.
Just try it.
library(rbenchmark)
library(dplyr)
library(data.table)

benchmark(
  dplyr = diamondsDT %>%
    filter(cut != "Fair") %>%
    group_by(cut) %>%
    summarize(AvgPrice = mean(price),
              MedianPrice = as.numeric(median(price)),
              Count = n()) %>%
    arrange(desc(Count)),
  data.table = diamondsDT[cut != "Fair",
                          list(AvgPrice = mean(price),
                               MedianPrice = as.numeric(median(price)),
                               Count = .N),
                          by = cut][order(-Count)]
)[1:4]
On this problem it seems data.table is 2.4x faster than dplyr using data.table:
        test replications elapsed relative
2 data.table          100    2.39    1.000
1      dplyr          100    5.77    2.414
Revised based on Polymerase's comment.