Should I invest my learning effort for data wrangling in R, specifically between dplyr, dtplyr and data.table?

I use dplyr mostly, but when the data is too big for that I will use data.table, which is a rare occurrence. So now that dtplyr v1.0 is out as an interface for data.table, on the surface it seems like I never need to worry about using the data.table interface ever again.

So what are the most useful features or aspects of data.table that cannot be done using dtplyr at the moment, and that likely will never be done with dtplyr?
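(For context, a minimal sketch of how dtplyr is meant to be used; lazy_dt() and show_query() are part of dtplyr's actual API, while the data here is just a toy example.)

library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

dt <- lazy_dt(data.table(x = 1:5, g = c("a", "a", "b", "b", "b")))

dt %>%
  group_by(g) %>%
  summarise(total = sum(x)) %>%
  show_query()   # prints the data.table code that dtplyr generates
# To get the result, end the pipeline with as.data.table(), as_tibble() or collect().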
On its face, dplyr with the benefits of data.table makes it sound like dtplyr will overtake dplyr. Will there be any reason to use dplyr once dtplyr has fully matured?
Note: I am not asking about dplyr vs data.table (as in "data.table vs dplyr: can one do something well the other can't or does poorly?"), but given that one is preferred over the other for a particular problem, why wouldn't dtplyr be the tool to use?
I will try to give my best guidance, but it's not easy because one needs to be familiar with all of {data.table}, {dplyr}, {dtplyr} and also {base R}. I use {data.table} and many {tidy-world} packages (except {dplyr}). I love both, though I prefer the syntax of data.table to dplyr's. I hope all tidy-world packages would use {dtplyr} or {data.table} as a backend whenever it's necessary.

As with any other translation (think dplyr-to-sparklyr/SQL), there are things that can or cannot be translated, at least for now. Maybe one day {dtplyr} can make the translation 100% complete, who knows. The list below is not exhaustive, nor is it 100% correct, as I will try my best to answer based on my knowledge of the related topics/packages/issues/etc.

Importantly, for those answers that are not entirely accurate, I hope they give you some guidance about which aspects of {data.table} you should pay attention to, so you can compare them to {dtplyr} and find out the answers by yourself. Don't take these answers for granted.

And I hope this post can be used as one of the resources for all {dplyr}, {data.table} or {dtplyr} users/creators for discussion and collaboration, and make #RStats even better.
{data.table} is not only used for fast & memory efficient operations. There are many people, myself included, who prefer its elegant syntax. It also includes other fast operations, like the time-series rolling family (e.g. frollapply) written in C. It can be used with any functions, including those from the tidyverse. I use {data.table} + {purrr} a lot!
This can be easily translated:
library(data.table)
library(dplyr)
library(ggplot2)   # for the diamonds dataset
data <- data.table(diamonds)
# dplyr
diamonds %>%
  filter(cut != "Fair") %>%
  group_by(cut) %>%
  summarize(
    avg_price    = mean(price),
    median_price = as.numeric(median(price)),
    count        = n()
  ) %>%
  arrange(desc(count))
# data.table
data[cut != 'Fair',
     by = cut,
     .(avg_price    = mean(price),
       median_price = as.numeric(median(price)),
       count        = .N)
  ][order(-count)]
{data.table} is very fast & memory efficient because (almost?) everything is built from the ground up in C, with the key concepts of update-by-reference, keys (think SQL), and relentless optimization everywhere in the package (e.g. fifelse, fread/fwrite, the radix sort order adopted by base R), while making sure the syntax is concise and consistent. That's why I think it's elegant.
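As a tiny illustration of those optimised primitives (fifelse, fread and fwrite are real data.table functions; the file name below is made up):

library(data.table)
x <- runif(1e7)
y <- fifelse(x > 0.5, "high", "low")      # fast, type-stable alternative to ifelse()
fwrite(data.table(x, y), "example.csv")   # multi-threaded CSV writer
dt <- fread("example.csv")                # multi-threaded CSV reader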
From Introduction to data.table, the main data manipulation operations such as subset, group, update, join, etc. are kept together for:

- concise & consistent syntax...
- performing analysis fluidly without the cognitive burden of having to map each operation...
- automatically optimising operations internally, and very effectively, by knowing precisely the data required for each operation, leading to very fast and memory efficient code
Taking the last point as an example:
# Calculate the average arrival and departure delay for all flights
# with "JFK" as the origin airport in the month of June.
flights[origin == 'JFK' & month == 6L,
        .(m_arr = mean(arr_delay), m_dep = mean(dep_delay))]
We first subset in i to find matching row indices where origin airport equals "JFK", and month equals 6L. We do not subset the entire data.table corresponding to those rows yet.
Now, we look at j and find that it uses only two columns. And what we have to do is to compute their mean(). Therefore we subset just those columns corresponding to the matching rows, and compute their mean().
Because the three main components of the query (i, j and by) are together inside [...], data.table can see all three and optimise the query altogether before evaluation, not each separately. We are able to therefore avoid the entire subset (i.e., subsetting the columns besides arr_delay and dep_delay), for both speed and memory efficiency.
Given that, to reap the benefits of {data.table}, the translation by {dtplyr} has to be correct in those respects. The more complex the operations, the harder the translations. For simple operations like the above, it certainly can be easily translated. For complex ones, or those not supported by {dtplyr}, you have to find out yourself as mentioned above: one has to compare the translated syntax, benchmark it, and be familiar with the related packages.

For complex or unsupported operations, I might be able to provide some examples below. Again, I'm just trying my best. Be gentle on me.
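One practical way to check a translation yourself is to inspect the generated code (lazy_dt() and show_query() are dtplyr's actual API; nycflights13::flights is used here as a stand-in for the vignette's flights data):

library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)
library(nycflights13)

lazy_dt(as.data.table(flights)) %>%
  filter(origin == 'JFK', month == 6L) %>%
  summarise(m_arr = mean(arr_delay), m_dep = mean(dep_delay)) %>%
  show_query()   # prints the data.table call dtplyr would run, without evaluating it
# Compare that printed call with the hand-written data.table version above,
# and benchmark both if the difference matters for your data.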
I won't go into the intro/details, but here are some links:
Main resource: Reference semantics
More details: Understanding exactly when a data.table is a reference to (vs a copy of) another data.table
Update-by-reference is, in my opinion, the most important feature of {data.table}, and that's what makes it so fast & memory efficient. dplyr::mutate doesn't support it by default. As I'm not familiar with {dtplyr}, I'm not sure how much of it and which operations can or cannot be supported by {dtplyr}. As mentioned above, it also depends on the complexity of the operations, which in turn affects the translations.
There are two ways to use update-by-reference in {data.table}:

- the assignment operator of {data.table}, :=
- the set-family: set, setnames, setcolorder, setkey, setDT, fsetdiff, and many more
:= is more commonly used than set. For complex and large datasets, update-by-reference is the key to top speed & memory efficiency. The easy way to think about it (not 100% accurate, as the details are much more complicated, involving hard/shallow copies and many other factors): say you're dealing with a large dataset of 10GB, with 10 columns of 1GB each. To manipulate one column, you only need to deal with that 1GB.
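As a tiny illustration (a toy example, not one of the benchmarks below), data.table::address() shows that := adds a column in place without copying the table:

library(data.table)
dt <- data.table(x = 1:5, y = letters[1:5])
address(dt)        # memory address before
dt[, z := x * 2]   # update-by-reference: adds a column in place
address(dt)        # same address afterwards, so no copy of dt was made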
The key point is: with update-by-reference, you only need to deal with the data required. That's why when using {data.table}, especially when dealing with large datasets, we use update-by-reference all the time whenever possible. For example, manipulating a large modeling dataset:
# Manipulating list columns
library(data.table)
library(purrr)
library(ggplot2)

df <- purrr::map_dfr(1:1e5, ~ iris)
dt <- data.table(df)

# data.table
dt[, by = Species, .(data = list(.SD))   # `.()` is shorthand for `list()`
  ][, model   := map(data, ~ lm(Sepal.Length ~ Sepal.Width, data = .))
  ][, summary := map(model, summary)
  ][, plot    := map(data, ~ ggplot(., aes(Sepal.Length, Sepal.Width)) +
                       geom_point())]
# dplyr
library(dplyr)
library(tidyr)   # for nest()

df %>%
  group_by(Species) %>%
  nest() %>%
  mutate(
    model   = map(data, ~ lm(Sepal.Length ~ Sepal.Width, data = .)),
    summary = map(model, summary),
    plot    = map(data, ~ ggplot(., aes(Sepal.Length, Sepal.Width)) +
                    geom_point())
  )
The nesting operation list(.SD) might not be supported by {dtplyr}, as tidyverse users use tidyr::nest? So I'm not sure if the subsequent operations can be translated, as {data.table}'s way is faster & uses less memory.

NOTE: data.table's result is in milliseconds, dplyr's in minutes:
df <- purrr::map_dfr(1:1e5, ~ iris)
dt <- copy(data.table(df))
bench::mark(
check = FALSE,
dt[, by = Species, .(data = list(.SD))],
df %>% group_by(Species) %>% nest()
)
# # A tibble: 2 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
# 1 dt[, by = Species, .(data = list(.SD))] 361.94ms 402.04ms 2.49 705.8MB 1.24 2 1
# 2 df %>% group_by(Species) %>% nest() 6.85m 6.85m 0.00243 1.4GB 2.28 1 937
# # ... with 5 more variables: total_time <bch:tm>, result <list>, memory <list>, time <list>,
# # gc <list>
There are many use cases of update-by-reference, and even {data.table} users won't use the advanced version of it all the time, as it requires more code. Whether {dtplyr} supports these out of the box, you have to find out yourself.
Main resource: Elegantly assigning multiple columns in data.table with lapply()
This involves either the more commonly used := or set.
dt <- data.table(matrix(runif(10000), nrow = 100))   # columns V1..V100

# A few variants

# set(): get(col) would not find the column here, so index the table directly
for (col in paste0('V', 20:100))
  set(dt, j = col, value = sqrt(dt[[col]]))

# `:=` with get(): evaluated inside the data.table, so get(col) finds the column
for (col in paste0('V', 20:100))
  dt[, (col) := sqrt(get(col))]

# I prefer `purrr::map` to `for`
library(purrr)
map(paste0('V', 20:100), ~ dt[, (.) := sqrt(get(.))])
As per the creator of {data.table}, Matt Dowle:
(Note it may be more common to loop set over a large number of rows than a large number of columns.)
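A rough sketch of what looping set over rows can look like (my own illustration; the sizes are arbitrary): pre-allocate the rows once, then fill them in place instead of growing the table with rbind() in a loop.

library(data.table)
dt <- data.table(id = integer(1e5), value = numeric(1e5))   # pre-allocated rows
for (i in seq_len(1e4)) {
  set(dt, i = i, j = "id",    value = i)
  set(dt, i = i, j = "value", value = i * 0.5)
}
# Each set() call updates dt in place; no copies are made inside the loop.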
I needed fast joins with relatively large data and similar join patterns recently, so I used the power of update-by-reference instead of normal joins. As they require more code, I wrapped them in a private package with non-standard evaluation for reusability and readability, where I call it setjoin.

I did some benchmarking here: data.table join + update-by-reference + setkey
Summary

# For brevity, only the code for the join operations is shown here. Please refer to the link for details.

# normal_join
x <- y[x, on = 'a']

# update_by_reference
x_2[y_2, on = 'a', c := c]

# setkey_n_update
setkey(x_3, a)[setkey(y_3, a), on = 'a', c := c]
NOTE: dplyr::left_join was also tested; it's the slowest, at ~9,000 ms, and it uses more memory than both {data.table}'s update_by_reference and setkey_n_update, but less memory than {data.table}'s normal_join. It consumed about ~2.0GB of memory. I didn't include it as I want to focus solely on {data.table}.
- setkey + update and update are ~11 and ~6.5 times faster than normal join, respectively
- on the first run, setkey + update is similar to update, as the overhead of setkey largely offsets its own performance gains
- when setkey is not required (e.g. the key is already set), setkey + update is faster than update by ~1.8 times (or faster than normal join by ~11 times)

For performant & memory efficient joins, use either update or setkey + update, where the latter is faster at the cost of more code.
Let's see some pseudocode, for brevity. The logic is the same.
For one or a few columns
a <- data.table(x = ..., y = ..., z = ..., ...)
b <- data.table(x = ..., y = ..., z = ..., ...)
# `update`
a[b, on = .(x), y := y]
a[b, on = .(x), `:=` (y = y, z = z, ...)]
# `setkey + update`
setkey(a, x) [ setkey(b, x), on = .(x), y := y ]
setkey(a, x) [ setkey(b, x), on = .(x), `:=` (y = y, z = z, ...) ]
For many columns
cols <- c('x', 'y', ...)
# `update`
a[b, on = .(x), (cols) := mget( paste0('i.', cols) )]
# `setkey + update`
setkey(a, x) [ setkey(b, x), on = .(x), (cols) := mget( paste0('i.', cols) ) ]
A wrapper for fast & memory efficient joins: when you have many of them with a similar join pattern, wrap them like the setjoin above, using update, with or without setkey:
setjoin(a, b, on = ...) # join all columns
setjoin(a, b, on = ..., select = c('columns_to_be_included', ...))
setjoin(a, b, on = ..., drop = c('columns_to_be_excluded', ...))
# With that, you can even use it with `magrittr` pipe
a %>%
setjoin(...) %>%
setjoin(...)
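setjoin is not an exported function of any package (it lives in my private package), but a minimal sketch of what such a wrapper could look like, covering only the `select` case, might be:

library(data.table)

setjoin <- function(x, y, on, select = NULL) {
  cols <- if (is.null(select)) setdiff(names(y), on) else select
  x[y, on = on, (cols) := mget(paste0("i.", cols))]   # update x by reference
  invisible(x)                                        # return x so it can be piped
}

# usage
a <- data.table(id = 1:3, v = c(10, 20, 30))
b <- data.table(id = 2:3, w = c("b", "c"))
setjoin(a, b, on = "id")   # adds column w to `a` by reference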
With setkey, the argument on can be omitted. It can also be included for readability, especially when collaborating with others.
Other update-by-reference helpers (set, setkey). Related resource: Add a row by reference at the end of a data.table object
These are just some use cases of update-by-reference. There are many more.

As you can see, for advanced usage when dealing with large data, there are many use cases and techniques that rely on update-by-reference. It's not so easy to use in {data.table}, and whether {dtplyr} supports it, you can find out yourself.

I focus on update-by-reference in this post as I think it's the most powerful feature of {data.table} for fast & memory efficient operations. That said, there are many, many other aspects that make it so efficient too, and I think those are not natively supported by {dtplyr}.
What is or isn't supported also depends on the complexity of the operations and whether they involve data.table's native features like update-by-reference or setkey. Whether the translated code is the more efficient one (the one data.table users would write) is another factor (ie the code is translated, but is it the efficient version?). Many things are interconnected.
Other aspects to pay attention to:

- setkey: see Keys and fast binary search based subset
- frollapply: rolling functions, rolling aggregates, sliding window, moving average (see the small sketch after this list)
- the more complex the i, j or by operations (you can use almost any expression in there), the harder the translations, I think, especially when they are combined with update-by-reference, setkey and other native data.table functions like frollapply
- for example, I compare the stringr::str_* family vs base R functions, and I find base R faster to a certain extent, so I use those. The point is, don't keep yourself to only tidyverse or data.table or...; explore other options to get the job done.

Many of these aspects are inter-related with the points mentioned above:

- complexity of operations
- update-by-reference

You can find out if {dtplyr} supports these operations, especially when they are combined.
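Here is the small rolling-family sketch mentioned above (frollmean and frollapply are real data.table functions; the toy data is made up):

library(data.table)
dt <- data.table(day = 1:10, sales = c(3, 5, 2, 8, 7, 4, 6, 9, 1, 5))

# 3-day moving average, implemented in C; the first n-1 values are NA by default
dt[, ma3 := frollmean(sales, n = 3)]

# frollapply takes any function, e.g. a rolling median
dt[, med3 := frollapply(sales, n = 3, FUN = median)]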
Another useful trick: when dealing with small or large datasets during an interactive session, {data.table} really lives up to its promise of reducing programming and compute time tremendously.

Setting a key on a repetitively used variable gives both speed and 'supercharged rownames' (subsetting without specifying the variable name):
dt <- data.table(iris)
setkey(dt, Species)
dt['setosa', do_something(...), ...]
dt['virginica', do_another(...), ...]
dt['setosa', more(...), ...]
# The `by =` name can also be omitted and passed positionally, which is particularly useful during an interactive session.
# This ultimately becomes what I call 'naked' syntax: just type what you want to do, without any placeholders.
# It's simply elegant.
dt['setosa', do_something(...), Species, ...]
If your operations involve only simple ones like in the first example, {dtplyr} can get the job done. For complex or unsupported ones, you can use this guide to compare {dtplyr}'s translations with how seasoned data.table users would code in a fast & memory efficient way using data.table's elegant syntax. Translation doesn't mean it's the most efficient way, as there might be different techniques for different cases of large data. For even larger datasets, you can combine {data.table} with {disk.frame}, {fst}, {drake} and other awesome packages to get the best out of it. There is also {big.data.table}, but it's currently inactive.
I hope it helps everyone. Have a nice day ☺☺