 

What can't I do with dtplyr that I can in data.table

Should I invest my learning effort for data wrangling in R, specifically between dplyr, dtplyr and data.table?

  • I use dplyr mostly, but when the data is too big for that I will use data.table, which is a rare occurrence. So now that dtplyr v1.0 is out as an interface for data.table, on the surface it seems like I never need to worry about using the data.table interface ever again.

  • So what are the most useful features or aspects of data.table that cannot be done using dtplyr at the moment, and that likely will never be done with dtplyr?

  • On the face of it, dplyr with the benefits of data.table makes it sound like dtplyr will overtake dplyr. Will there be any reason to use dplyr once dtplyr has fully matured?

Note: I am not asking about dplyr vs data.table (as in data.table vs dplyr: can one do something well the other can't or does poorly?), but given that one is preferred over the other for a particular problem, why wouldn't dtplyr be the tool to use?

dule arnaux asked Nov 26 '19


1 Answer

I will try to give my best guidance, but it's not easy because one needs to be familiar with all of {data.table}, {dplyr}, {dtplyr} and also base R. I use {data.table} and many tidy-world packages (except {dplyr}). I love both, though I prefer data.table's syntax to dplyr's. I hope all tidy-world packages would use {dtplyr} or {data.table} as a backend whenever necessary.

As with any other translation (think dplyr-to-sparklyr/SQL), there are things that can and cannot be translated, at least for now. Maybe one day {dtplyr} can make the translation 100% complete, who knows. The list below is neither exhaustive nor 100% correct, as I will try my best to answer based on my knowledge of the related topics/packages/issues/etc.

Importantly, for those answers that are not entirely accurate, I hope they give you some guidance about which aspects of {data.table} you should pay attention to; compare them to {dtplyr} and find out the answers yourself. Don't take these answers for granted.

And I hope this post can serve as a resource for all {dplyr}, {data.table} or {dtplyr} users/creators for discussion and collaboration, and make #RStats even better.

{data.table} is not only about fast & memory-efficient operations. Many people, including myself, prefer its elegant syntax. It also includes fast time-series functions such as the rolling-family (i.e. frollapply) written in C. It can be used with any functions, including the tidyverse. I use {data.table} + {purrr} a lot!
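For example, a minimal sketch of the rolling-family functions (toy data and made-up column names; assumes data.table >= 1.12.0, where the froll* family was introduced):

library(data.table)

dt <- data.table(day = 1:10, price = c(10, 12, 11, 13, 15, 14, 16, 18, 17, 19))

# 3-day rolling mean, implemented in C
dt[, roll_avg := frollmean(price, n = 3)]

# frollapply() accepts arbitrary functions, e.g. a rolling median
dt[, roll_med := frollapply(price, n = 3, FUN = median)]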

Complexity of operations

This can be easily translated

library(data.table)
library(dplyr)
library(ggplot2)  # for the `diamonds` dataset
data <- data.table(diamonds)

# dplyr 
diamonds %>%
  filter(cut != "Fair") %>% 
  group_by(cut) %>% 
  summarize(
    avg_price    = mean(price),
    median_price = as.numeric(median(price)),
    count        = n()
  ) %>%
  arrange(desc(count))

# data.table
data[cut != 'Fair', by = cut, .(
    avg_price    = mean(price),
    median_price = as.numeric(median(price)),
    count        = .N
  )][order(-count)]

{data.table} is very fast & memory efficient because (almost?) everything is built from the ground up in C, with the key concepts of update-by-reference, keys (think SQL), and relentless optimization everywhere in the package (i.e. fifelse, fread/fwrite, the radix sort order adopted by base R), while making sure the syntax is concise and consistent; that's why I think it's elegant.
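As a rough illustration of those optimizations (not a rigorous benchmark; timings depend on your machine), fifelse() is a C-level counterpart of base ifelse():

library(data.table)

x <- runif(1e7)

system.time(y1 <- ifelse(x > 0.5, "high", "low"))    # base R
system.time(y2 <- fifelse(x > 0.5, "high", "low"))   # data.table, typically much faster
identical(y1, y2)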

From the Introduction to data.table vignette, the main data manipulation operations such as subset, group, update, join, etc. are kept together in order to allow for:

  • concise & consistent syntax...

  • performing analysis fluidly without the cognitive burden of having to map each operation...

  • automatically optimising operations internally, and very effectively, by knowing precisely the data required for each operation, leading to very fast and memory efficient code

As an example of the last point:

# Calculate the average arrival and departure delay for all flights with “JFK” as the origin airport in the month of June.
flights[origin == 'JFK' & month == 6L,
        .(m_arr = mean(arr_delay), m_dep = mean(dep_delay))]
  • We first subset in i to find matching row indices where origin airport equals "JFK", and month equals 6L. We do not subset the entire data.table corresponding to those rows yet.

  • Now, we look at j and find that it uses only two columns. And what we have to do is to compute their mean(). Therefore we subset just those columns corresponding to the matching rows, and compute their mean().

Because the three main components of the query (i, j and by) are together inside [...], data.table can see all three and optimise the query as a whole before evaluation, not each part separately. We are therefore able to avoid subsetting the entire table (i.e., the columns besides arr_delay and dep_delay), gaining both speed and memory efficiency.

Given that, to reap the benefits of {data.table}, {dtplyr}'s translation has to be correct in these respects. The more complex the operations, the harder the translation. For simple operations like the above, it can certainly be easily translated. For complex ones, or those not supported by {dtplyr}, you have to find out yourself: as mentioned above, one has to compare the translated syntax, benchmark it, and be familiar with the related packages.

For complex operations or unsupported operations, I might be able to provide some examples below. Again, I'm just trying my best. Be gentle on me.

Update-by-reference

I won't go into the intro/details but here are some links

Main resource: Reference semantics

More details: Understanding exactly when a data.table is a reference to (vs a copy of) another data.table

Update-by-reference is, in my opinion, the most important feature of {data.table}, and it's what makes it so fast & memory efficient. dplyr::mutate doesn't support it by default. As I'm not familiar with {dtplyr}, I'm not sure how many and which operations it can support. As mentioned above, it also depends on the complexity of the operations, which in turn affects the translation.

There are two ways to use update-by-reference in {data.table}

  • assignment operator of {data.table} :=

  • set-family: set, setnames, setcolorder, setkey, setDT, fsetdiff, and many more

:= is more commonly used than set. For complex and large datasets, update-by-reference is the key to top speed & memory efficiency. A simple way of thinking about it (not 100% accurate, as the details are much more complicated, involving shallow/deep copies and many other factors): say you're dealing with a large dataset of 10 GB, with 10 columns of 1 GB each. To manipulate one column, you only need to deal with that 1 GB.
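A minimal sketch of the difference (data.table::address() only shows whether the whole object was reallocated, so treat this as illustrative rather than a full picture of the memory behaviour):

library(data.table)

dt <- data.table(a = 1:5, b = letters[1:5])

address(dt)       # memory address before
dt[, a := a * 2]  # := updates the column in place, no copy of dt is made
address(dt)       # same address: the object was not reallocated

# A copy-based workflow (conceptually what dplyr::mutate() does) allocates a new object
dt2 <- copy(dt)[, a := a * 2]
address(dt2)      # different address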

The key point is, with update-by-reference, you only need to deal with the data required. That's why, when using {data.table}, especially with large datasets, we use update-by-reference whenever possible. For example, manipulating a large modeling dataset:

# Manipulating list columns

library(purrr)  # for map()
library(tidyr)  # for nest()

df <- purrr::map_dfr(1:1e5, ~ iris)
dt <- data.table(df)

# data.table
dt [,
    by = Species, .(data   = .( .SD )) ][,  # `.(` shorthand for `list`
    model   := map(data, ~ lm(Sepal.Length ~ Sepal.Width, data = . )) ][,
    summary := map(model, summary) ][,
    plot    := map(data, ~ ggplot( . , aes(Sepal.Length, Sepal.Width)) +
                           geom_point())]

# dplyr
df %>% 
  group_by(Species) %>% 
  nest() %>% 
  mutate(
    model   = map(data, ~ lm(Sepal.Length ~ Sepal.Width, data = . )),
    summary = map(model, summary),
    plot    = map(data, ~ ggplot( . , aes(Sepal.Length, Sepal.Width)) +
                          geom_point())
  )

The nesting operation list(.SD) might not be supported by {dtplyr}, as tidyverse users use tidyr::nest instead. So I'm not sure if the subsequent operations can be translated, given that {data.table}'s way is faster & uses less memory.

NOTE: data.table's result is in "milliseconds", dplyr's in "minutes"

df <- purrr::map_dfr(1:1e5, ~ iris)
dt <- copy(data.table(df))

bench::mark(
  check = FALSE,

  dt[, by = Species, .(data = list(.SD))],
  df %>% group_by(Species) %>% nest()
)
# # A tibble: 2 x 13
#   expression                                   min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
# 1 dt[, by = Species, .(data = list(.SD))] 361.94ms 402.04ms   2.49      705.8MB     1.24     2     1
# 2 df %>% group_by(Species) %>% nest()        6.85m    6.85m   0.00243     1.4GB     2.28     1   937
# # ... with 5 more variables: total_time <bch:tm>, result <list>, memory <list>, time <list>,
# #   gc <list>

There are many use cases of update-by-reference, and even {data.table} users won't use its advanced versions all the time, as they require more code. Whether {dtplyr} supports these out of the box, you have to find out yourself.

Multiple update-by-reference with the same function

Main resource: Elegantly assigning multiple columns in data.table with lapply()

This involves either the more commonly used := or set.

dt <- data.table( matrix(runif(10000), nrow = 100) )

# A few variants

# `set()` evaluates `value` in the calling scope, so reference the column explicitly
for (col in paste0('V', 20:100))
  set(dt, j = col, value = sqrt(dt[[col]]))

for (col in paste0('V', 20:100))
  dt[, (col) := sqrt(get(col))]

# I prefer `purrr::map` to `for`
library(purrr)
map(paste0('V', 20:100), ~ dt[, (.) := sqrt(get(.))])

As per the creator of {data.table}, Matt Dowle:

(Note it may be more common to loop set over a large number of rows than a large number of columns.)
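A toy sketch of that (made-up columns; in practice you would rarely loop cell by cell, but set() keeps the per-call overhead low enough that you can):

library(data.table)

dt <- data.table(id = 1:1e5, value = 0L)

# Each set() call updates cells in place, so even a tight row loop stays cheap
for (i in seq_len(nrow(dt))) {
  set(dt, i = i, j = "value", value = i * 2L)
}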

Join + setkey + update-by-reference

I recently needed fast joins on relatively large data with similar join patterns, so I used the power of update-by-reference instead of normal joins. As they require more code, I wrapped them in a private package with non-standard evaluation for reusability and readability, and I call it setjoin.

I did some benchmarks here: data.table join + update-by-reference + setkey

Summary

# For brevity, only the code for the join operation is shown here. Please refer to the link for details

# Normal_join
x <- y[x, on = 'a']

# update_by_reference
x_2[y_2, on = 'a', c := c]

# setkey_n_update
setkey(x_3, a) [ setkey(y_3, a), on = 'a', c := c ]

NOTE: dplyr::left_join was also tested; it's the slowest at ~9,000 ms, and it uses more memory than both {data.table}'s update_by_reference and setkey_n_update, but less memory than {data.table}'s normal_join. It consumed about ~2.0 GB of memory. I didn't include it as I want to focus solely on {data.table}.

Key findings

  • setkey + update and update are ~11 and ~6.5 times faster than a normal join, respectively
  • on the first join, the performance of setkey + update is similar to that of update, as the overhead of setkey largely offsets its own performance gains
  • on the second and subsequent joins, as setkey is not required, setkey + update is faster than update by ~1.8 times (or faster than a normal join by ~11 times)

[Benchmark plot: normal join vs update vs setkey + update]

Examples

For performant & memory-efficient joins, use either update or setkey + update, where the latter is faster at the cost of more code.

Let's look at some pseudo-code, for brevity. The logic is the same.

For one or a few columns

a <- data.table(x = ..., y = ..., z = ..., ...)
b <- data.table(x = ..., y = ..., z = ..., ...)

# `update` (use the `i.` prefix to pull the columns from `b`)
a[b, on = .(x), y := i.y]
a[b, on = .(x),  `:=` (y = i.y, z = i.z, ...)]
# `setkey + update`
setkey(a, x) [ setkey(b, x), on = .(x), y := i.y ]
setkey(a, x) [ setkey(b, x), on = .(x),  `:=` (y = i.y, z = i.z, ...) ]

For many columns

cols <- c('x', 'y', ...)
# `update`
a[b, on = .(x), (cols) := mget( paste0('i.', cols) )]
# `setkey + update`
setkey(a, x) [ setkey(b, x), on = .(x), (cols) := mget( paste0('i.', cols) ) ]

Wrapper for fast & memory-efficient joins: if you have many of them with a similar join pattern, wrap them like setjoin above, with update, with or without setkey.

setjoin(a, b, on = ...)  # join all columns
setjoin(a, b, on = ..., select = c('columns_to_be_included', ...))
setjoin(a, b, on = ..., drop   = c('columns_to_be_excluded', ...))
# With that, you can even use it with `magrittr` pipe
a %>%
  setjoin(...) %>%
  setjoin(...)

With setkey, the argument on can be omitted. It can also be included for readability, especially when collaborating with others.
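A minimal sketch of that (toy tables with made-up columns):

library(data.table)

a <- data.table(x = 1:5, y = letters[1:5])
b <- data.table(x = 3:7, z = LETTERS[3:7])

setkey(a, x)
setkey(b, x)

a[b, z := i.z]             # `on` omitted: the keys are used for the join
a[b, on = .(x), z := i.z]  # `on` kept for readability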

Large row operations

  • as mentioned above, use set
  • pre-populate your table and use update-by-reference techniques (see the sketch below)
  • subset using key (ie setkey)

Related resource: Add a row by reference at the end of a data.table object
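As a minimal sketch of the "pre-populate, then update by reference" idea (toy columns; a realistic workflow would fill values in batches rather than one cell at a time):

library(data.table)

n  <- 1e5
dt <- data.table(id = integer(n), value = numeric(n))  # allocate all rows up front

# Fill rows in place with set() instead of growing the table with repeated rbind()
for (i in seq_len(n)) {
  set(dt, i = i, j = "id",    value = i)
  set(dt, i = i, j = "value", value = sqrt(i))
}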

Summary of update-by-reference

These are just some use cases of update-by-reference. There are many more.

As you can see, for advanced usage on large data, there are many use cases and techniques that rely on update-by-reference. It's not so easy to use even in {data.table}, and whether {dtplyr} supports it, you can find out yourself.

I focus on update-by-reference in this post as I think it's the most powerful feature of {data.table} for fast & memory-efficient operations. That said, there are many, many other aspects that make it so efficient, and I think some of them are not natively supported by {dtplyr}.

Other key aspects

What is or isn't supported also depends on the complexity of the operations and whether they involve data.table's native features like update-by-reference or setkey. Whether the translated code is the most efficient one (the one that data.table users would write) is another factor (i.e. the code is translated, but is it the efficient version?). Many things are interconnected.

  • setkey. See Keys and fast binary search based subset
  • Secondary indices and auto indexing
  • Using .SD for Data Analysis
  • time-series functions: think frollapply; rolling functions, rolling aggregates, sliding windows, moving averages
  • rolling join, non-equi join, (some) "cross" join (see the sketch after this list)
  • {data.table} has built its foundation on speed & memory efficiency; in the future, it can extend to include many more functions (like how the time-series functions mentioned above were implemented)
  • in general, the more complex the operations on data.table's i, j or by (you can use almost any expression in there), the harder the translation, I think, especially when combined with update-by-reference, setkey, and other native data.table functions like frollapply
  • another point relates to using base R or the tidyverse. I use both data.table + tidyverse (except dplyr/readr/tidyr). For large operations, I often benchmark, for example, the stringr::str_* family vs base R functions, and I find base R functions are faster to a certain extent, so I use those. The point is, don't limit yourself to only the tidyverse or data.table; explore other options to get the job done.
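For the rolling and non-equi joins mentioned above, a minimal sketch of the native syntax (toy data, made-up column names):

library(data.table)

# Non-equi join: assign each trade the price band it falls into
bands  <- data.table(lo = c(0, 10, 20), hi = c(10, 20, 30), band = c("low", "mid", "high"))
trades <- data.table(price = c(3, 12, 25))
trades[bands, on = .(price >= lo, price < hi), band := i.band]

# Rolling join: carry the last observed quote forward to each tick
quotes <- data.table(time = c(1, 5, 10), quote = c(100, 101, 103), key = "time")
ticks  <- data.table(time = c(2, 6, 11), key = "time")
quotes[ticks, roll = TRUE]  # each tick gets the most recent quote at or before it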

Many of these aspects are inter-related with the points mentioned above

  • complexity of operations

  • update-by-reference

You can find out whether {dtplyr} supports these operations, especially when they are combined.

Another useful trick: when dealing with small or large datasets during an interactive session, {data.table} really lives up to its promise of reducing programming and compute time tremendously.

Set the key on a repeatedly used variable for both speed and "supercharged row names" (subsetting without specifying the variable name).

dt <- data.table(iris)
setkey(dt, Species) 

dt['setosa',    do_something(...), ...]
dt['virginica', do_another(...),   ...]
dt['setosa',    more(...),         ...]

# the `by =` keyword can also be omitted (positional), which is particularly useful during an interactive session.
# This ultimately becomes what I call 'naked' syntax: just type what you want to do, without any placeholders.
# It's simply elegant
dt['setosa', do_something(...), Species, ...]

If your operations involve only simple ones like in the first example, {dtplyr} can probably get the job done. For complex or unsupported ones, you can use this guide to compare {dtplyr}'s translations with how seasoned data.table users would write fast & memory-efficient code using data.table's elegant syntax. Translation doesn't mean it's the most efficient way, as there may be different techniques for different cases of large data. For even larger datasets, you can combine {data.table} with {disk.frame}, {fst}, {drake} and other awesome packages to get the best out of it. There is also {big.data.table}, but it's currently inactive.

I hope it helps everyone. Have a nice day ☺☺

kar9222 answered Sep 19 '22