Dangers of mixing [tidyverse] and [data.table] syntax in R?

Tags:

I'm getting some very weird behavior from mixing tidyverse and data.table syntax. For context, I often find myself using tidyverse syntax, and then adding a pipe back to data.table when I need speed vs. when I need code readability. I know Hadley's working on a new package that uses tidyverse syntax with data.table speed, but from what I see, it's still in it's nascent phases, so I haven't been using it.

Anyone care to explain what's going on here? This is very scary for me, as I've probably done these thousands of times without thinking.

library(dplyr); library(data.table)
DT <-
  fread(
    "iso3c  country income
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
"
  )

codes <- c("ALB", "ZMB")

# now, what happens if I use a tidyverse function (distinct) and then
# convert back to data.table?
DT <- distinct(DT) %>% as.data.table()

# this works like normal
DT[iso3c %in% codes]
# iso3c country income
# 1:   ZMB  Zambia   LMIC
# 2:   ALB Albania   UMIC

# now, what happens if I use a different tidyverse function (arrange) 
# and then convert back to data.table?
DT <- DT %>% arrange(iso3c) %>% as.data.table()

# this is wack: (!!!!!!!!!!!!)
DT[iso3c %in% codes]
# iso3c country income
# 1:   ALB Albania   UMIC

# but these work:
DT[(iso3c %in% codes), ]
# iso3c country income
# 1:   ZMB  Zambia   LMIC
# 2:   ALB Albania   UMIC
DT[DT$iso3c %in% codes, ]
# iso3c country income
# 1:   ZMB  Zambia   LMIC
# 2:   ALB Albania   UMIC
DT[DT$iso3c %in% codes]
# iso3c country income
# 1:   ZMB  Zambia   LMIC
# 2:   ALB Albania   UMIC

476

asked Jun 11 '21 15:06

Daycent

2 Answers

I came across the same problem on a few occasions, which led me to avoid mixing dplyr with data.table syntax, as I didn't take the time to find out the reason. So thanks for providing a MRE.

Looks like dplyr::arrange is interfering with data.table auto-indexing :

index will be used when subsetting dataset with == or %in% on a single variable

by default if index for a variable is not present on filtering, it is automatically created and used

indexes are lost if you change the order of data

you can check if you are using index with options(datatable.verbose=TRUE)

If we explicitely set auto-indexing :

library(dplyr); 
library(data.table)

DT <- fread(
"iso3c  country income
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC")
codes <- c("ALB", "ZMB")

options(datatable.auto.index = TRUE)

DT <- distinct(DT) %>%   as.data.table()

# Index creation because %in% is used for the first time
DT[iso3c %in% codes,verbose=T]
#> Creating new index 'iso3c'
#> Creating index iso3c done in ... forder.c received 3 rows and 3 columns
#> forder took 0 sec
#> 0.060s elapsed (0.060s cpu) 
#> Optimized subsetting with index 'iso3c'
#> forder.c received 2 rows and 1 columns
#> forder took 0 sec
#> x is already ordered by these columns, no need to call reorder
#> i.iso3c has same type (character) as x.iso3c. No coercion needed.
#> on= matches existing index, using index
#> Starting bmerge ...
#> bmerge done in 0.000s elapsed (0.000s cpu) 
#> Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu) 
#> Reordering 2 rows after bmerge done in ... forder.c received a vector type 'integer' length 2
#> 0 secs
#>    iso3c country income
#> 1:   ZMB  Zambia   LMIC
#> 2:   ALB Albania   UMIC

# Index mixed up by arrange
DT <- DT %>% arrange(iso3c) %>% as.data.table()

# this is wack because data.table possibly still uses the old index whereas row/references were rearranged:
DT[iso3c %in% codes,verbose=T]
#> Optimized subsetting with index 'iso3c'
#> forder.c received 2 rows and 1 columns
#> forder took 0 sec
#> x is already ordered by these columns, no need to call reorder
#> i.iso3c has same type (character) as x.iso3c. No coercion needed.
#> on= matches existing index, using index
#> Starting bmerge ...
#> bmerge done in 0.000s elapsed (0.000s cpu) 
#> Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)
#>    iso3c country income
#> 1:   ALB Albania   UMIC

# this works because (...) prevents the parser to use auto-index
DT[(iso3c %in% codes)]
#>    iso3c country income
#> 1:   ALB Albania   UMIC
#> 2:   ZMB  Zambia   LMIC

To avoid this problem, you can disable auto-indexing:

library(dplyr); 
library(data.table)

DT <- fread(
"iso3c  country income
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC")
codes <- c("ALB", "ZMB")

options(datatable.auto.index = FALSE) # Disabled

DT <- distinct(DT) %>%   as.data.table()

# No automatic index creation
DT[iso3c %in% codes,verbose=T]
#>    iso3c country income
#> 1:   ZMB  Zambia   LMIC
#> 2:   ALB Albania   UMIC

DT <- DT %>% arrange(iso3c) %>% as.data.table()

# This now works because auto-indexing is off:
DT[iso3c %in% codes,verbose=T]
#>    iso3c country income
#> 1:   ALB Albania   UMIC
#> 2:   ZMB  Zambia   LMIC

I reported this issue on data.table/issues/5042 and on dtplyr/issues/259 : integrated in 1.4.11 milestone.

156

answered Oct 13 '22 18:10

Waldi

Using the tidytable package this doesn't happen (see below). It's now available on CRAN. tidytable allows you to use tidyverse syntax minimally altered (distinct., arrange.) while getting the speed of data.table, which is what OP seems to want overall (and who doesn't!).

library(data.table)
library(tidytable)



DT <-
  fread(
    "iso3c  country income
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
"
  )

codes <- c("ALB", "ZMB")

DT <- distinct.(DT) %>% as.data.table()

# this works like normal
DT[iso3c %in% codes]
#>    iso3c country income
#> 1:   ZMB  Zambia   LMIC
#> 2:   ALB Albania   UMIC

DT <- DT %>% arrange.(iso3c) %>% as.data.table()

# this is no longer wack
DT[iso3c %in% codes]
#>    iso3c country income
#> 1:   ALB Albania   UMIC
#> 2:   ZMB  Zambia   LMIC

# and these work as normal:
DT[(iso3c %in% codes), ]
#>    iso3c country income
#> 1:   ALB Albania   UMIC
#> 2:   ZMB  Zambia   LMIC

DT[DT$iso3c %in% codes, ]
#>    iso3c country income
#> 1:   ALB Albania   UMIC
#> 2:   ZMB  Zambia   LMIC

DT[DT$iso3c %in% codes]
#>    iso3c country income
#> 1:   ALB Albania   UMIC
#> 2:   ZMB  Zambia   LMIC

answered Oct 13 '22 20:10

Fons MA

Related questions
                            
                                Splitting and manipulating nested lists
                            
                                dplyr : how-to programmatically full_join dataframes contained in a list of lists?
                            
                                How can I compile RpostgreSQL with libssl and libpg and SSL activation
                            
                                Separate symbol and color in plotly legend
                            
                                Dplyr produces NaN while base R produces NA
                            
                                Combined fuzzy and exact matching
                            
                                Automatically reloading shiny app when add changes
                            
                                Function that asks a ~special~ something and returns an answer
                            
                                How can i plot multiple isochrone polygons using OSRM in R?
                            
                                Importing matplotlib with reticulate in R
                            
                                R reticulate unable to find installed python library
                            
                                Update a table using subquery in SQLite
                            
                                Data scraping from published Power BI visual
                            
                                Remove progress bar from knitr output
                            
                                Weird output of poisson GLM with an iid random effect in r
                            
                                When is R's assign() function appropriate?
                            
                                How does Branch Prediction affect performance in R?
                            
                                Python shared library not found, Python bindings not loaded. in RStudio on Mac
                            
                                How can I save files in parallel without automatically increasing the file size?
                            
                                Adding an additional legend manually for different data.frames used in the same ggplot

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Dangers of mixing [tidyverse] and [data.table] syntax in R?

Tags:

r

data.table

dplyr

tidyverse

Daycent

People also ask

2 Answers

Waldi

Fons MA

Recent Activity

Donate For Us