
Dangers of mixing [tidyverse] and [data.table] syntax in R?

I'm getting some very weird behavior from mixing tidyverse and data.table syntax. For context, I usually write tidyverse syntax for readability and pipe back to data.table when I need speed. I know Hadley's working on a new package that uses tidyverse syntax with data.table speed, but from what I see, it's still in its nascent phases, so I haven't been using it.

Anyone care to explain what's going on here? This is very scary for me, as I've probably done this thousands of times without thinking.

library(dplyr); library(data.table)
DT <-
  fread(
    "iso3c  country income
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
"
  )

codes <- c("ALB", "ZMB")

# now, what happens if I use a tidyverse function (distinct) and then
# convert back to data.table?
DT <- distinct(DT) %>% as.data.table()

# this works like normal
DT[iso3c %in% codes]
# iso3c country income
# 1:   ZMB  Zambia   LMIC
# 2:   ALB Albania   UMIC

# now, what happens if I use a different tidyverse function (arrange) 
# and then convert back to data.table?
DT <- DT %>% arrange(iso3c) %>% as.data.table()

# this is wack: (!!!!!!!!!!!!)
DT[iso3c %in% codes]
# iso3c country income
# 1:   ALB Albania   UMIC

# but these work:
DT[(iso3c %in% codes), ]
# iso3c country income
# 1:   ZMB  Zambia   LMIC
# 2:   ALB Albania   UMIC
DT[DT$iso3c %in% codes, ]
# iso3c country income
# 1:   ZMB  Zambia   LMIC
# 2:   ALB Albania   UMIC
DT[DT$iso3c %in% codes]
# iso3c country income
# 1:   ZMB  Zambia   LMIC
# 2:   ALB Albania   UMIC

asked Jun 11 '21 15:06 by Daycent



2 Answers

I came across the same problem on a few occasions, which led me to avoid mixing dplyr with data.table syntax, as I didn't take the time to find out the reason. So thanks for providing an MRE.

It looks like dplyr::arrange is interfering with data.table auto-indexing:

  • an index is used when subsetting a dataset with == or %in% on a single variable
  • by default, if no index exists for the variable being filtered on, one is created automatically and then used
  • indexes are lost if you change the order of the data
  • you can check whether an index is being used with options(datatable.verbose=TRUE)
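
The index lifecycle described in the list above can be observed directly with data.table's indices() helper. This is a minimal sketch, assuming auto-indexing is enabled (set explicitly here); the three-row table is made up for illustration:

```r
library(data.table)

# Make sure auto-indexing is on (these are the data.table defaults)
options(datatable.auto.index = TRUE, datatable.use.index = TRUE)

# Made-up three-row table, for illustration only
DT <- data.table(
  iso3c  = c("MOZ", "ZMB", "ALB"),
  income = c("LIC", "LMIC", "UMIC")
)
codes <- c("ALB", "ZMB")

before <- indices(DT)          # no index has been created yet
res <- DT[iso3c %in% codes]    # single-column %in% filter triggers auto-indexing
after_filter <- indices(DT)    # name of the automatically created index

setorder(DT, iso3c)            # data.table's own reordering, done by reference
after_reorder <- indices(DT)   # the index is dropped instead of being left stale
```

Comparing before, after_filter and after_reorder shows when the index appears and disappears. A function that reorders rows without going through data.table gets no chance to do that cleanup.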

If we explicitly enable auto-indexing (it is on by default):

library(dplyr)
library(data.table)

DT <- fread(
"iso3c  country income
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC")
codes <- c("ALB", "ZMB")

options(datatable.auto.index = TRUE)

DT <- distinct(DT) %>% as.data.table()

# Index creation because %in% is used for the first time
DT[iso3c %in% codes, verbose = TRUE]
#> Creating new index 'iso3c'
#> Creating index iso3c done in ... forder.c received 3 rows and 3 columns
#> forder took 0 sec
#> 0.060s elapsed (0.060s cpu) 
#> Optimized subsetting with index 'iso3c'
#> forder.c received 2 rows and 1 columns
#> forder took 0 sec
#> x is already ordered by these columns, no need to call reorder
#> i.iso3c has same type (character) as x.iso3c. No coercion needed.
#> on= matches existing index, using index
#> Starting bmerge ...
#> bmerge done in 0.000s elapsed (0.000s cpu) 
#> Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu) 
#> Reordering 2 rows after bmerge done in ... forder.c received a vector type 'integer' length 2
#> 0 secs
#>    iso3c country income
#> 1:   ZMB  Zambia   LMIC
#> 2:   ALB Albania   UMIC

# Index mixed up by arrange
DT <- DT %>% arrange(iso3c) %>% as.data.table()

# Wrong result: data.table apparently still uses the old index, although the rows have been reordered
DT[iso3c %in% codes, verbose = TRUE]
#> Optimized subsetting with index 'iso3c'
#> forder.c received 2 rows and 1 columns
#> forder took 0 sec
#> x is already ordered by these columns, no need to call reorder
#> i.iso3c has same type (character) as x.iso3c. No coercion needed.
#> on= matches existing index, using index
#> Starting bmerge ...
#> bmerge done in 0.000s elapsed (0.000s cpu) 
#> Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)
#>    iso3c country income
#> 1:   ALB Albania   UMIC

# This works because wrapping i in (...) prevents data.table from using the auto-index
DT[(iso3c %in% codes)]
#>    iso3c country income
#> 1:   ALB Albania   UMIC
#> 2:   ZMB  Zambia   LMIC

To avoid this problem, you can disable auto-indexing:

library(dplyr)
library(data.table)

DT <- fread(
"iso3c  country income
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC")
codes <- c("ALB", "ZMB")

options(datatable.auto.index = FALSE) # Disabled

DT <- distinct(DT) %>% as.data.table()

# No automatic index creation
DT[iso3c %in% codes, verbose = TRUE]
#>    iso3c country income
#> 1:   ZMB  Zambia   LMIC
#> 2:   ALB Albania   UMIC

DT <- DT %>% arrange(iso3c) %>% as.data.table()

# This now works because auto-indexing is off:
DT[iso3c %in% codes, verbose = TRUE]
#>    iso3c country income
#> 1:   ALB Albania   UMIC
#> 2:   ZMB  Zambia   LMIC
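
If switching auto-indexing off globally is too broad, a narrower option is to drop the indexes on just the table that has passed through a dplyr verb, using setindex(DT, NULL). This is a sketch rather than something taken from the bug reports, and it assumes the stale index is the only thing arrange() leaves behind:

```r
library(dplyr)
library(data.table)

DT <- data.table(
  iso3c   = c("MOZ", "ZMB", "ALB"),
  country = c("Mozambique", "Zambia", "Albania"),
  income  = c("LIC", "LMIC", "UMIC")
)
codes <- c("ALB", "ZMB")

# First filter: with auto-indexing on, this may create an index on iso3c
invisible(DT[iso3c %in% codes])

# Reorder with dplyr, convert back, then remove every index on the table
DT <- DT %>% arrange(iso3c) %>% as.data.table()
setindex(DT, NULL)

# The filter now has no old index to rely on
res <- DT[iso3c %in% codes]
```

The cost is that data.table has to rebuild the index on the next filter, which is negligible for small tables.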

I reported this issue as data.table/issues/5042 and dtplyr/issues/259; it has been added to the 1.4.11 milestone.
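
Another way to sidestep the problem is to do both steps with data.table's own functions: unique() in place of distinct() and setorder() in place of arrange(). This is a sketch offered as an assumption, not something from the linked issues. data.table then manages its own index when the rows move:

```r
library(data.table)

# Same data as in the question, built with data.table() instead of fread()
DT <- data.table(
  iso3c   = rep(c("MOZ", "ZMB", "ALB"), times = 2),
  country = rep(c("Mozambique", "Zambia", "Albania"), times = 2),
  income  = rep(c("LIC", "LMIC", "UMIC"), times = 2)
)
codes <- c("ALB", "ZMB")

DT <- unique(DT)                # data.table method, stands in for distinct()
first <- DT[iso3c %in% codes]   # may create an auto-index on iso3c

setorder(DT, iso3c)             # stands in for arrange(); reorders by reference
second <- DT[iso3c %in% codes]  # both matching rows, now in sorted order
```

Here first holds ZMB and ALB in the original row order, and second holds ALB and ZMB after sorting.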

answered Oct 13 '22 18:10 by Waldi


With the tidytable package this doesn't happen (see below). It's now available on CRAN. tidytable lets you use tidyverse syntax, minimally altered (distinct., arrange.), while getting the speed of data.table, which is what OP seems to want overall (and who doesn't!).

library(data.table)
library(tidytable)



DT <-
  fread(
    "iso3c  country income
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
"
  )

codes <- c("ALB", "ZMB")

DT <- distinct.(DT) %>% as.data.table()

# this works like normal
DT[iso3c %in% codes]
#>    iso3c country income
#> 1:   ZMB  Zambia   LMIC
#> 2:   ALB Albania   UMIC

DT <- DT %>% arrange.(iso3c) %>% as.data.table()

# this is no longer wack
DT[iso3c %in% codes]
#>    iso3c country income
#> 1:   ALB Albania   UMIC
#> 2:   ZMB  Zambia   LMIC

# and these work as normal:
DT[(iso3c %in% codes), ]
#>    iso3c country income
#> 1:   ALB Albania   UMIC
#> 2:   ZMB  Zambia   LMIC

DT[DT$iso3c %in% codes, ]
#>    iso3c country income
#> 1:   ALB Albania   UMIC
#> 2:   ZMB  Zambia   LMIC

DT[DT$iso3c %in% codes]
#>    iso3c country income
#> 1:   ALB Albania   UMIC
#> 2:   ZMB  Zambia   LMIC

answered Oct 13 '22 20:10 by Fons MA