I'm getting some very weird behavior from mixing tidyverse and data.table syntax.
For context, I often find myself using tidyverse syntax and then piping back to data.table when I need speed rather than code readability. I know Hadley's working on a new package that uses tidyverse syntax with data.table speed, but from what I see, it's still in its nascent phases, so I haven't been using it.
Anyone care to explain what's going on here? This is very scary for me, as I've probably done this thousands of times without thinking.
library(dplyr); library(data.table)
DT <-
fread(
"iso3c country income
MOZ Mozambique LIC
ZMB Zambia LMIC
ALB Albania UMIC
MOZ Mozambique LIC
ZMB Zambia LMIC
ALB Albania UMIC
"
)
codes <- c("ALB", "ZMB")
# now, what happens if I use a tidyverse function (distinct) and then
# convert back to data.table?
DT <- distinct(DT) %>% as.data.table()
# this works like normal
DT[iso3c %in% codes]
# iso3c country income
# 1: ZMB Zambia LMIC
# 2: ALB Albania UMIC
# now, what happens if I use a different tidyverse function (arrange)
# and then convert back to data.table?
DT <- DT %>% arrange(iso3c) %>% as.data.table()
# this is wack: (!!!!!!!!!!!!)
DT[iso3c %in% codes]
# iso3c country income
# 1: ALB Albania UMIC
# but these work:
DT[(iso3c %in% codes), ]
# iso3c country income
# 1: ZMB Zambia LMIC
# 2: ALB Albania UMIC
DT[DT$iso3c %in% codes, ]
# iso3c country income
# 1: ZMB Zambia LMIC
# 2: ALB Albania UMIC
DT[DT$iso3c %in% codes]
# iso3c country income
# 1: ZMB Zambia LMIC
# 2: ALB Albania UMIC
I came across the same problem on a few occasions, which led me to avoid mixing dplyr with data.table syntax, as I didn't take the time to find out the reason. So thanks for providing an MRE.
Looks like dplyr::arrange is interfering with data.table auto-indexing:
- an index will be used when subsetting a dataset with == or %in% on a single variable
- by default, if an index for a variable is not present on filtering, it is automatically created and used
- indexes are lost if you change the order of the data
- you can check whether you are using an index with options(datatable.verbose=TRUE)
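As a small sketch of the points above (not part of the original reproduction), data.table's indices() lists the secondary indices currently attached to a table, and setindex(DT, NULL) drops them. The toy table below is made up for illustration, and the example assumes auto-indexing is left at its default of TRUE.

```r
library(data.table)

# Hypothetical toy table, for illustration only
DT <- data.table(iso3c  = c("MOZ", "ZMB", "ALB"),
                 income = c("LIC", "LMIC", "UMIC"))

options(datatable.auto.index = TRUE)   # the default, set here to be explicit

idx_before <- indices(DT)              # no index has been created yet

# The first == / %in% subset on a single column builds an index as a side effect
res <- DT[iso3c %in% c("ALB", "ZMB")]
idx_after <- indices(DT)               # should now list "iso3c"

setindex(DT, NULL)                     # drop all indices from DT
idx_dropped <- indices(DT)             # back to NULL
```

Checking indices(DT) before and after a pipeline step is a quick way to see whether a table still carries an index it was given earlier.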
If we explicitly enable auto-indexing:
library(dplyr);
library(data.table)
DT <- fread(
"iso3c country income
MOZ Mozambique LIC
ZMB Zambia LMIC
ALB Albania UMIC
MOZ Mozambique LIC
ZMB Zambia LMIC
ALB Albania UMIC")
codes <- c("ALB", "ZMB")
options(datatable.auto.index = TRUE)
DT <- distinct(DT) %>% as.data.table()
# Index creation because %in% is used for the first time
DT[iso3c %in% codes, verbose = TRUE]
#> Creating new index 'iso3c'
#> Creating index iso3c done in ... forder.c received 3 rows and 3 columns
#> forder took 0 sec
#> 0.060s elapsed (0.060s cpu)
#> Optimized subsetting with index 'iso3c'
#> forder.c received 2 rows and 1 columns
#> forder took 0 sec
#> x is already ordered by these columns, no need to call reorder
#> i.iso3c has same type (character) as x.iso3c. No coercion needed.
#> on= matches existing index, using index
#> Starting bmerge ...
#> bmerge done in 0.000s elapsed (0.000s cpu)
#> Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)
#> Reordering 2 rows after bmerge done in ... forder.c received a vector type 'integer' length 2
#> 0 secs
#> iso3c country income
#> 1: ZMB Zambia LMIC
#> 2: ALB Albania UMIC
# Index mixed up by arrange
DT <- DT %>% arrange(iso3c) %>% as.data.table()
# this is wack because data.table apparently still uses the old index even though the rows were rearranged:
DT[iso3c %in% codes, verbose = TRUE]
#> Optimized subsetting with index 'iso3c'
#> forder.c received 2 rows and 1 columns
#> forder took 0 sec
#> x is already ordered by these columns, no need to call reorder
#> i.iso3c has same type (character) as x.iso3c. No coercion needed.
#> on= matches existing index, using index
#> Starting bmerge ...
#> bmerge done in 0.000s elapsed (0.000s cpu)
#> Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)
#> iso3c country income
#> 1: ALB Albania UMIC
# this works because wrapping the condition in (...) prevents data.table from using the auto-index
DT[(iso3c %in% codes)]
#> iso3c country income
#> 1: ALB Albania UMIC
#> 2: ZMB Zambia LMIC
To avoid this problem, you can disable auto-indexing:
library(dplyr);
library(data.table)
DT <- fread(
"iso3c country income
MOZ Mozambique LIC
ZMB Zambia LMIC
ALB Albania UMIC
MOZ Mozambique LIC
ZMB Zambia LMIC
ALB Albania UMIC")
codes <- c("ALB", "ZMB")
options(datatable.auto.index = FALSE) # Disabled
DT <- distinct(DT) %>% as.data.table()
# No automatic index creation
DT[iso3c %in% codes, verbose = TRUE]
#> iso3c country income
#> 1: ZMB Zambia LMIC
#> 2: ALB Albania UMIC
DT <- DT %>% arrange(iso3c) %>% as.data.table()
# This now works because auto-indexing is off:
DT[iso3c %in% codes, verbose = TRUE]
#> iso3c country income
#> 1: ALB Albania UMIC
#> 2: ZMB Zambia LMIC
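Another option, sketched here under the assumption that the reordering can be done in data.table itself, is to sort with setorder() instead of dplyr::arrange(), so the rows are reordered through data.table's own machinery. If arrange() has to stay in the pipeline, dropping any existing index afterwards with setindex(DT, NULL) is a further possibility; I have not checked that against every affected combination of package versions.

```r
library(data.table)

# Same countries as in the question, built directly to keep the sketch short
DT <- data.table(
  iso3c   = c("MOZ", "ZMB", "ALB"),
  country = c("Mozambique", "Zambia", "Albania"),
  income  = c("LIC", "LMIC", "UMIC")
)
codes <- c("ALB", "ZMB")

# The first subset may create an auto-index on iso3c
first <- DT[iso3c %in% codes]

# Reorder by reference with data.table's own setorder() instead of arrange()
setorder(DT, iso3c)

# Optional extra safety: drop any index so the next subset cannot reuse a stale one
setindex(DT, NULL)

second <- DT[iso3c %in% codes]
```

This keeps auto-indexing switched on globally, so other subsets still benefit from it.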
I reported this issue at data.table/issues/5042 and at dtplyr/issues/259: integrated into the 1.4.11 milestone.
Using the tidytable package, this doesn't happen (see below). It's now available on CRAN. tidytable allows you to use minimally altered tidyverse syntax (distinct., arrange.) while getting the speed of data.table, which is what OP seems to want overall (and who doesn't!).
library(data.table)
library(tidytable)
DT <-
fread(
"iso3c country income
MOZ Mozambique LIC
ZMB Zambia LMIC
ALB Albania UMIC
MOZ Mozambique LIC
ZMB Zambia LMIC
ALB Albania UMIC
"
)
codes <- c("ALB", "ZMB")
DT <- distinct.(DT) %>% as.data.table()
# this works like normal
DT[iso3c %in% codes]
#> iso3c country income
#> 1: ZMB Zambia LMIC
#> 2: ALB Albania UMIC
DT <- DT %>% arrange.(iso3c) %>% as.data.table()
# this is no longer wack
DT[iso3c %in% codes]
#> iso3c country income
#> 1: ALB Albania UMIC
#> 2: ZMB Zambia LMIC
# and these work as normal:
DT[(iso3c %in% codes), ]
#> iso3c country income
#> 1: ALB Albania UMIC
#> 2: ZMB Zambia LMIC
DT[DT$iso3c %in% codes, ]
#> iso3c country income
#> 1: ALB Albania UMIC
#> 2: ZMB Zambia LMIC
DT[DT$iso3c %in% codes]
#> iso3c country income
#> 1: ALB Albania UMIC
#> 2: ZMB Zambia LMIC