Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why, for R data.tables dt1 & dt2, does dplyr::bind_rows(dt1, dt2)[con2] not yield the expected result, but rbindlist(dt1, dt2)[con2] does?

Can someone please explain what happens in this interactively run R code? I'm using dplyr 1.1.2 and data.table 1.14.8 on R 3.6.3 on Ubuntu Linux. Note that only after the line implicitly assigning an index to dt1 does this discrepancy occur. Why does dt[cond] give a different result to subset(dt, cond) in this case? And is it just generally "dangerous" to combine data.table and tidyverse functions in a chain?

library(data.table)
library(dplyr)
mydt1 = data.table(year=rep(2017:2018, each=3), month=rep(1:3, times=2))
mydt2 = data.table(year=rep(2016:2017, each=3), month=rep(4:6, times=2))
mydt1[year == 2018]    # this appears to assigns `year` as an index to mydt1 
rbindlist(list(mydt1, mydt2))[year == 2017]
# produces expected output:
#    year month
# 1: 2017     1
# 2: 2017     2
# 3: 2017     3
# 4: 2017     4
# 5: 2017     5
# 6: 2017     6
subset(bind_rows(mydt1, mydt2), year == 2017)
# produces the same output as above
bind_rows(mydt1, mydt2)[year == 2017]
# what happens now?
#    year month
# 1: 2017     1
# 2: 2017     2
# 3: 2017     3
like image 309
climatestudent Avatar asked Dec 14 '25 07:12

climatestudent


1 Answers

What you're seeing is "auto-indexing". With data.table, there can be only one physical index (i.e., setkey), but there can also be secondary indices. As a feature (for faster subetting), whenever a subset is done using strict == or %in%, a secondary index is created and stored with the object.

With data.table objects, when combining tables, the indices are dropped, as one might expect:

indices(mydt1)
# [1] "year"        "year__month"
indices(mydt2)
# NULL
rbindlist(list(mydt1, mydt2)) |>
  indices()
# NULL
rbindlist(list(mydt2, mydt1)) |>
  indices()
# NULL

But with dplyr_1.1.2 (at least, not sure if/when things changed), it is preserving attributes of the first table.

bind_rows(mydt1, mydt2) |>
  indices()
# [1] "year"        "year__month"
bind_rows(mydt2, mydt1) |>
  indices()
# NULL

This latter discovery suggests that we can get the desired results by reversing the tables:

bind_rows(mydt1, mydt2)[year == 2017]
#     year month
#    <int> <int>
# 1:  2017     1
# 2:  2017     2
# 3:  2017     3
bind_rows(mydt2, mydt1)[year == 2017]
#     year month
#    <int> <int>
# 1:  2017     4
# 2:  2017     5
# 3:  2017     6
# 4:  2017     1
# 5:  2017     2
# 6:  2017     3

From the documentation (same link above), we can set the "datatable.use.index" option to prevent this behavior. In a new R session (since I didn't want to spoil things):

options(datatable.use.index = FALSE)
mydt1 = data.table(year=rep(2017:2018, each=3), month=rep(1:3, times=2))
mydt2 = data.table(year=rep(2016:2017, each=3), month=rep(4:6, times=2))
mydt1[year == 2018]
#     year month
#    <int> <int>
# 1:  2018     1
# 2:  2018     2
# 3:  2018     3
indices(mydt1)
# NULL
bind_rows(mydt1, mydt2)[year == 2017]
#     year month
#    <int> <int>
# 1:  2017     1
# 2:  2017     2
# 3:  2017     3
# 4:  2017     4
# 5:  2017     5
# 6:  2017     6

So my guess is that this is not a "bug" in either dplyr or data.table, but it sure would be nice if dplyr "knew" about this secondary index and chose to not pass this one through when it row-binds them together.

like image 131
r2evans Avatar answered Dec 16 '25 21:12

r2evans



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!