I was wondering if there is a memory efficient way to join n data.tables (or data frames). For example, if I have the following 4 data.tables:
df1 = data.table(group = c(1L,2L,3L),value = rnorm(3),key = "group")
df2 = data.table(group = c(2L,1L,3L),value2 = rnorm(3),key = "group")
df3 = data.table(group = c(3L,2L,1L),value3 = rnorm(3),key = "group")
df4 = data.table(group = c(1L,3L,2L),value4 = rnorm(3),key = "group")
I could merge them like so:
merge(df1,merge(df2,merge(df3,df4)))
but that does not seem like an optimal solution. I might potentially have many data.tables that need to be merged. Is there a way to generalize the above without copying each successive merge to memory? Is there an already accepted way outside of data.table to do this?
Here are some other options you may have, depending on your data. Other options apart from the obvious path of doing a ton of merges, I mean: in a loop, with Reduce or with hadley's join_all/merge_all/wrap_em_all_up.
These are all methods that I have used and found to be faster in my own work, but I don't intend to attempt a general benchmarking case. First, some setup:
DFlist = list(df1,df2,df3,df4)
bycols = key(DFlist[[1]])
I'll assume the tables are all keyed by the bycols.
Stack. If the new cols from each table are somehow related to each other and appear in the same positions in every table, then consider just stacking the data:
DFlong = rbindlist(DFlist, use.names = FALSE, idcol = TRUE)
If for some reason you really want the data in wide format, you can dcast:
dcast(DFlong, 
  formula = sprintf("%s ~ .id", paste(bycols, collapse = "+")), 
  value.var = setdiff(names(DFlong), c(bycols, ".id"))
)
Data.table and R work best with long-format data, though.
Copy cols. If you know that the bycols take all the same values in all of the tables, then just copy over:
DF = DFlist[[1]][, bycols, with=FALSE]
for (k in seq_along(DFlist)){
  newcols = setdiff(names(DFlist[[k]]), bycols)
  DF[, (newcols) := DFlist[[k]][, newcols, with=FALSE]]
}
Merge assign. If some levels of bycols may be missing from certain tables, then make a master table with all combos and do a sequence of merge-assigns:
DF = unique(rbindlist(lapply(DFlist, `[`, j = bycols, with = FALSE)))
for (k in seq_along(DFlist)){
  newcols = setdiff(names(DFlist[[k]]), bycols)
  DF[DFlist[[k]], (newcols) := mget(newcols)]
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With