Is there a specific method for combining a list of data.tables in R?
I have a list of ~20 data.tables, each with around 1 million rows, and would like to combine them into one data.table with 20 million rows.
I've been doing it with
Reduce("rbind", list_of_datatables)
but it takes a while.
Thanks!
To join two data frames (datasets) vertically, use the rbind function. The two data frames must have the same variables, but they do not have to be in the same order. If data frame A has variables that data frame B does not, then either delete the extra variables in data frame A, or create the additional variables in data frame B and set them to NA (missing).
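For example, a minimal base-R sketch of those two options (dfA and dfB are made-up data frames, not from the question):

dfA <- data.frame(id = 1:3, x = c(10, 20, 30), extra = c("a", "b", "c"))
dfB <- data.frame(id = 4:6, x = c(40, 50, 60))

# Option 1: drop the extra variable from dfA before stacking
combined <- rbind(dfA[, c("id", "x")], dfB)

# Option 2: add the missing variable to dfB as NA, then stack
dfB$extra <- NA
combined <- rbind(dfA, dfB)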
Use the Merge Data Table activity if the number of columns is the same in both data tables. If not, you need to iterate through each row of one data table and assign the data to the other data table, or you can use VBScript to add the first column from one data table to the other.
See ?rbindlist
and these related questions (easier to find when you know what to search for!):
data.table questions and answers containing rbindlist
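For completeness, a small sketch of calling rbindlist on a list of data.tables (the toy list below is made up for illustration; use.names and fill are optional arguments of rbindlist):

library(data.table)

# Stand-in for the ~20 large data.tables from the question
dt_list <- list(
  data.table(x = 1:3, y = runif(3)),
  data.table(x = 4:6, y = runif(3))
)

# Stack every element of the list in one call
combined <- rbindlist(dt_list)

# Match columns by name and pad any missing columns with NA
combined <- rbindlist(dt_list, use.names = TRUE, fill = TRUE)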
Using do.call appears to be about 10x faster with this made-up example:
library(data.table)
x1 <- data.table(x = runif(1e6), y = runif(1e6))
x2 <- data.table(x = runif(1e6), y = runif(1e6))

# 20 data.tables, each with 1e6 rows
yourList <- list(x1, x2, x1, x2, x1, x2, x1, x2, x1, x2,
                 x1, x2, x1, x2, x1, x2, x1, x2, x1, x2)

system.time(out1 <- Reduce("rbind", yourList))
#-----
#   user  system elapsed
#   3.37    3.03    6.43

system.time(out2 <- do.call("rbind", yourList))
#-----
#   user  system elapsed
#   0.33    0.36    0.68

all.equal(out1, out2)
#-----
# [1] TRUE
I did not realize data.table
had a specific function for this task. Par for the course, it is quite fast. Here is the relevant timing:
system.time(out3 <- rbindlist(yourList))
#-----
#   user  system elapsed
#   0.07    0.03    0.11

all.equal(out1, out3)
#-----
# [1] TRUE