I have several different txt files with the same structure. Now I want to read them into R using fread, and then union them into a bigger dataset.
    ## First put all file names into a list
    library(data.table)
    all.files <- list.files(path = "C:/Users", pattern = ".txt")

    ## Read data using fread
    readdata <- function(fn) {
      dt_temp <- fread(fn, sep = ",")
      keycols <- c("ID", "date")
      setkeyv(dt_temp, keycols)  # Notice there's a "v" after setkey with multiple keys
      return(dt_temp)
    }

    # then using
    mylist <- lapply(all.files, readdata)
    mydata <- do.call('rbind', mylist)
The code works fine, but the speed is not satisfactory. Each txt file has 1M observations and 12 fields.
If I use fread to read a single file, it's fast. But with lapply it is extremely slow, and obviously takes much more time than reading the files one by one. I wonder what went wrong here. Is there any way to improve the speed?
I tried llply from the plyr package, but there was not much speed gain.
Also, is there any syntax in data.table to achieve a vertical join, like rbind or union in SQL?
Thanks.
For files beyond 100 MB in size, fread() and read_csv() can be expected to be around 5 times faster than read.csv().
Not only was fread() almost 2.5 times faster than readr's functionality in reading and binding the data, but perhaps even more importantly, the maximum used memory was only 15.25 GB, compared to readr's 27 GB. Interestingly, even though very slow, base R also spent less memory than the tidyverse suite.
Its fread() function is meant to import data from regular delimited files directly into R, without any detours or nonsense. One of the great things about this function is that controls expressed in arguments such as sep, colClasses and nrows are automatically detected.
Use rbindlist(), which is designed to rbind a list of data.tables together...
    mylist <- lapply(all.files, readdata)
    mydata <- rbindlist(mylist)
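On the rbind/union part of the question: as far as I know, rbindlist() keeps every row, so it behaves like SQL's UNION ALL, and wrapping the result in unique() gives UNION-style de-duplication. A minimal sketch with two made-up tables (dt1, dt2 and their values are illustrative, not from the question):

```r
library(data.table)

# Two small tables with the same structure (hypothetical data)
dt1 <- data.table(ID = c(1L, 2L),
                  date = as.IDate(c("2020-01-01", "2020-01-02")),
                  value = c(10, 20))
dt2 <- data.table(ID = c(2L, 3L),
                  date = as.IDate(c("2020-01-02", "2020-01-03")),
                  value = c(20, 30))

# UNION ALL: stack every row, duplicates included
union_all <- rbindlist(list(dt1, dt2), use.names = TRUE)

# UNION: stack, then drop rows that are duplicated across all columns
union_distinct <- unique(union_all)
```

For the two-table case, data.table also provides funion(), which should give the same de-duplicated result.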
And as @Roland says, do not set the key in each iteration of your function!
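To see why keying each piece is wasted work, here is a small sketch on toy tables (a and b are hypothetical): each piece gets sorted by setkey(), but the combined table comes back from rbindlist() without a key, so it has to be keyed again anyway.

```r
library(data.table)

# Two toy pieces, deliberately unsorted (hypothetical data)
a <- data.table(ID = c(2L, 1L), date = as.IDate(c("2020-01-02", "2020-01-01")))
b <- data.table(ID = c(4L, 3L), date = as.IDate(c("2020-01-04", "2020-01-03")))

# Keying each piece sorts it, which costs time on large inputs
setkey(a, ID, date)
setkey(b, ID, date)

combined <- rbindlist(list(a, b))

# Key of the combined table right after binding; the per-piece keys are not carried over
key_before <- key(combined)

# So the key has to be set once on the final table regardless
setkey(combined, ID, date)
```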
So in summary, this is best:

    l <- lapply(all.files, fread, sep = ",")
    dt <- rbindlist(l)
    setkey(dt, ID, date)
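If you want to try the whole pattern without the real files, something like the following should work. It is only a sketch: the temp directory, the part1.txt to part3.txt files and the source_file column are stand-ins I made up, and naming the file vector plus idcol is an optional extra for tracking which file each row came from.

```r
library(data.table)

# Write three small comma-separated .txt files to a temp directory
# (stand-ins for the real files; names and contents are hypothetical)
tmp_dir <- file.path(tempdir(), "fread_demo")
dir.create(tmp_dir, showWarnings = FALSE)
for (i in 1:3) {
  fwrite(data.table(ID = 3:1, date = as.IDate("2020-01-01") + i, x = i),
         file.path(tmp_dir, sprintf("part%d.txt", i)))
}

# full.names = TRUE returns full paths, so the working directory does not matter
all.files <- list.files(tmp_dir, pattern = "\\.txt$", full.names = TRUE)
names(all.files) <- basename(all.files)

# Read every file, bind once, key once
l  <- lapply(all.files, fread, sep = ",")
dt <- rbindlist(l, use.names = TRUE, idcol = "source_file")
setkey(dt, ID, date)
```

Because the list is named, idcol fills source_file with the file name for each row.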