
Fast reading and combining several files using data.table (with fread)

I have several txt files with the same structure. I want to read them into R using fread and then combine them into one bigger dataset.

    ## First collect all file names
    library(data.table)
    all.files <- list.files(path = "C:/Users", pattern = ".txt")

    ## Read data using fread
    readdata <- function(fn) {
        dt_temp <- fread(fn, sep = ",")
        keycols <- c("ID", "date")
        setkeyv(dt_temp, keycols)  # setkeyv() takes the key columns as a character vector
        return(dt_temp)
    }

    ## Then read every file and bind the results
    mylist <- lapply(all.files, readdata)
    mydata <- do.call('rbind', mylist)

The code works fine, but the speed is not satisfactory. Each txt file has 1M observations and 12 fields.

If I use fread to read a single file, it's fast. But with lapply it is extremely slow, and it clearly takes much longer than reading the files one by one. What is going wrong here, and is there any way to improve the speed?

I tried llply from the plyr package, but there was not much speed gain.

Also, is there any syntax in data.table for a vertical join, like rbind or UNION in SQL?

Thanks.

Bigchao asked Jan 16 '14 08:01




1 Answer

Use rbindlist(), which is designed to rbind a list of data.tables together:

    mylist <- lapply(all.files, readdata)
    mydata <- rbindlist(mylist)
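On the question of SQL-style unions: as a rough sketch, rbindlist() behaves like UNION ALL (every row is kept), and wrapping the result in unique() gives something like UNION. The tables `a` and `b` below are made-up illustration data, not the asker's files.

```r
library(data.table)

# Two small tables with the same columns (hypothetical illustration data).
a <- data.table(ID = c(1L, 2L), date = c("2014-01-01", "2014-01-02"))
b <- data.table(ID = c(2L, 3L), date = c("2014-01-02", "2014-01-03"))

# UNION ALL: stack every row, duplicates included.
union_all <- rbindlist(list(a, b))

# UNION: stack, then drop rows that are duplicated across all columns.
union_distinct <- unique(union_all)

nrow(union_all)       # 4
nrow(union_distinct)  # 3
```

If the files do not all list their columns in the same order, rbindlist() also accepts `use.names = TRUE` to match columns by name, and `fill = TRUE` to pad missing columns with NA.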

And as @Roland says, do not set the key in each iteration of your function!

So, in summary, this is best:

    l <- lapply(all.files, fread, sep = ",")
    dt <- rbindlist(l)
    setkey(dt, ID, date)
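For anyone who wants to try the pattern without the original files, here is a minimal self-contained sketch. It assumes three tiny made-up files written to a temporary directory; the directory name `fread_demo`, the file names, and the `value` column are all hypothetical. It also passes `full.names = TRUE` and an anchored pattern to list.files(), which is an addition to the code above so that the paths resolve from any working directory.

```r
library(data.table)

# Write three small comma-separated .txt files to a temporary directory
# (hypothetical data standing in for the real 1M-row files).
tmp_dir <- file.path(tempdir(), "fread_demo")
dir.create(tmp_dir, showWarnings = FALSE)
for (i in 1:3) {
  fwrite(
    data.table(
      ID    = c(2L, 1L) + 10L * i,
      date  = c("2014-01-02", "2014-01-01"),
      value = c(0.5, 1.5)
    ),
    file.path(tmp_dir, sprintf("part%d.txt", i))
  )
}

# full.names = TRUE returns full paths that fread can open directly.
# The pattern is a regular expression, so the dot is escaped and anchored.
all_files <- list.files(tmp_dir, pattern = "\\.txt$", full.names = TRUE)

# Read every file without keying, stack once, then set the key once.
dt <- rbindlist(lapply(all_files, fread, sep = ","))
setkey(dt, ID, date)

key(dt)   # "ID" "date"
nrow(dt)  # 6
```

Keying once at the end means the combined table is sorted a single time, rather than sorting each piece separately and then again after binding.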
Simon O'Hanlon answered Sep 19 '22 19:09
