Reading multiple csv files faster into data.table R

I have 900,000 csv files which I want to combine into one big data.table. For this I wrote a for loop which reads the files one by one and appends each of them to the data.table. The problem is that it runs too slowly, and the time it takes seems to grow exponentially as the table gets bigger. It would be great if someone could help me make the code run faster. Each of the csv files has 300 rows and 15 columns. The code I am using so far:

library(data.table)
setwd("~/My/Folder")

WD="~/My/Folder"
data<-data.table(read.csv(text="X,Field1,PostId,ThreadId,UserId,Timestamp,Upvotes,Downvotes,Flagged,Approved,Deleted,Replies,ReplyTo,Content,Sentiment"))

csv.list<- list.files(WD)
k=1

for (i in csv.list){
  temp.data<-read.csv(i)
  data<-data.table(rbind(data,temp.data))

  if (k %% 100 == 0)
    print(k/length(csv.list))

  k<-k+1
}
Carlo asked Jul 09 '15 11:07


1 Answer

Presuming your files are conventional csv, I'd use data.table::fread since it's faster. If you're on a Linux-like OS, I would use the fact it allows shell commands. Presuming your input files are the only csv files in the folder I'd do:

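# tail -n +2 prints each file from its second line on, which skips the header row;
# -q suppresses the per-file name banners that tail prints for multiple files
dt <- fread(cmd = "tail -n +2 -q ~/My/Folder/*.csv", header = FALSE)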

You'll need to set the column names manually afterwards.
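
As a minimal sketch of that renaming step: the column names below are copied from the header line in the question, and the small `dt` built here is only a stand-in (an assumption for illustration) for the header-less table a shell-based read would return.

```r
library(data.table)

# Column names taken from the header line in the question.
col.names <- c("X", "Field1", "PostId", "ThreadId", "UserId", "Timestamp",
               "Upvotes", "Downvotes", "Flagged", "Approved", "Deleted",
               "Replies", "ReplyTo", "Content", "Sentiment")

# Stand-in for the header-less result: 2 rows, 15 columns named V1..V15.
dt <- as.data.table(matrix(1:30, nrow = 2))

# setnames() renames by reference, so a large table is not copied.
setnames(dt, col.names)
```

Passing a single character vector to `setnames()` replaces all column names at once, so its length has to match the number of columns.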

If you wanted to keep things in R, I'd use lapply and rbindlist:

lst <- lapply(csv.list, fread)
dt <- rbindlist(lst)
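
A self-contained sketch of the same pattern, for anyone who wants to try it without the original data. The demo folder, file names and columns are hypothetical; only `fread`, `fwrite`, `rbindlist` and `list.files` are real functions.

```r
library(data.table)

# Hypothetical demo folder holding three small csv files with one shared header.
demo.dir <- file.path(tempdir(), "csv_demo")
dir.create(demo.dir, showWarnings = FALSE)
for (j in 1:3) {
  fwrite(data.table(PostId = (1:4) + 10 * j, Upvotes = j),
         file.path(demo.dir, sprintf("part%d.csv", j)))
}

# full.names = TRUE returns complete paths, so no setwd() is needed;
# the pattern keeps non-csv files out of the list.
csv.list <- list.files(demo.dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file into a list first, then bind all pieces once at the end.
lst <- lapply(csv.list, fread)
dt <- rbindlist(lst)
```

Binding once at the end avoids copying the ever-growing table on every iteration, which is what makes the rbind-in-a-loop version slow down as it goes.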

You could also use plyr::ldply:

dt <- setDT(ldply(csv.list, fread))

This has the advantage that you can use .progress = "text" to get a readout of progress in reading.
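
If plyr is not available, a similar progress readout can be sketched with base R's `txtProgressBar`. This is a swapped-in alternative, not the plyr method described above, and the demo files are hypothetical.

```r
library(data.table)

# Hypothetical demo files so the sketch runs on its own.
demo.dir <- file.path(tempdir(), "csv_progress_demo")
dir.create(demo.dir, showWarnings = FALSE)
for (j in 1:5) {
  fwrite(data.table(id = j, value = j * 2),
         file.path(demo.dir, sprintf("f%d.csv", j)))
}
csv.list <- list.files(demo.dir, pattern = "\\.csv$", full.names = TRUE)

# Preallocate the list so the loop never grows an object.
lst <- vector("list", length(csv.list))
pb <- txtProgressBar(min = 0, max = length(csv.list), style = 3)
for (k in seq_along(csv.list)) {
  lst[[k]] <- fread(csv.list[k])
  setTxtProgressBar(pb, k)  # advance the bar after each file
}
close(pb)

dt <- rbindlist(lst)
```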

All of the above assume that the files all have the same format and have a header row.
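
If the files turn out not to share exactly the same columns, `rbindlist` can match columns by name and pad missing ones with `NA`. A small sketch with two made-up tables (the column names are borrowed from the question, the values are invented for illustration):

```r
library(data.table)

# Two pieces whose columns differ in order and number.
a <- data.table(PostId = 1:2, Upvotes = c(5L, 7L))
b <- data.table(Upvotes = 3L, PostId = 3L, Flagged = TRUE)

# use.names = TRUE matches columns by name rather than position;
# fill = TRUE adds columns missing from a piece and fills them with NA.
dt <- rbindlist(list(a, b), use.names = TRUE, fill = TRUE)
```

For files without a header row, `fread(file, header = FALSE)` reads the first line as data instead of column names.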

Nick Kennedy answered Oct 12 '22 01:10