I have 900,000 csv files which I want to combine into one big data.table. For this I created a for loop which reads every file one by one and adds it to the data.table. The problem is that it performs too slowly, and the amount of time used seems to grow exponentially. It would be great if someone could help me make the code run faster. Each of the csv files has 300 rows and 15 columns.
The code I am using so far:
library(data.table)
setwd("~/My/Folder")
WD <- "~/My/Folder"
data <- data.table(read.csv(text = "X,Field1,PostId,ThreadId,UserId,Timestamp,Upvotes,Downvotes,Flagged,Approved,Deleted,Replies,ReplyTo,Content,Sentiment"))
csv.list <- list.files(WD)
k <- 1
for (i in csv.list) {
  temp.data <- read.csv(i)
  data <- data.table(rbind(data, temp.data))
  if (k %% 100 == 0)
    print(k / length(csv.list))
  k <- k + 1
}
Presuming your files are conventional csv, I'd use data.table::fread since it's faster. If you're on a Linux-like OS, I would use the fact that it allows shell commands. Presuming your input files are the only csv files in the folder, I'd do:
dt <- fread("tail -n +2 -q ~/My/Folder/*.csv")
You'll need to set the column names manually afterwards.
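Since the shell command strips the header line from every file, one way to restore the names is setnames(), which renames columns by reference. A minimal sketch, where a small two-column table stands in for the real 15-column result:

```r
library(data.table)

# Stand-in for the table returned by the shell-based fread above,
# which arrives with default names V1, V2, ...
dt <- data.table(V1 = 1:2, V2 = c("a", "b"))

# setnames() renames the columns by reference (no copy of the data)
setnames(dt, c("X", "Field1"))
```

With the real data you'd pass the full 15-name vector from the question (X, Field1, PostId, and so on).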
If you wanted to keep things in R, I'd use lapply and rbindlist:
lst <- lapply(csv.list, fread)
dt <- rbindlist(lst)
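The reason this is so much faster than the loop in the question is that rbindlist() allocates the final table once, whereas rbind() inside a loop recopies the ever-growing table on every iteration. A self-contained sketch, with two throwaway temp files standing in for the 900,000 real ones:

```r
library(data.table)

# Two temp files standing in for the real inputs
f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")
fwrite(data.table(a = 1:3, b = letters[1:3]), f1)
fwrite(data.table(a = 4:6, b = letters[4:6]), f2)

# Read every file into a list, then bind once.
# use.names/fill guard against files whose columns differ in order
# or are missing entirely.
dt <- rbindlist(lapply(c(f1, f2), fread), use.names = TRUE, fill = TRUE)
```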
You could also use plyr::ldply:
dt <- setDT(ldply(csv.list, fread))
This has the advantage that you can use .progress = "text" to get a readout of progress while reading.
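Putting that together, and remembering that ldply() lives in the plyr package (which has to be loaded first), the call with a progress bar would look something like this sketch, again with a temp file standing in for the real folder contents:

```r
library(plyr)        # for ldply()
library(data.table)  # for fread() and setDT()

# Temp file standing in for the real ~/My/Folder contents
f1 <- tempfile(fileext = ".csv")
fwrite(data.table(a = 1:3), f1)
csv.list <- c(f1)

# .progress = "text" prints a text progress bar as files are read;
# setDT() then converts the resulting data.frame to a data.table in place
dt <- setDT(ldply(csv.list, fread, .progress = "text"))
```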
All of the above assume that the files all have the same format and have a header row.