Faster way to read multiple CSV files into one data frame?

Tags:

r

Is there any way to speed up the following process in R?

theFiles <- list.files(path="./lca_rs75_summary_logs", full.names=TRUE, pattern="*.summarylog")

listOfDataFrames <- NULL
masterDataFrame <- NULL

for (i in 1:length(theFiles)) {
    tempDataFrame <- read.csv(theFiles[i], sep="\t", header=TRUE)
    #Dropping some unnecessary rows
    toBeRemoved <- which(tempDataFrame$Name == "")
    tempDataFrame <- tempDataFrame[-toBeRemoved,]
    #Now stack the data frame on the master data frame
    masterDataFrame <- rbind(masterDataFrame, tempDataFrame)
}

Basically, I am reading multiple CSV files in a directory and want to combine them into one giant data frame by stacking the rows. The loop takes longer and longer to run as masterDataFrame grows in size. I am doing this on a Linux cluster.
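The slowdown comes from rbind-ing onto the master frame inside the loop: every iteration copies the entire accumulated frame. Even in base R this can be avoided by collecting the pieces in a list and binding once at the end. A minimal, self-contained sketch of that pattern; the two files, paths, and data below are throwaway examples for illustration, not the real summary logs:

```r
# Write two small tab-separated example files (illustrative only)
dir <- tempdir()
f1 <- file.path(dir, "a.summarylog")
f2 <- file.path(dir, "b.summarylog")
write.table(data.frame(Name = c("x", "", "y"), Value = 1:3),
            f1, sep = "\t", row.names = FALSE, quote = FALSE)
write.table(data.frame(Name = c("z", ""), Value = 4:5),
            f2, sep = "\t", row.names = FALSE, quote = FALSE)
theFiles <- c(f1, f2)

# Read every file into a list first, then bind once at the end,
# so the master frame is never copied repeatedly inside a loop
listOfDataFrames <- lapply(theFiles, function(f) {
  d <- read.csv(f, sep = "\t", header = TRUE)
  d[d$Name != "", ]            # drop rows with a blank Name
})
masterDataFrame <- do.call(rbind, listOfDataFrames)
```

This turns the quadratic copy cost of the growing rbind into a single bind over the list.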

asked Apr 11 '13 by WonderSteve
1 Answer

Updated answer with data.table::fread.

require(data.table)
out <- rbindlist(lapply(theFiles, function(file) {
  dt <- fread(file)
  # further processing/filtering
  dt
}))

fread() automatically detects the header, the field separator, and column classes; it doesn't convert strings to factors by default, handles embedded quotes, and is quite fast. See ?fread for more.
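A runnable sketch of that pattern, assuming data.table is installed. The two files are throwaway examples, and idcol = "src" is an optional extra (it records which list element, i.e. which file, each row came from), not part of the original answer:

```r
library(data.table)

# Two small tab-separated example files (illustrative only)
dir <- tempdir()
files <- file.path(dir, c("one.summarylog", "two.summarylog"))
writeLines(c("Name\tValue", "a\t1", "\t2"), files[1])
writeLines(c("Name\tValue", "b\t3"), files[2])

# fread each file, filter inside the anonymous function, bind once
out <- rbindlist(lapply(files, function(f) {
  dt <- fread(f)
  dt[Name != ""]               # same blank-Name filter as the question
}), idcol = "src")
```

rbindlist binds the list of data.tables in one pass, which avoids the repeated copying of a grow-in-loop rbind.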


answered Oct 20 '22 by Arun