Is there any way to speed up the following process in R?
theFiles <- list.files(path = "./lca_rs75_summary_logs", full.names = TRUE,
                       pattern = "\\.summarylog$")  # pattern is a regex, not a glob
masterDataFrame <- NULL
for (i in seq_along(theFiles)) {
    tempDataFrame <- read.csv(theFiles[i], sep = "\t", header = TRUE)
    # Drop rows with an empty Name; filtering directly avoids the edge case where
    # which() returns integer(0) and negative indexing then drops every row
    tempDataFrame <- tempDataFrame[tempDataFrame$Name != "", ]
    # Now stack the data frame on the master data frame
    masterDataFrame <- rbind(masterDataFrame, tempDataFrame)
}
Basically, I am reading multiple tab-delimited files in a directory. I want to combine them all into one giant data frame by stacking the rows. The loop takes longer and longer to run as masterDataFrame grows in size. I am doing this on a Linux cluster.
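One standard fix for the quadratic cost of growing masterDataFrame with rbind() on every iteration is to collect the pieces in a list and bind once at the end. A minimal base-R sketch, assuming the same theFiles and Name filter as above:

# Read each file into a list element, then bind all of them in a single call
listOfDataFrames <- lapply(theFiles, function(f) {
    d <- read.csv(f, sep = "\t", header = TRUE)
    d[d$Name != "", ]    # drop rows with an empty Name
})
masterDataFrame <- do.call(rbind, listOfDataFrames)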
Updated answer with data.table::fread
require(data.table)
out = rbindlist(lapply(theFiles, function(file) {
    dt = fread(file)
    # further processing/filtering, e.g. drop the empty-Name rows:
    dt[Name != ""]
}))
fread() automatically detects the header, the file separator, and column classes; it doesn't convert strings to factors by default, handles embedded quotes, and is quite fast. See ?fread for more.
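If the defaults ever need overriding, fread() also takes explicit arguments. A small illustrative sketch for these summary logs (the file name here is hypothetical, not from the original post):

require(data.table)
dt = fread("./lca_rs75_summary_logs/example.summarylog",  # hypothetical file name
           sep = "\t",        # force tab as the separator instead of auto-detection
           header = TRUE,     # first row holds the column names
           na.strings = "")   # read empty fields as NA
dt = dt[!is.na(Name)]         # equivalent of the Name != "" filter above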