
load new files in directory

Tags:

r

I have an R script that loads multiple text files from a directory and saves the data as a compressed .rda. It looks like this:

#!/usr/bin/Rscript --vanilla

args <- commandArgs(TRUE)
## args[1] is the folder name

outname <- paste(args[1], ".rda", sep = "")

## anchor the pattern so only files ending in .txt are matched
files <- list.files(path = args[1], pattern = "\\.txt$", full.names = TRUE)

tmp <- list()
if (file.exists(outname)) {
  message("found ", outname)
  load(outname)
  tmp <- get(args[1])                  # previously read data
  files <- setdiff(files, names(tmp))  # keep only files not seen before
}

if (length(files) == 0) {
  ## setdiff() returns character(0), not NULL, when nothing is new
  message("no new files")
} else {
  ## read the new files into a list of matrices
  results <- plyr::llply(files, read.table, .progress = "text")
  names(results) <- files

  ## append to the old data and save the combined list under the folder name
  assign(args[1], c(tmp, results))
  message("now saving... ", args[1])
  save(list = args[1], file = outname)
}
message("all done!")

The files are quite large (15 MB each, typically about 50 of them), so running this script can take a few minutes, a substantial part of which is spent writing the .rda results.

I often update the directory with new data files, so I would like to append them to the previously saved, compressed data. That is what the script above does by checking whether an output file with that name already exists. The last step, saving the .rda file, is still quite slow.

Is there a smarter way to go about this in some package, keeping track of which files have already been read, and saving the result faster?
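One direction I considered, sketched below, is to cache each text file as its own .rds and let the file names do the bookkeeping, so only new files ever trigger a read or write. This is just an illustration (the .cache folder layout is made up, not something from an existing package):

#!/usr/bin/Rscript --vanilla
## Hypothetical per-file cache: each .txt gets a matching .rds, so adding a
## file costs one read.table() and one small write, never a full re-save.
args <- commandArgs(TRUE)
dir <- args[1]
cache <- file.path(dir, ".cache")
dir.create(cache, showWarnings = FALSE)

txt <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
rds <- file.path(cache, paste0(basename(txt), ".rds"))

## files with no cached counterpart are the new ones
new <- !file.exists(rds)
for (i in which(new))
  saveRDS(read.table(txt[i]), rds[i], compress = FALSE)

## reassemble the full list cheaply from the per-file caches
results <- lapply(rds, readRDS)
names(results) <- txt

With this layout, "which files have been read" reduces to file.exists(), and nothing is ever re-compressed.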

I saw that knitr uses tools:::makeLazyLoadDB to save its cached computations, but this function is undocumented, so I'm not sure where it makes sense to use it.
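From skimming the knitr source, the call appears to be roughly as follows; since this is an internal, undocumented API (hence the :::), the signature could change between R versions, so treat this as a guess. The names "mycache" and "mydata" are illustrative, my own:

e <- new.env()
e$mydata <- c(tmp, results)            # the combined list of matrices
tools:::makeLazyLoadDB(e, "mycache")   # writes mycache.rdb + mycache.rdx

## later, possibly in a fresh session: objects become promises and are
## only deserialized on first access
lazyLoad("mycache")
str(mydata[[1]])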

asked May 07 '12 by baptiste


1 Answer

For intermediate files that I need to read (or write) often, I use

save(..., compress = FALSE)

which speeds things up considerably.
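For example (illustrative only; the object below just mimics the shape of the question's data):

## ~50 numeric matrices, saved with and without compression
x <- replicate(50, matrix(rnorm(1e5), ncol = 10), simplify = FALSE)

system.time(save(x, file = "slow.rda"))                    # default gzip compression
system.time(save(x, file = "fast.rda", compress = FALSE))  # no compression: faster to write and read

The trade-off is disk space: the uncompressed file can easily be several times larger.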

answered Oct 23 '22 by cbeleites unhappy with SX