
Cache expensive operations in R

Tags:

caching

r

A very simple question:

I am writing and running my R scripts using a text editor to make them reproducible, as has been suggested by several members of SO.

This approach is working very well for me, but I sometimes have to perform expensive operations (e.g. read.csv or reshape on 2M-row databases) that I'd better cache in the R environment rather than re-run every time I run the script (which is usually many times as I progress and test the new lines of code).

Is there a way to cache what a script does up to a certain point so every time I am only running the incremental lines of code (just as I would do by running R interactively)?

Thanks.

Roberto asked Jul 27 '10 19:07


3 Answers

## load the file from disk only if it
## hasn't already been read into a variable
if (!exists("mytable")) {
  mytable <- read.csv(...)
}

Edit: fixed typo - thanks Dirk.
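One caveat with this guard: exists() looks through attached packages as well, so a common name can be "found" before you have loaded anything. The sketch below is only an illustration; the temporary CSV stands in for the real file and is not from the answer. It uses inherits = FALSE to limit the check to the current environment. Keep in mind the guard only saves time while the same R session stays open (for example when you source() the script repeatedly).

```r
# Hypothetical stand-in for a large input file.
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:5, value = (1:5)^2), csv_path, row.names = FALSE)

# Pitfall: exists() also searches attached packages, so a common name such
# as "df" is typically found (stats::df) even though no data was loaded.
found_on_search_path <- exists("df")

# inherits = FALSE restricts the lookup to the current environment.
if (!exists("mytable", inherits = FALSE)) {
  mytable <- read.csv(csv_path)
}

# Running the same guard again would skip the read: 'mytable' now exists.
already_cached <- exists("mytable", inherits = FALSE)
```

Picking a distinctive variable name avoids the same problem without the extra argument.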

chrisamiller answered Oct 21 '22 04:10


Some simple approaches can be built from combinations of

  • exists("foo") to test whether a variable exists, and re-load or re-compute it if not
  • file.info("foo.Rd")$ctime, which you can compare to Sys.time(): if the file is newer than a given age, load it; otherwise recompute.
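The timestamp idea might look roughly like the sketch below. The helper name cached(), its arguments, and the one-hour default are all made up for illustration. It also makes two substitutions worth flagging: it checks mtime (last modification) rather than ctime, and it stores a single object with saveRDS()/readRDS().

```r
# Hypothetical helper: reuse a cached result if the cache file is recent
# enough, otherwise recompute it and refresh the cache file.
cached <- function(path, compute, max_age_secs = 3600) {
  if (file.exists(path)) {
    age <- as.numeric(difftime(Sys.time(), file.info(path)$mtime,
                               units = "secs"))
    if (age < max_age_secs) {
      return(readRDS(path))
    }
  }
  result <- compute()
  saveRDS(result, path)
  result
}

cache_file <- tempfile(fileext = ".rds")  # illustrative cache location

calls <- 0  # counts how often the slow step actually runs
slow_step <- function() {
  calls <<- calls + 1
  data.frame(x = 1:3, y = c(2, 4, 6))
}

first  <- cached(cache_file, slow_step)  # computes and writes the cache
second <- cached(cache_file, slow_step)  # cache is fresh, so it is read back
```

Passing max_age_secs = 0 forces a recompute, which is handy when the source data has changed.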

There are also caching packages on CRAN that may be useful.

Dirk Eddelbuettel answered Oct 21 '22 06:10


After you do something you discover to be costly, save the results of that costly step in an R data file.

For example, if you loaded a csv into a data frame called myVeryLargeDataFrame and then created summary stats from that data frame into a df called VLDFSummary then you could do this:

save(myVeryLargeDataFrame, VLDFSummary,
  file="~/myProject/cachedData/VLDF.RData",
  compress="bzip2")

The compress argument is optional; it selects how the file written to disk is compressed (bzip2 here). See ?save for more details.

After you save the RData file you can comment out the slow data loading and summary steps as well as the save step and simply load the data like this:

load("~/myProject/cachedData/VLDF.RData")

This answer is not editor dependent. It works the same for Emacs, TextMate, etc. You can save to any location on your computer. I recommend keeping the slow code in your R script file, however, so you can always know where your RData file came from and be able to recreate it from the source data if needed.
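If you would rather not comment lines in and out, one option is to let the script check for the cache file itself. This is only a sketch: the tempdir() location and the tiny data frame are placeholders for the real path and the real slow steps.

```r
cache_file <- file.path(tempdir(), "VLDF.RData")  # hypothetical location

if (file.exists(cache_file)) {
  # Fast path: load() restores both objects under their original names.
  load(cache_file)
} else {
  # Slow path: placeholder for the expensive read.csv() and summary steps.
  myVeryLargeDataFrame <- data.frame(g = rep(c("a", "b"), each = 3), v = 1:6)
  VLDFSummary <- aggregate(v ~ g, data = myVeryLargeDataFrame, FUN = mean)
  save(myVeryLargeDataFrame, VLDFSummary,
       file = cache_file, compress = "bzip2")
}
```

Deleting the cache file forces the slow path to run again on the next run, so the script stays the single source of truth for how the data was built.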

JD Long answered Oct 21 '22 04:10