Can I cache data loading in R?

Tags:

caching

r

startup

I'm working on an R script which has to load data (obviously). The data loading takes a long time (the file is about 500 MB) and I wonder if I can avoid going through the loading step every time I rerun the script, which I do a lot during development.

I appreciate that I could do the whole thing in an interactive R session, but developing multi-line functions is just so much less convenient at the R prompt.

Example:

#!/usr/bin/Rscript
d <- read.csv("large.csv", header = TRUE)  # 500 MB, takes ~15 seconds
head(d)

How, if possible, can I modify the script so that on subsequent executions d is already available? Is there something like a cache=TRUE option, as in R Markdown code chunks?

asked Aug 21 '14 by TMOTTM


2 Answers

Use the R.cache package. Here the cached object is the result of a WDI query rather than a CSV, but the pattern is the same: build a key, try loadCache(), and compute and saveCache() only on a miss.

    library(R.cache)
    library(WDI)  # the example data comes from the World Bank API via WDI

    start_year <- 2000
    end_year <- 2013
    brics_countries <- c("BR", "RU", "IN", "CN", "ZA")
    indics <- c("NY.GDP.PCAP.CD", "TX.VAL.TECH.CD", "SP.POP.TOTL", "IP.JRN.ARTC.SC",
                "GB.XPD.RSDV.GD.ZS", "BX.GSR.CCIS.ZS", "BX.GSR.ROYL.CD", "BM.GSR.ROYL.CD")

    # The key identifies the query; loadCache() returns NULL on a cache miss.
    key <- list(brics_countries, indics, start_year, end_year)
    brics_data <- loadCache(key)
    if (is.null(brics_data)) {
      brics_data <- WDI(country = brics_countries, indicator = indics,
                        start = start_year, end = end_year, extra = FALSE, cache = NULL)
      saveCache(brics_data, key = key, comment = "brics_data")
    }
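
Applied to the question's CSV, the same pattern might look like this (a minimal sketch; note that keying on the file name alone means the cache will not notice changes to the file's contents):

    library(R.cache)

    key <- list("large.csv")
    d <- loadCache(key)
    if (is.null(d)) {
      d <- read.csv("large.csv", header = TRUE)  # slow path, runs only on a cache miss
      saveCache(d, key = key, comment = "large.csv")
    }
    head(d)
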
answered by Ricardo Padua Soares


Sort of. There are a few answers:

  1. Use a faster CSV reader: fread() in the data.table package is beloved by many. Your load time may come down to a second or two.

  2. Similarly, read the CSV once and then write it in compact binary form via saveRDS(), so that next time you can do readRDS(), which is faster because you do not have to load and parse the data again. (See the sketch after this list.)

  3. Don't read the data but memory-map it via the mmap package. That is more involved but likely very fast; databases use such a technique internally.

  4. Load on demand: e.g. the SOAR package is useful here.
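
A minimal sketch combining points 1 and 2 (assuming data.table is installed; the file name large.rds is an arbitrary choice):

    library(data.table)

    if (file.exists("large.rds")) {
      d <- readRDS("large.rds")   # fast path: binary, no parsing needed
    } else {
      d <- fread("large.csv")     # fread() parses CSV much faster than read.csv()
      saveRDS(d, "large.rds")     # write the binary cache for the next run
    }
    head(d)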

Direct caching, however, is not possible.

Edit: Actually, direct caching "sort of" works if you save your data set with your R session at the end. Many of us advise against that, as reproducible scripts which make the loading explicit are preferable in our view -- but R can help via the load() / save() mechanism (which loads/saves several objects at once, where saveRDS() / readRDS() work on a single object).
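
A minimal sketch of the two mechanisms (the object names d and params and the file names are placeholders):

    # save()/load(): several objects in one file, restored under their original names
    save(d, params, file = "session.RData")
    load("session.RData")     # recreates d and params in the environment

    # saveRDS()/readRDS(): a single object, returned explicitly
    saveRDS(d, "d.rds")
    d2 <- readRDS("d.rds")    # the name on reload is up to you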

answered by Dirk Eddelbuettel