Can I cache data loading in R?

Tags:

caching

r

startup

I'm working on an R script which has to load data (obviously). The data loading takes a long time (the file is about 500 MB) and I wonder if I can avoid going through the loading step every time I rerun the script, which I do a lot during development.

I appreciate that I could do the whole thing in an interactive R session, but developing multi-line functions is just so much less convenient at the R prompt.

Example:

#!/usr/bin/Rscript
d <- read.csv("large.csv", header = TRUE)  # 500 MB, takes ~15 seconds
head(d)

How, if possible, can I modify the script so that on subsequent executions d is already available? Is there something like a cache=TRUE option, as in R Markdown code chunks?

asked Aug 21 '14 by TMOTTM


2 Answers

Use the R.cache package. Here the cached object is the result of a WDI query rather than a CSV, but the pattern is the same: build a key, try loadCache(), and compute and saveCache() only on a miss.

    library(R.cache)
    library(WDI)  # the example data comes from the World Bank API via WDI

    start_year <- 2000
    end_year <- 2013
    brics_countries <- c("BR", "RU", "IN", "CN", "ZA")
    indics <- c("NY.GDP.PCAP.CD", "TX.VAL.TECH.CD", "SP.POP.TOTL", "IP.JRN.ARTC.SC",
                "GB.XPD.RSDV.GD.ZS", "BX.GSR.CCIS.ZS", "BX.GSR.ROYL.CD", "BM.GSR.ROYL.CD")

    # The key identifies the query; loadCache() returns NULL on a cache miss.
    key <- list(brics_countries, indics, start_year, end_year)
    brics_data <- loadCache(key)
    if (is.null(brics_data)) {
      brics_data <- WDI(country = brics_countries, indicator = indics,
                        start = start_year, end = end_year, extra = FALSE, cache = NULL)
      saveCache(brics_data, key = key, comment = "brics_data")
    }
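
Applied to the question's CSV, the same pattern might look like this (a minimal sketch; note that keying on the file name alone means the cache will not notice changes to the file's contents):

    library(R.cache)

    key <- list("large.csv")
    d <- loadCache(key)
    if (is.null(d)) {
      d <- read.csv("large.csv", header = TRUE)  # slow path, runs only on a cache miss
      saveCache(d, key = key, comment = "large.csv")
    }
    head(d)
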
answered by Ricardo Padua Soares


Sort of. There are a few answers:

  1. Use a faster CSV reader: fread() in the data.table package is beloved by many. Your load time may come down to a second or two.

  2. Similarly, read the CSV once and then write it in compact binary form via saveRDS(), so that next time you can do readRDS(), which is faster because you do not have to load and parse the data again. (See the sketch after this list.)

  3. Don't read the data but memory-map it via the mmap package. That is more involved but likely very fast; databases use such a technique internally.

  4. Load on demand: e.g. the SOAR package is useful here.
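
A minimal sketch combining points 1 and 2 (assuming data.table is installed; the file name large.rds is an arbitrary choice):

    library(data.table)

    if (file.exists("large.rds")) {
      d <- readRDS("large.rds")   # fast path: binary, no parsing needed
    } else {
      d <- fread("large.csv")     # fread() parses CSV much faster than read.csv()
      saveRDS(d, "large.rds")     # write the binary cache for the next run
    }
    head(d)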

Direct caching, however, is not possible.

Edit: Actually, direct caching "sort of" works if you save your data set with your R session at the end. Many of us advise against that, as reproducible scripts which make the loading explicit are preferable in our view -- but R can help via the load() / save() mechanism (which loads/saves several objects at once, where saveRDS() / readRDS() work on a single object).
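
A minimal sketch of the two mechanisms (the object names d and params and the file names are placeholders):

    # save()/load(): several objects in one file, restored under their original names
    save(d, params, file = "session.RData")
    load("session.RData")     # recreates d and params in the environment

    # saveRDS()/readRDS(): a single object, returned explicitly
    saveRDS(d, "d.rds")
    d2 <- readRDS("d.rds")    # the name on reload is up to you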

answered by Dirk Eddelbuettel