I'm working on a R
script which has to load data (obviously). The data loading takes a lot of effort (500MB) and I wonder if I can avoid having to go through the loading step every time I rerun the script, which I do a lot during the development.
I appreciate that I could do the whole thing in the interactive R
session, but developing multi-line functions is just so much less convenient on the R
prompt.
Example:
#!/usr/bin/Rscript
d <- read.csv("large.csv", header=T) # 500 MB ~ 15 seconds
head(d)
How, if possible, can I modify the script, such that on subsequent executions, d
is already available? Is there something like a cache=T
statement as in R
markdown code chunks?
The data in a cache is generally stored in fast access hardware such as RAM (Random-access memory) and may also be used in correlation with a software component. A cache's primary purpose is to increase data retrieval performance by reducing the need to access the underlying slower storage layer.
Cache memory is a chip-based computer component that makes retrieving data from the computer's memory more efficient. It acts as a temporary storage area that the computer's processor can retrieve data from easily.
At the end of a session the objects in the global environment are usually kept in a single binary file in the working directory called . RData. When a new R session begins with this as the initial working directory, the objects are loaded back into memory.
Package ‘R.cache’ R.cache
start_year <- 2000
end_year <- 2013
brics_countries <- c("BR","RU", "IN", "CN", "ZA")
indics <- c("NY.GDP.PCAP.CD", "TX.VAL.TECH.CD", "SP.POP.TOTL", "IP.JRN.ARTC.SC",
"GB.XPD.RSDV.GD.ZS", "BX.GSR.CCIS.ZS", "BX.GSR.ROYL.CD", "BM.GSR.ROYL.CD")
key <- list(brics_countries, indics, start_year, end_year)
brics_data <- loadCache(key)
if (is.null(brics_data)) {
brics_data <- WDI(country=brics_countries, indicator=indics,
start=start_year, end=end_year, extra=FALSE, cache=NULL)
saveCache(brics_data, key=key, comment="brics_data")
}
Sort of. There are a few answers:
Use a faster csv read: fread()
in the data.table()
package is beloved by many. Your time may come down to a second or two.
Similarly, read once as csv and then write in compact binary form via saveRDS()
so that next time you can do readRDS()
which will be faster as you do not have to load and parse the data again.
Don't read the data but memory-map it via package mmap
. That is more involved but likely very fast. Databases uses such a technique internally.
Load on demand, and eg the package SOAR
package is useful here.
Direct caching, however, is not possible.
Edit: Actually, direct caching "sort of" works if you save your data set with your R session at the end. Many of us advise against that as clearly reproducible script which make the loading explicit are preferably in our view -- but R can help via the load()
/ save()
mechanism (which lots several objects at once where saveRSS()
/ readRDS()
work on a single object.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With