
What methods exist for distributing a semi-live dataset with an R package?

Tags: r, packaging

I am building a package for internal use with devtools. I would like the package to load data from a file or connection whose contents differ depending on the date the package is built. The data is large-ish, so I would prefer to pay the cost of parsing and loading it once, while the package is being built.

Currently, I have a data.R file under R/ that assigns the data to package-level variables. The values are assigned during package installation (or at least that is what appears to be happening). This less-than-ideal setup mostly works. To make every installed copy of the package carry the same data, I have to distribute the raw data file with the package (a helper script currently copies it into inst/ before the build) instead of having the parsed data packaged directly. There must be a better way.

Such as:

  • Generate .rda files during package building (but this requires that the same code not run again during package install)
    • I can do this with a Makefile, but that seems like overkill
    • Can I have R code that runs only during package building and not during install?
  • Run R code in data/
    • But the data is munged using code from the package in question. I can fix that with Collate (I think), but then I have to maintain the order of all of the .R files (and with that added complexity I might as well use a Makefile?)
  • Build two packages: one with all of the code I want, one with the data.
  • Obvious, clever things I've not thought of.
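
For the first option, one possible shape is a script that lives outside R/ (for example in a data-raw/ directory excluded via .Rbuildignore) and is run by hand or by a helper script before the build, so it never executes at install time. The sketch below is only an illustration under assumptions: the function name build_snapshot, the object name snapshot, and the CSV input are hypothetical and do not come from the question.

```r
# Hypothetical build-time script, e.g. data-raw/snapshot.R.
# Assumption: the raw data is a CSV file; swap read.csv() for whatever
# file or connection the real data comes from.
build_snapshot <- function(source_file, pkg_dir = ".", name = "snapshot") {
  # Parse the raw data once, at build time rather than at install time.
  raw <- utils::read.csv(source_file, stringsAsFactors = FALSE)

  # Record when the snapshot was taken so installed copies can report it.
  attr(raw, "snapshot_date") <- Sys.Date()

  # data/ is the directory R packages use for shipped datasets.
  data_dir <- file.path(pkg_dir, "data")
  dir.create(data_dir, showWarnings = FALSE, recursive = TRUE)

  # save() stores objects under their variable names, so bind the data
  # to the requested name in a scratch environment first.
  env <- new.env()
  assign(name, raw, envir = env)
  out <- file.path(data_dir, paste0(name, ".rda"))
  save(list = name, envir = env, file = out, compress = "xz")

  invisible(out)
}
```

If the munging depends on functions from the package itself, the script could call devtools::load_all() before build_snapshot(); the resulting data/snapshot.rda is then shipped with the package as a frozen snapshot.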

tl;dr: What are some methods for adding a snapshot of dynamically changing data to an R package frozen for deployment?

asked Dec 28 '12 19:12 by Tyler


1 Answer

As @BenBolker points out in the comments above, splitting the dataset out into a different package has precedent in the community (most notably the core package datasets) and has additional benefits.

Separating the functions from the data also makes it easier to work on historic versions of the data with the up-to-date functions.

I currently have a tools-to-munge package and a things-to-munge package. Using a helper script, I can build the tools-to-munge package and set up a Suggests (or Depends) entry in the DESCRIPTION of both packages that points to the appropriate, incrementing version of the other package. After the new tools-to-munge package has been built, I can build the things-to-munge package as necessary using the functions in the tools-to-munge package.
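
The helper script itself is not shown here, but the DESCRIPTION step can be sketched with base R's read.dcf() and write.dcf(). This is a minimal sketch under assumptions, not the actual script: set_field and pin_tools_version are hypothetical names, and the package names in the usage comment are placeholders.

```r
# Hypothetical helper: set or add one field in a DESCRIPTION matrix
# as returned by read.dcf().
set_field <- function(desc, field, value) {
  if (field %in% colnames(desc)) {
    desc[1, field] <- value
  } else {
    desc <- cbind(desc, matrix(value, nrow = 1, dimnames = list(NULL, field)))
  }
  desc
}

# Hypothetical helper: stamp the data package with a new version and pin
# the version of the tools package it was built with.
# Note: this overwrites any existing content of `field`.
pin_tools_version <- function(desc_file, tools_pkg, tools_version,
                              data_version, field = "Depends") {
  desc <- read.dcf(desc_file)
  desc <- set_field(desc, "Version", data_version)
  desc <- set_field(desc, field,
                    sprintf("%s (>= %s)", tools_pkg, tools_version))
  write.dcf(desc, file = desc_file)
  invisible(desc)
}

# Example call (placeholder names and versions):
# pin_tools_version("thingsToMunge/DESCRIPTION", "toolsToMunge",
#                   tools_version = "1.2.0", data_version = "2012.12.28")
```

Passing field = "Suggests" instead of the default gives the looser coupling mentioned above.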

answered Sep 19 '22 07:09 by Tyler