The CRAN policy limits R package size to 5 Mb, which is very little for graphical applications such as mapping. There are multiple ways of handling this size limitation, all of which come with their own drawbacks.
My question is: how can an R package download data files only once (i.e. save them to a place where R finds them after restarting)? The solution should work on all common CRAN platforms.
I have been developing a mapping package for R that is supposed to plot bathymetric maps anywhere around the globe in ggplot2. Below I list the alternatives for handling large data files in CRAN packages that I have come across. The alternatives are written with map-making in mind, but they apply to any case where large single files are required:
Moving large files to a data package and making the original package depend on the data package. If the data package is small enough for CRAN, users can install both packages with the install.packages() function as they would with any other CRAN package. Things work CRANtastic and everyone is happy. If the data package exceeds the size limit, it can be hosted in a custom repository (such as a drat repository on GitHub). This still allows the user to use install.packages() to install the original package from CRAN, but it also has quite a few disadvantages for the developer. Setting up the data package to pass all CRAN checks can be slightly challenging, as all the steps have not been correctly specified anywhere online at the moment: the original package has to ask for permission to install the data package; the data package has to be distributed as separate binaries for the current development version of R, at least for Windows and Mac but possibly also for Fedora, in the drat repository; and the data package should be listed under Suggests: with a URL under Additional_repositories: in the DESCRIPTION file (see the sketch after this paragraph), to mention some surprises I have encountered so far.
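For illustration, the relevant DESCRIPTION fields could look something like this (the data package name and drat URL are made up):

Suggests: yourdatapkg
Additional_repositories: https://yourname.github.io/drat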
All in all, this alternative is great for the user but requires maintenance from the developer.
Some mapping packages (such as marmap) download data to temporary files from external servers. This approach has the benefit that CRAN requirements are easy to fulfill and that the user does not have to store any more data than required for the application. The approach also allows specifying the resolution in the download function, which is great for "zooming" the maps. The first disadvantage is that the process is bound to take more time than simply storing the map data locally. Another disadvantage is that the map data need to be distributed in raster format (or the server has to crop the vectors). At the time of writing, vector data allow easier manipulation of colors and styles in R and ggplot2 than raster data. Vectors also make sharper figures, as the elements are not bound to resolution. The third disadvantage is that the downloads (to my knowledge) have to be targeted at temporary files (i.e. they get lost when R is restarted) when writing a CRAN package, due to operating system differences. As far as I know, it is not allowed to add .Rdata files to already downloaded and installed R packages, and finding a download location that works on all major CRAN operating systems can be difficult.
I keep getting rejected by CRAN time after time because I have not managed to solve the data download problem. There is some help available online, but I feel this issue has not been sufficiently addressed yet. The optimal solution would download sp vector shapefiles as needed when making maps (the objects can be stored in .Rdata format). This would allow adding detailed maps for certain frequently needed regions. The shapefiles could be stored on GitHub, which would allow quick and flexible modification of these files during development.
You could have a function to install the data at a chosen location and store the path in an option defined in your .Rprofile: options(yourpackage.datapath = yourpath). You might suggest that the user store it in your package installation path.
The installing function first prints the code above and proposes that you copy and paste it into your .Rprofile while the data is downloading:
if (is.null(getOption("yourpackage.datapath")))
  stop('You have not defined the "yourpackage.datapath" option. Please make sure the data is installed using `yourpackage::install_yourdata()`, then copy `options(yourpackage.datapath = yourpath)` to your .Rprofile.')
You could also open it using edit(), for instance, or place the line in the clipboard for them, but you don't want extra dependencies and I think you'd need some to do this. I don't think CRAN will let you edit the .Rprofile automatically, but this is not too bad as a manual action. The installation function could check that the option is set before even downloading.
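A minimal sketch of such an installing function (the URL, file name and function name are made up, and tools::R_user_dir() needs R >= 4.0):

install_yourdata <- function(datapath = tools::R_user_dir("yourpackage", "data")) {
  dir.create(datapath, recursive = TRUE, showWarnings = FALSE)
  # hypothetical data URL; replace with the real location of your file
  utils::download.file("https://example.com/yourdata.rds",
                       file.path(datapath, "yourdata.rds"), mode = "wb")
  # print the line the user should copy to their .Rprofile
  message("Add this line to your .Rprofile:\n",
          sprintf('options(yourpackage.datapath = "%s")', datapath))
}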
The data can be stored in a global variable of your namespace. You just need to define an environment object in your package and a function to modify it:
globals <- new.env()  # package-level environment that will hold the loaded data
load_data <- function(path) globals$data <- readRDS(path)
Then your functions will test if globals$data is NULL before either loading the data (after checking if the path option was set properly) or moving on.
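A sketch of what such a test could look like, reusing the globals and load_data defined above (the file name is made up):

get_data <- function() {
  if (is.null(globals$data)) {
    path <- getOption("yourpackage.datapath")
    if (is.null(path))
      stop('Set the "yourpackage.datapath" option first (see above).')
    # load once per session; later calls reuse globals$data
    load_data(file.path(path, "yourdata.rds"))
  }
  globals$data
}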
Once it's done, as long as the data or the .Rprofile are not removed, it will work forever, and if they are removed the functions will catch it and give instructions on how to fix the issue.
Another option here is to load the data in .onLoad; it means you'll have some logic in there to deal with the first time the package is loaded. As .onLoad knows the installation path through the libname argument, you can even download your data there and load it right after checking that it's there (using a global variable as above), so there is no need for options and the .Rprofile.
As long as the user is prompted I think it will be fine with CRAN.
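A rough sketch of that .onLoad logic, with a prompt so the user agrees to the download (the URL and file name are made up):

.onLoad <- function(libname, pkgname) {
  # libname/pkgname is the installation directory of the package
  datafile <- file.path(libname, pkgname, "yourdata.rds")
  if (!file.exists(datafile) && interactive() &&
      isTRUE(utils::askYesNo("Download the yourpackage data files?"))) {
    utils::download.file("https://example.com/yourdata.rds", datafile, mode = "wb")
  }
  if (file.exists(datafile)) globals$data <- readRDS(datafile)
}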
Have you tried using xz compression to reduce the size of your sysdata? I believe the default is gzip, with the compression level set to 6. If you use either bzip2 or xz compression when saving your package data with save(), R will use these compression algorithms in conjunction with a compression level of 9. The upshot is that you get smaller package data objects.
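For instance, something along these lines (the object name and paths are placeholders):

# save package data with xz compression instead of the default gzip
save(mydata, file = "R/sysdata.rda", compress = "xz")
# inspect or recompress existing .rda files:
tools::checkRdaFiles("R")
tools::resaveRdaFiles("R", compress = "xz")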
The getNOAA.bathy() function from the marmap package has a keep argument which defaults to FALSE. If set to TRUE, the dataset downloaded from the ETOPO1 database on NOAA servers is stored locally, in the working directory of the current R session. The path argument allows the user to specify where the dataset should be saved (version 1.0.5, available on GitHub but not on CRAN yet).
When the user calls getNOAA.bathy(), the function first checks if the requested data is available locally, either in the current working directory or in the user-provided path. If it is (same bounding box and resolution), the NOAA servers are not queried and the local data file is loaded instead. If not, the data is downloaded from the NOAA servers. IMHO, this method has the following advantages:
- keep=FALSE: nothing is stored locally, which avoids adding too much clutter to the user's disk when loading many different test datasets.
- keep=TRUE: the data is stored locally. Loading the data will be much faster the next time (and it can be done offline) since everything happens locally.
- The same getNOAA.bathy() function is used both to download data from NOAA servers and to load local files when they are available. The user does not have to worry about manually saving the data, nor about altering his/her script to load local data the next time, since the function automatically loads the data from the most appropriate source (web server or local disk).
As far as I can tell, the only drawback is that on Windows machines, paths are limited to 250 characters, which might cause some trouble when generating filenames to save the data. Indeed, depending on the bounding box and resolution of the data downloaded from NOAA servers, filenames can be pretty long due to floating point arithmetic. An easy fix is to round the coordinates of the bounding box (using round(), ceiling() or floor()) to a few decimal places before generating the name of the file to save.
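A quick usage sketch (the path argument requires the GitHub version 1.0.5 mentioned above; the bounding box is arbitrary):

library(marmap)
# first call downloads from the NOAA servers and keeps a local copy
bat <- getNOAA.bathy(lon1 = -10, lon2 = 5, lat1 = 44, lat2 = 52,
                     resolution = 4, keep = TRUE, path = "bathy_data")
# same bounding box and resolution: loaded from "bathy_data", NOAA not queried
bat <- getNOAA.bathy(lon1 = -10, lon2 = 5, lat1 = 44, lat2 = 52,
                     resolution = 4, keep = TRUE, path = "bathy_data")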
In general I wouldn't make it too hacky. I think there could be ways to trick the package into loading additional data online during installation and adding it to the package itself. That would be somewhat nice, but I don't think it would be popular with the CRAN maintainers.
What about the following?
In the CRAN package you import devtools, and in the .onLoad method you install the GitHub data package with devtools::install_github. (.onLoad is called when the package is loaded with library() or require().) You see this sometimes with package startup messages.
I could imagine this having several advantages. An implementation could look like this:
#' @import devtools
.onLoad <- function(libname, pkgname) {
  # install the data package from GitHub the first time the package is loaded
  if (!"wordcloud" %in% rownames(utils::installed.packages())) {
    message("Installing super duper data package")
    devtools::install_github("ifellows/wordcloud")
  } else {
    require(wordcloud)
    message("Everything fine, ready for usage!")
  }
}
The .onLoad function just has to be placed in any of your .R files. For your concrete implementation you could also refine this further. I don't have anything to do with the wordcloud package; it was just the first thing I quickly found on GitHub as an example to install with install_github.
If there is an error message about staged install, you have to add StagedInstall: no to your DESCRIPTION file.
Two alternatives that might be of interest:
Create an additional install function that installs the data package(s) from GitHub. The rnaturalearth package has a great example with its install_rnaturalearthhires function.
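A sketch of what such a function could look like (the package and repository names are made up):

install_yourdata_hires <- function() {
  # install the data package from GitHub only if it is not already present
  if (!requireNamespace("yourdatahires", quietly = TRUE)) {
    message("Installing the yourdatahires data package from GitHub")
    remotes::install_github("yourname/yourdatahires")
  }
}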
Use the pins package to register a board_url. The pins package works by downloading the file and storing it in a local cache. Whenever it is called, it checks the original URL to see whether there were any changes. If there weren't, it uses the copy it already has; if there is no Internet connection, it also falls back to the cached copy. As an example, we use the pins package in our covidmx package to update COVID-19 data from the Internet.
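A small sketch of that pattern (the URL is made up):

library(pins)
# board_url() takes a named vector of URLs to pin
board <- board_url(c(mydata = "https://example.com/mydata.rds"))
# first call downloads and caches; later calls reuse the cache when the
# file is unchanged or when there is no Internet connection
path <- pin_download(board, "mydata")
mydata <- readRDS(path)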