 

Package that downloads data from the internet during installation

Tags: r, packaging

Is anyone aware of a package that downloads a dataset from the internet during installation, then prepares and saves it so that it is available when the package is loaded with library(packageName)? Are there any drawbacks to this approach (besides the obvious one that package installation will fail if the data source is unavailable or the data format has changed)?

EDIT: Some background. The data consists of three tab-separated files in a ZIP archive, owned by a federal statistics office and generally freely accessible. I have R code that downloads, extracts, and prepares the data; in the end, three data frames are created, which could be saved in .RData format.

I am thinking about creating two packages: A "data" package that provides the data, and a "code" package that operates on it.
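For illustration, the preparation step described in the edit might look roughly like the sketch below. The file-name pattern and the function names read_tsv_files() and prepare_data() are placeholders chosen for this example, not the actual code referred to above, and the URL passed to prepare_data() would have to be the real data source.

```r
# Read every tab-separated file in `dir` into a named list of data frames.
# The .txt/.tsv pattern is an assumption about how the files are named.
read_tsv_files <- function(dir) {
  files <- list.files(dir, pattern = "\\.(txt|tsv)$", full.names = TRUE)
  data <- lapply(files, utils::read.delim, stringsAsFactors = FALSE)
  names(data) <- tools::file_path_sans_ext(basename(files))
  data
}

# Download the ZIP archive, extract it, and save the data frames as .RData.
prepare_data <- function(url, out_file) {
  zip_file <- tempfile(fileext = ".zip")
  ex_dir <- tempfile("extracted")
  # mode = "wb" keeps the binary archive intact on Windows
  utils::download.file(url, destfile = zip_file, mode = "wb")
  utils::unzip(zip_file, exdir = ex_dir)
  data <- read_tsv_files(ex_dir)
  # Each data frame is saved under the name of the file it came from
  save(list = names(data), envir = list2env(data), file = out_file)
  invisible(out_file)
}
```

The resulting .RData file could then be placed in the data/ directory of the "data" package.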

krlmlr asked Feb 14 '13 09:02



1 Answer

I put this mockup together while you were posting your edit. I presume it would work, but I have not tested it. I've commented it so you can see what you would need to change. The idea is to check whether an expected object is available in the current workspace. If it is not, check whether the file containing the data is in the current working directory. If that is not found either, prompt the user to download the file, then proceed from there.

myFunction <- function(this, that, dataset = NULL) {

  # We're giving the user a chance to specify the dataset.
  #   Maybe they have already downloaded it and saved it.
  if (is.null(dataset)) {

    # Check to see if the object is already in the workspace.
    # If it is not, check to see whether the .RData file that
    #   contains the object is in the current working directory.
    if (!exists("OBJECTNAME", where = 1)) {
      if (file.exists("DATAFILE.RData")) {
        load("DATAFILE.RData")

      } else {
        # If neither of those is successful, prompt the user
        #   to download the dataset.
        ans <- readline(paste0(
          "DATAFILE.RData dataset not found in working directory.\n",
          "OBJECTNAME object not found in workspace.\n",
          "Download and load the dataset now? (y/n) "))
        if (ans != "y")
          return(invisible())

        # I usually use RCurl in case the URL is https
        if (!requireNamespace("RCurl", quietly = TRUE))
          stop("Package 'RCurl' is required to download the dataset.")
        baseURL <- "http://some/base/url/"

        # Here, we actually download the data
        temp <- RCurl::getBinaryURL(paste0(baseURL, "DATAFILE.RData"))

        # Here we load the data from the raw vector into the workspace
        con <- rawConnection(temp)
        load(con, envir = .GlobalEnv)
        close(con)
        message("OBJECTNAME data downloaded from\n",
                paste0(baseURL, "DATAFILE.RData"), "\n",
                "and added to your workspace\n")
      }
    }
    dataset <- OBJECTNAME
  }
  TEMP <- dataset
  ## Other fun stuff with TEMP, this, and that.
}
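A variation on the mockup above, as a minimal sketch: it keeps the "use the local file, otherwise download it" logic but loads the data into a private environment instead of the global workspace and returns the object. The function name load_cached_dataset() and its arguments are hypothetical, and it uses base R's download.file() rather than RCurl, which is an assumption about what suits your setup.

```r
# Hypothetical helper: return the object `object_name` stored in a local
# .RData file, downloading the file from `url` first if it is missing.
load_cached_dataset <- function(object_name, file, url = NULL) {
  if (!file.exists(file)) {
    if (is.null(url)) {
      stop("'", file, "' not found and no download URL given.")
    }
    # mode = "wb" keeps the binary file intact on Windows
    utils::download.file(url, destfile = file, mode = "wb")
  }
  # Load into a private environment so the user's workspace is untouched
  env <- new.env()
  load(file, envir = env)
  if (!exists(object_name, envir = env, inherits = FALSE)) {
    stop("'", object_name, "' not found in '", file, "'.")
  }
  get(object_name, envir = env, inherits = FALSE)
}
```

Inside myFunction(), the dataset could then be obtained with something like dataset <- load_cached_dataset("OBJECTNAME", "DATAFILE.RData", url), with the prompt kept in front of the call if you still want to ask before downloading.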

Two packages, hosted on GitHub

Here's another approach, building on the comments between @juba and me. The basic concept is to have, as you describe, one package for the code and one for the data. This function would be part of the package that contains your code. It will:

  1. Check to see if the data package is installed.
  2. Check to see if the version of the data package you have installed matches the version on GitHub, which we are going to assume is the most up-to-date version.

If either check fails, it asks the user whether they want to update their installation of the package. For demonstration, I've linked to one of my packages in progress on GitHub. This should give you an idea of what you need to substitute to get it to work with your own package once you've hosted it there.

CheckVersionFirst <- function() {
  # Check to see if installed
  if (!"StataDCTutils" %in% installed.packages()[, 1]) {
    Checks <- "Failed"
  } else {
    # Compare version numbers
    require(RCurl)
    temp <- getURL("https://raw.github.com/mrdwab/StataDCTutils/master/DESCRIPTION")
    CurrentVersion <- gsub("^\\s|\\s$", "", 
                           gsub(".*Version:(.*)\\nDate.*", "\\1", temp))
    # Treat an installed version that is equal to or newer than the
    #   GitHub version as up to date, so that Checks is always defined.
    if (packageVersion("StataDCTutils") >= CurrentVersion) {
      Checks <- "Passed"
    } else {
      Checks <- "Failed"
    }
  }

  switch(
    Checks,
    Passed = { message("Everything looks OK! Proceeding!") },
    Failed = {
      ans <- readline(
        "StataDCTutils is either outdated or not installed. Update now? (y/n) ")
      if (ans != "y")
        return(invisible())
      require(devtools)
      # Current versions of devtools take the repository as "user/repo"
      install_github("mrdwab/StataDCTutils")
    })
# Some cool things you want to do after you are sure the data is there
}

Try it out with CheckVersionFirst().

Note: This works only if you religiously remember to update the version number in your DESCRIPTION file every time you push a new version of the data to GitHub!
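Because the whole check hinges on the Version field, here is a minimal sketch of a more forgiving way to read it from the DESCRIPTION text. It uses base R's read.dcf() instead of the regular expression above, so it does not depend on a Date field following Version. The helper name description_version() and the sample DESCRIPTION text are made up for illustration.

```r
# Hypothetical helper: extract the Version field from the text of a
# DESCRIPTION file (one string, or a character vector of lines).
description_version <- function(description_text) {
  lines <- unlist(strsplit(description_text, "\r?\n"))
  con <- textConnection(lines)
  on.exit(close(con))
  fields <- read.dcf(con, fields = "Version")
  package_version(fields[1, "Version"])
}

# Made-up DESCRIPTION content standing in for the text fetched from GitHub
desc <- "Package: StataDCTutils\nVersion: 0.2.1\nDate: 2013-02-14\n"
remote <- description_version(desc)

# package_version objects compare component by component, not as strings
remote > package_version("0.2")
remote == "0.2.1"
```

The result can be compared directly against packageVersion("StataDCTutils") in place of CurrentVersion.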

So, to clarify/recap/expand, the basic idea would be to:

  • Periodically push the updated version of your data package to GitHub, being sure to change the version number of the data package in its DESCRIPTION file when you do so.
  • Call CheckVersionFirst() from the .onLoad hook of your code package (after modifying the function to match your account and package name).
  • Change the commented line that reads # Some cool things you want to do after you are sure the data is there to reflect the cool things you actually want to do, which would probably start with library(YOURDATAPACKAGE) to load the data.
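As a sketch of that last step, the code package could also fetch a dataset from the data package without attaching it via library(). The helper name get_package_data() is hypothetical, and the example call uses the built-in datasets package only because YOURDATAPACKAGE does not exist yet.

```r
# Hypothetical helper: return one dataset from an installed data package
# without attaching the package to the search path.
get_package_data <- function(name, package) {
  if (!requireNamespace(package, quietly = TRUE)) {
    stop("Package '", package, "' is not installed.")
  }
  # Load the dataset into a private environment, then hand it back
  env <- new.env()
  utils::data(list = name, package = package, envir = env)
  get(name, envir = env, inherits = FALSE)
}

# Example with a dataset that ships with R; for the real thing you would
# pass your own dataset name and "YOURDATAPACKAGE" instead
cars_data <- get_package_data("mtcars", "datasets")
```

This keeps the user's workspace and search path unchanged, which tends to matter once the code package is used inside other functions or packages.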
A5C1D2H2I1M1N2O1R2T1 answered Sep 29 '22 17:09
