Column-wise storage in the inst/extdata directory of a package, as suggested by Jan, is now implemented in the dfunbind package.
I'm using the data-raw idiom to make entire analyses, from the raw data to the results, reproducible. For this, datasets are first wrapped in R packages which can then be loaded with library().
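(Roughly, the idiom looks like this; a minimal sketch where the file, object, and package names are only placeholders:)

# data-raw/mydata.R -- turns the raw file into a packaged data set
# ("mydata" and the CSV path are placeholders for illustration).
mydata <- read.csv("data-raw/mydata.csv")
devtools::use_data(mydata, overwrite = TRUE)

# The analysis scripts then just call library(mydatapackage) and refer to
# mydata directly.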
One of the datasets I'm using is largish, around 8 million observations with about 80 attributes. For my current analysis I only need a small fraction of the attributes, but I'd like to package the entire dataset anyway.
Now, if it is simply packaged as a data frame (e.g., with devtools::use_data()), it will be loaded in its entirety when it is first accessed. What would be the best approach to package this kind of data so that I can lazy-load it at the column level? (Only those columns which I'm actually accessing are loaded; the others happily stay on disk and don't occupy RAM.) Would the ff package help? Can anyone point me to a working example?
I think I would store the data in inst/extdata. Then create a couple of functions in your package that can read and return parts of that data. In your functions you can get the path to your data using system.file("extdata", "yourfile", package = "yourpackage") (as on the page you linked to).
The question then is in what format you store your data and how you obtain selections from it without reading all of the data into memory. For that, there are a large number of options. To name some:

- Store your data in an sqlite database and perform queries on it using the RSQLite package (see the sketch after this list).
- Store your data in ff objects (e.g. save using the save.ffdf function from ffbase; use load.ffdf to load again). ff doesn't handle character fields well (they are always converted to factors), and in theory the files are not cross-platform, although as long as you stay on Intel platforms you should be OK.
- Store your data in flat files and access them using the LaF package. The performance will probably be less than with ff, but it might be good enough.
- Store each column in a separate file (e.g. using saveRDS) and load them using readRDS. The advantage is that you do not depend on any R packages. This is fast. The disadvantage is that you cannot do row selections (but that does not seem to be needed in your case).

If you only want to select columns, I would go with RDS.
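For the sqlite route, the rough idea is the following (just a sketch; the file name bigdata.sqlite, the table name bigdata, and the package name yourpackage are placeholders):

library(DBI)
library(RSQLite)

# One-time step while preparing the package source: write the big data
# frame into an SQLite file under inst/extdata (iris stands in for the
# real 8-million-row data set).
con <- dbConnect(RSQLite::SQLite(), "inst/extdata/bigdata.sqlite")
dbWriteTable(con, "bigdata", iris)
dbDisconnect(con)

# At analysis time: locate the file inside the installed package and
# read only the columns you need.
db <- system.file("extdata", "bigdata.sqlite", package = "yourpackage")
con <- dbConnect(RSQLite::SQLite(), db)
part <- dbGetQuery(con, 'SELECT "Sepal.Width", "Species" FROM bigdata')
dbDisconnect(con)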
The following code creates an example package containing the iris data set:
load_data <- function(dataset, columns) {
  # Read the requested columns of 'dataset' from the installed package's
  # extdata directory and return them as a data frame.
  result <- vector("list", length(columns))
  for (i in seq_along(columns)) {
    col <- columns[i]
    fn <- system.file("extdata", dataset, paste0(col, ".RDS"), package = "lazyload")
    result[[i]] <- readRDS(fn)
  }
  names(result) <- columns
  as.data.frame(result)
}
store_data <- function(package, name, data) {
  # Save each column of 'data' as a separate .RDS file under
  # <package>/inst/extdata/<name>/ in the package source tree.
  dir <- file.path(package, "inst", "extdata", name)
  dir.create(dir, recursive = TRUE)
  for (col in names(data)) {
    saveRDS(data[[col]], file.path(dir, paste0(col, ".RDS")))
  }
}
packagename <- "lazyload"
package.skeleton(packagename, "load_data")
store_data(packagename, "iris", iris)
After building and installing the package (you'll need to fix the documentation, e.g. delete it) you can do:
library(lazyload)
data <- load_data("iris", "Sepal.Width")
To load the Sepal.Width
column of the iris data set.
Of course, this is a very simple implementation of load_data: there is no error handling, it assumes all requested columns exist, it does not know which columns exist, and it does not know which data sets exist.
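A possible extension, as a rough sketch (list_columns and load_data2 are just illustrative names, not part of the package above), could check what is available and fail cleanly:

list_columns <- function(dataset, package = "lazyload") {
  # List the column files shipped for a given data set.
  dir <- system.file("extdata", dataset, package = package)
  if (dir == "") stop("Unknown data set: ", dataset)
  sub("\\.RDS$", "", list.files(dir, pattern = "\\.RDS$"))
}

load_data2 <- function(dataset, columns, package = "lazyload") {
  # Like load_data, but reports unknown columns instead of failing obscurely.
  available <- list_columns(dataset, package)
  missing <- setdiff(columns, available)
  if (length(missing) > 0)
    stop("Unknown column(s): ", paste(missing, collapse = ", "))
  result <- lapply(columns, function(col) {
    readRDS(system.file("extdata", dataset, paste0(col, ".RDS"),
                        package = package))
  })
  names(result) <- columns
  as.data.frame(result)
}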