Package a large data set

Tags: r, devtools

Update: Column-wise storage in the inst/extdata directory of a package, as suggested by Jan in his answer below, is now implemented in the dfunbind package.

I'm using the data-raw idiom to make entire analyses from the raw data to the results reproducible. For this, datasets are first wrapped in R packages which can then be loaded with library().

One of the datasets I'm using is largish, around 8 million observations with about 80 attributes. For my current analysis I only need a small fraction of the attributes, but I'd like to package the entire dataset anyway.

Now, if it is simply packaged as a data frame (e.g., with devtools::use_data()), it will be loaded in its entirety when first accessing it. What would be the best approach to package this kind of data so that I can lazy-load at the column level? (Only those columns which I'm actually accessing are loaded, the others happily stay on disk and don't occupy RAM.) Would the ff package help? Can anyone point me to a working example?

asked Nov 05 '14 by krlmlr




1 Answer

I think I would store the data in inst/extdata. Then create a couple of functions in your package that read and return parts of that data. In your functions you can get the path to your data using system.file("extdata", "yourfile", package = "yourpackage") (as on the page you linked to).

The question then is what format to store your data in, and how to obtain selections from it without reading all of the data into memory. There are a number of options; to name some:

  • SQLite: store your data in an SQLite database. You can then perform queries on this data using the RSQLite package (a sketch follows after this list).
  • ff: store your data in ff objects (e.g. save using the save.ffdf function from ffbase; use load.ffdf to load again). ff doesn't handle character fields well (they are always converted to factors). In theory the files are not cross-platform, although as long as you stay on Intel platforms you should be OK.
  • CSV: store your data in a plain old CSV file. You can then make selections from this file using the LaF package. The performance will probably be less than with ff, but might be good enough.
  • RDS: store each of your columns in a separate RDS file (using saveRDS) and load them using readRDS. The advantage is that you do not depend on any R packages, and it is fast. The disadvantage is that you cannot do row selections (but it does not seem you need those).
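
For example, here is a minimal sketch of the SQLite route (not from the original answer; the file and table names are illustrative, and iris stands in for the real data):

library(DBI)  # RSQLite implements the DBI interface

# One-time setup: write the data set into an SQLite file shipped in inst/extdata
dir.create("inst/extdata", recursive = TRUE, showWarnings = FALSE)
con <- dbConnect(RSQLite::SQLite(), "inst/extdata/mydata.sqlite")
dbWriteTable(con, "observations", iris)
dbDisconnect(con)

# In the package: pull only the columns (and rows) you need
db <- system.file("extdata", "mydata.sqlite", package = "yourpackage")
con <- dbConnect(RSQLite::SQLite(), db)
sepal <- dbGetQuery(con, 'SELECT "Sepal.Width" FROM observations')
dbDisconnect(con)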

If you only want to select columns, I would go with RDS.

A rough example using RDS

The following code creates an example package containing the iris data set:

load_data <- function(dataset, columns) {
  # Read each requested column from its own RDS file in inst/extdata/<dataset>/
  result <- vector("list", length(columns))
  for (i in seq_along(columns)) {
    col <- columns[i]
    fn <- system.file("extdata", dataset, paste0(col, ".RDS"), package = "lazyload")
    result[[i]] <- readRDS(fn)
  }
  names(result) <- columns
  as.data.frame(result)
}

store_data <- function(package, name, data) {
  # Write each column of 'data' to its own RDS file under inst/extdata/<name>/
  dir <- file.path(package, "inst", "extdata", name)
  dir.create(dir, recursive = TRUE)
  for (col in names(data)) {
    saveRDS(data[[col]], file.path(dir, paste0(col, ".RDS")))
  }
}

packagename <- "lazyload"
package.skeleton(packagename, "load_data")  # create the package skeleton
store_data(packagename, "iris", iris)       # write iris column by column into inst/extdata

After building and installing the package (you'll need to fix the documentation generated by package.skeleton, e.g. delete it) you can do:

library(lazyload)
data <- load_data("iris", "Sepal.Width")

This loads only the Sepal.Width column of the iris data set.

Of course, this is a very simple implementation of load_data: there is no error handling, it assumes all requested columns exist, and it has no way of knowing which columns or data sets are available.
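
As a rough illustration (not part of the original answer), those checks could look something like the following; load_data_checked and list_columns are hypothetical names, assuming the same one-RDS-file-per-column layout as above:

# Variant of load_data with basic error handling
load_data_checked <- function(dataset, columns) {
  result <- vector("list", length(columns))
  for (i in seq_along(columns)) {
    col <- columns[i]
    fn <- system.file("extdata", dataset, paste0(col, ".RDS"),
                      package = "lazyload")
    # system.file() returns "" when the requested file does not exist
    if (fn == "") stop("No column '", col, "' in data set '", dataset, "'")
    result[[i]] <- readRDS(fn)
  }
  names(result) <- columns
  as.data.frame(result)
}

# Hypothetical helper: list the columns available for a data set
list_columns <- function(dataset) {
  dir <- system.file("extdata", dataset, package = "lazyload")
  sub('\\.RDS$', '', list.files(dir, pattern = '\\.RDS$'))
}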

answered by Jan van der Laan