Column-wise storage in the inst/extdata directory of a package, as suggested by Jan, is now implemented in the dfunbind package.
I'm using the data-raw idiom to make entire analyses, from the raw data to the results, reproducible. For this, datasets are first wrapped in R packages which can then be loaded with library().
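(Roughly, the idiom looks like this; a minimal sketch where the file, object, and package names are only placeholders:)

# data-raw/mydata.R -- turns the raw file into a packaged data set
# ("mydata" and the CSV path are placeholders for illustration).
mydata <- read.csv("data-raw/mydata.csv")
devtools::use_data(mydata, overwrite = TRUE)

# The analysis scripts then just call library(mydatapackage) and refer to
# mydata directly.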
One of the datasets I'm using is largish, around 8 million observations with about 80 attributes. For my current analysis I only need a small fraction of the attributes, but I'd like to package the entire dataset anyway.
Now, if it is simply packaged as a data frame (e.g., with devtools::use_data()), it will be loaded in its entirety when it is first accessed. What would be the best approach to package this kind of data so that I can lazy-load it at the column level? (Only those columns which I'm actually accessing are loaded; the others happily stay on disk and don't occupy RAM.) Would the ff package help? Can anyone point me to a working example?
I think I would store the data in inst/extdata. Then create a couple of functions in your package that can read and return parts of that data. In your functions you can get the path to your data using system.file("extdata", "yourfile", package = "yourpackage") (as on the page you linked to).
The question then is in what format you store your data and how you obtain selections from it without reading all of the data into memory. For that, there are a large number of options. To name some:

- Store your data in an sqlite database and perform queries on it using the RSQLite package (see the sketch after this list).
- Store your data in ff objects (e.g. save using the save.ffdf function from ffbase; use load.ffdf to load again). ff doesn't handle character fields well (they are always converted to factors), and in theory the files are not cross-platform, although as long as you stay on Intel platforms you should be OK.
- Store your data in flat files and access them using the LaF package. The performance will probably be less than with ff, but it might be good enough.
- Store each column in a separate file (e.g. using saveRDS) and load them using readRDS. The advantage is that you do not depend on any R packages. This is fast. The disadvantage is that you cannot do row selections (but that does not seem to be needed in your case).

If you only want to select columns, I would go with RDS.
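For the sqlite route, the rough idea is the following (just a sketch; the file name bigdata.sqlite, the table name bigdata, and the package name yourpackage are placeholders):

library(DBI)
library(RSQLite)

# One-time step while preparing the package source: write the big data
# frame into an SQLite file under inst/extdata (iris stands in for the
# real 8-million-row data set).
con <- dbConnect(RSQLite::SQLite(), "inst/extdata/bigdata.sqlite")
dbWriteTable(con, "bigdata", iris)
dbDisconnect(con)

# At analysis time: locate the file inside the installed package and
# read only the columns you need.
db <- system.file("extdata", "bigdata.sqlite", package = "yourpackage")
con <- dbConnect(RSQLite::SQLite(), db)
part <- dbGetQuery(con, 'SELECT "Sepal.Width", "Species" FROM bigdata')
dbDisconnect(con)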
The following code creates an example package containing the iris data set:
load_data <- function(dataset, columns) {
  # Read the requested columns of 'dataset' from the installed package's
  # extdata directory and return them as a data frame.
  result <- vector("list", length(columns))
  for (i in seq_along(columns)) {
    col <- columns[i]
    fn <- system.file("extdata", dataset, paste0(col, ".RDS"), package = "lazyload")
    result[[i]] <- readRDS(fn)
  }
  names(result) <- columns
  as.data.frame(result)
}
store_data <- function(package, name, data) {
  # Save each column of 'data' as a separate .RDS file under
  # <package>/inst/extdata/<name>/ in the package source tree.
  dir <- file.path(package, "inst", "extdata", name)
  dir.create(dir, recursive = TRUE)
  for (col in names(data)) {
    saveRDS(data[[col]], file.path(dir, paste0(col, ".RDS")))
  }
}
packagename <- "lazyload"
package.skeleton(packagename, "load_data")
store_data(packagename, "iris", iris)
After building and installing the package (you'll need to fix the documentation, e.g. delete it) you can do:
library(lazyload)
data <- load_data("iris", "Sepal.Width")
To load the Sepal.Width
column of the iris data set.
Of course, this is a very simple implementation of load_data: there is no error handling, it assumes all requested columns exist, it does not know which columns exist, and it does not know which data sets exist.
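A possible extension, as a rough sketch (list_columns and load_data2 are just illustrative names, not part of the package above), could check what is available and fail cleanly:

list_columns <- function(dataset, package = "lazyload") {
  # List the column files shipped for a given data set.
  dir <- system.file("extdata", dataset, package = package)
  if (dir == "") stop("Unknown data set: ", dataset)
  sub("\\.RDS$", "", list.files(dir, pattern = "\\.RDS$"))
}

load_data2 <- function(dataset, columns, package = "lazyload") {
  # Like load_data, but reports unknown columns instead of failing obscurely.
  available <- list_columns(dataset, package)
  missing <- setdiff(columns, available)
  if (length(missing) > 0)
    stop("Unknown column(s): ", paste(missing, collapse = ", "))
  result <- lapply(columns, function(col) {
    readRDS(system.file("extdata", dataset, paste0(col, ".RDS"),
                        package = package))
  })
  names(result) <- columns
  as.data.frame(result)
}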