
Bind data.frames row-wise in R without creating copies

I have a large list of data.frames that need to be bound pairwise by columns and then by rows prior to being fed into a predictive model. As no values will be modified, I would like to have the final data.frame pointing to the original data.frames in my list.

For example:

library(pryr)

#individual dataframes
df1 <- data.frame(a=1:1e6+0, b=1:1e6+1)
df2 <- data.frame(a=1:1e6+2, b=1:1e6+3)
df3 <- data.frame(a=1:1e6+4, b=1:1e6+5)

#each occupies 16 MB
object_size(df1)  # 16 MB
object_size(df2)  # 16 MB
object_size(df3)  # 16 MB
object_size(df1, df2, df3)  # 48 MB

#will be in a named list
dfs <- list(df1=df1, df2=df2, df3=df3)

#putting into list doesn't create a copy
object_size(df1, df2, df3, dfs)  #48MB

Final data.frame will have this orientation (every unique pair of data.frames bound by columns, then pairs bound by rows):

df1, df2
df1, df3
df2, df3

I currently implement this as follows:

#generate unique df combinations
df_names <- names(dfs)
pairs <- combn(df_names, 2, simplify=FALSE)

#bind dfs by columns
combo_dfs <- lapply(pairs, function(x) cbind(dfs[[x[1]]], dfs[[x[2]]]))

#no copies created yet
object_size(dfs, combo_dfs)  # 48MB

#bind dfs by rows
combo_df <- do.call(rbind, combo_dfs)

#now data gets copied
object_size(combo_df)  # 96 MB
object_size(dfs, combo_df)  # 144 MB
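
To make the resulting layout concrete, here is a minimal sketch of the same steps on toy data. The two-row data frames and their values are made up for illustration; only the `combn`, `cbind`, and `rbind` steps are taken from the code above.

```r
# Toy stand-ins for the million-row data frames (values are illustrative)
dfs <- list(
  df1 = data.frame(a = c(1, 2), b = c(11, 12)),
  df2 = data.frame(a = c(3, 4), b = c(13, 14)),
  df3 = data.frame(a = c(5, 6), b = c(15, 16))
)

# Every unique pair of names: (df1, df2), (df1, df3), (df2, df3)
pairs <- combn(names(dfs), 2, simplify = FALSE)

# Bind each pair by columns, then stack the pairs by rows
combo_dfs <- lapply(pairs, function(x) cbind(dfs[[x[1]]], dfs[[x[2]]]))
combo_df <- do.call(rbind, combo_dfs)

dim(combo_df)  # 3 pairs x 2 rows, 2 + 2 columns
```

Each input appears in two of the three pairs, so the stacked result holds every value twice, which matches the 96 MB versus 48 MB figures above.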

How can I avoid copying my data but still achieve the same end result?

asked Apr 26 '16 16:04 by alexvpickering


1 Answer

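Storing the values the way you hope to would require R to keep the data frame in some compressed form, with each repeated block stored only once. I don't believe data frames support that: each column of a data frame is a single vector, so binding by rows has to allocate new, longer vectors and copy the values into them.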

If your motivation for wanting to store the data this way is difficulty fitting it in memory, you could try the ff package. This would allow you to store it in a more compact way on disk. The ffdf class seems to have the properties you need:

By default, creating an 'ffdf' object will NOT create new ff files, instead existing files are referenced. This differs from data.frame, which always creates copies of the input objects, most notably in data.frame(matrix()), where an input matrix is converted to single columns. ffdf by contrast, will store an input matrix physically as the same matrix and virtually map it to columns.

In addition the ff package is optimized for fast access.

Note that I haven't used this package myself so I can't guarantee it will solve your problem.
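
If the model can consume the data one block at a time, a different route (base R only, not the ff API) is to skip the row-binding step and build each column-bound pair only when it is needed. The following is a rough sketch under that assumption; `get_block` is a hypothetical helper and the toy values are made up for illustration.

```r
# Toy stand-ins for the large data frames in the question
dfs <- list(
  df1 = data.frame(a = c(1, 2), b = c(11, 12)),
  df2 = data.frame(a = c(3, 4), b = c(13, 14)),
  df3 = data.frame(a = c(5, 6), b = c(15, 16))
)
pairs <- combn(names(dfs), 2, simplify = FALSE)

# Hypothetical helper: build the column-bound block for pair i on request
get_block <- function(i) {
  p <- pairs[[i]]
  cbind(dfs[[p[1]]], dfs[[p[2]]])
}

# Visit the blocks one at a time; the full row-bound frame is never built
block_rows <- vapply(seq_along(pairs), function(i) nrow(get_block(i)), integer(1))
```

This only helps when the downstream code accepts chunks; a model that needs the whole table as a single data.frame would still force the copy.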

answered Sep 24 '22 02:09 by Sean Mullane