I have 2 scripts that do exactly the same thing.
But one script produces 3 RData files that weigh 82.7 KB in total, and the other script creates 3 RData files that weigh 120 KB in total.
The first one is without parallel:
library("plyr")
ddply(.data = iris,
.variables = "Species",
##.parallel=TRUE,##Without parallel
.fun = function(SpeciesData){
#Create Simple Model -------------------------------------------------------------
Model <- lm(formula = "Sepal.Length~Sepal.Width+Petal.Length+Petal.Width",data = SpeciesData)
#Save The Model -------------------------------------------------------------
save(Model,
compress = FALSE,
file = gsub(x = "Species.RData",
pattern = "Species",
replacement = unique(SpeciesData$Species)))
})
The second is with parallel:
library("plyr")
doSNOW::registerDoSNOW(cl<-snow::makeCluster(3))
ddply(.data = iris,
.variables = "Species",
.parallel=TRUE,##With parallel
.fun = function(SpeciesData){
#Create Simple Model -------------------------------------------------------------
Model <- lm(formula = "Sepal.Length~Sepal.Width+Petal.Length+Petal.Width",data = SpeciesData)
#Save The Model -------------------------------------------------------------
save(Model,
compress = FALSE,
file = gsub(x = "Species.RData",
pattern = "Species",
replacement = unique(SpeciesData$Species)))
})
snow::stopCluster(cl)
The second script creates files that weigh about 45% more (120 KB vs. 82.7 KB).
How can I save the files in parallel without this increase in file size?
I have not used ddply to parallelize saving objects, but I guess the files get much larger because when you save a model object, it also carries some information about the environment from which it was saved.
So using your ddply code above, the sizes I have are:
sapply(dir(pattern="RData"),file.size)
setosa.RData versicolor.RData virginica.RData
36002 36002 36002
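One way to see what is riding along is to load one of the files and inspect the environment attached to the model's terms. This is a minimal sketch, assuming setosa.RData is one of the files produced by the ddply code above; lm() keeps a reference to the calling frame in the terms object, so save() has to serialize that environment too:

load("setosa.RData")             # restores `Model`
env <- environment(Model$terms)  # environment serialized along with the fit
ls(env)                          # e.g. SpeciesData (whatever .fun had in scope)
file.size("setosa.RData")        # compare across backends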
There are two options. One is to use purrr / furrr:
library(furrr)
library(purrr)

func <- function(SpeciesData) {
  Model <- lm(formula = "Sepal.Length~Sepal.Width+Petal.Length+Petal.Width",
              data = SpeciesData)
  save(Model,
       compress = FALSE,
       file = gsub(x = "Species.RData",
                   pattern = "Species",
                   replacement = unique(SpeciesData$Species)))
}

split(iris, iris$Species) %>% future_map(func)
sapply(dir(pattern="RData"),file.size)
setosa.RData versicolor.RData virginica.RData
25426 27156 27156
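Note that future_map() only actually runs in parallel once you set a plan; otherwise it falls back to sequential execution. A minimal sketch (workers = 3 is an assumption, chosen to mirror the three-node SNOW cluster above):

library(future)
plan(multisession, workers = 3)                # parallel workers for future_map()
split(iris, iris$Species) %>% future_map(func)
plan(sequential)                               # restore the default backend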
The other is to use saveRDS (still with ddply), since you only have one object to save:
ddply(.data = iris,
      .variables = "Species",
      .parallel = TRUE, ## with parallel
      .fun = function(SpeciesData) {
        Model <- lm(formula = "Sepal.Length~Sepal.Width+Petal.Length+Petal.Width",
                    data = SpeciesData)
        saveRDS(Model,
                gsub(x = "Species.rds",
                     pattern = "Species",
                     replacement = unique(SpeciesData$Species)))
      })
sapply(dir(pattern="rds"),file.size)
setosa.rds versicolor.rds virginica.rds
6389 6300 6277
You then use readRDS instead of load to read the file back:
m1 = readRDS("setosa.rds")
m1
Call:
lm(formula = "Sepal.Length~Sepal.Width+Petal.Length+Petal.Width",
data = SpeciesData)
Coefficients:
(Intercept) Sepal.Width Petal.Length Petal.Width
2.3519 0.6548 0.2376 0.2521
We can compare the coefficients with those from the rda object:
m2 = get(load("setosa.RData"))
m2
Call:
lm(formula = "Sepal.Length~Sepal.Width+Petal.Length+Petal.Width",
data = SpeciesData)
Coefficients:
(Intercept) Sepal.Width Petal.Length Petal.Width
2.3519 0.6548 0.2376 0.2521
The objects are not identical because of the environment parts, but for prediction and the other things we normally use a model for, they behave the same:
identical(predict(m1,data.frame(iris[1:10,])),predict(m2,data.frame(iris[1:10,])))
As others mentioned, some small amount of information about the environment is probably being saved along with the models; you normally wouldn't notice it, but here the files are so small that it stands out.
If you're just interested in file size, try saving the models into a single list and then save that into one file. ddply
can only handle a data.frame as a result from the function, so we have to use dlply
instead to tell it to store the results in a list. Doing this saved to just one file that was 60k.
Here's an example of what I'm talking about:
library("plyr")
doSNOW::registerDoSNOW(cl<-snow::makeCluster(3))
models<-dlply(.data = iris,
.variables = "Species",
.parallel=TRUE,##With parallel
.fun = function(SpeciesData){
#Create Simple Model -------------------------------------------------------------
lm(formula = Sepal.Length~Sepal.Width+Petal.Length+Petal.Width, data = SpeciesData)
})
snow::stopCluster(cl)
save(models, compress= FALSE, file= 'combined_models')
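To use the models later, load the file and index the list by species; dlply names the list elements after the levels of the split variable. A minimal sketch:

load('combined_models')   # restores the list as `models`
names(models)             # "setosa" "versicolor" "virginica"
summary(models$setosa)    # each element is an ordinary lm fit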