How can I save files in parallel without automatically increasing the file size?

Tags:

r

plyr

I have two scripts that do exactly the same thing.

But one script produces 3 RData files weighing 82.7 KB, while the other produces 3 RData files weighing 120 KB.

The first one runs without parallelism:

library("plyr")
ddply(.data = iris,
      .variables = "Species",
      ##.parallel=TRUE,##Without parallel
      .fun = function(SpeciesData){

      #Create Simple Model -------------------------------------------------------------  
      Model <- lm(formula = "Sepal.Length~Sepal.Width+Petal.Length+Petal.Width",data = SpeciesData)

      #Save The Model -------------------------------------------------------------               
       save(Model,
            compress = FALSE,
            file = gsub(x =  "Species.RData",
                        pattern = "Species",
                        replacement = unique(SpeciesData$Species)))

 })

The second runs with parallelism:

library("plyr")
doSNOW::registerDoSNOW(cl<-snow::makeCluster(3))
ddply(.data = iris,
      .variables = "Species",
      .parallel=TRUE,##With parallel
      .fun = function(SpeciesData){

      #Create Simple Model -------------------------------------------------------------  
      Model <- lm(formula = "Sepal.Length~Sepal.Width+Petal.Length+Petal.Width",data = SpeciesData)

      #Save The Model -------------------------------------------------------------               
       save(Model,
            compress = FALSE,
            file = gsub(x =  "Species.RData",
                        pattern = "Species",
                        replacement = unique(SpeciesData$Species)))

 })
snow::stopCluster(cl)

The second script creates files that weigh about 45% more.

How can I save files in parallel without automatically increasing the file size?

Asked Feb 16 '20 by Dima Ha


2 Answers

I have not used ddply to parallelize saving objects, but my guess is that the files get much larger because when you save a model object, it also carries some information about the environment from which it was saved (the sketch after the sizes below illustrates this).

So using your ddply code above, the sizes I get are:

sapply(dir(pattern="RData"),file.size)
setosa.RData versicolor.RData  virginica.RData 
       36002            36002            36002 
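
Here is a minimal sketch of environment capture (the function fit_in_env and the big_junk object are made up for illustration): a formula created inside a function keeps a reference to that function's environment, and save()/saveRDS() serialize everything reachable from it.

fit_in_env <- function(df) {
  big_junk <- rnorm(1e5) # ~800 KB of local data captured by the formula's environment
  lm(Sepal.Length ~ Sepal.Width, data = df)
}

m <- fit_in_env(iris)
saveRDS(m, "with_env.rds", compress = FALSE)

# Re-pointing the captured environments at baseenv() drops the junk
# (at the cost of losing whatever that environment contained).
environment(m$terms) <- baseenv()
environment(attr(m$model, "terms")) <- baseenv()
saveRDS(m, "without_env.rds", compress = FALSE)

sapply(c("with_env.rds", "without_env.rds"), file.size)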

There are two options. One is to use purrr / furrr:

library(furrr)
library(purrr)

func <- function(SpeciesData){
  Model <- lm(formula = "Sepal.Length~Sepal.Width+Petal.Length+Petal.Width",
              data = SpeciesData)
  save(Model,
       compress = FALSE,
       file = gsub(x = "Species.RData",
                   pattern = "Species",
                   replacement = unique(SpeciesData$Species)))
}

split(iris, iris$Species) %>% future_map(func)

sapply(dir(pattern="RData"),file.size)
    setosa.RData versicolor.RData  virginica.RData 
           25426            27156            27156
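
One caveat about the furrr route: future_map() runs sequentially until an execution plan is set, so to actually write the files in parallel you need a plan() call first. A short sketch using three background R sessions (plan() comes from the future package, which furrr loads):

library(future)

plan(multisession, workers = 3) # three background R sessions
split(iris, iris$Species) %>% future_map(func)
plan(sequential)                # back to sequential execution afterwards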

The other is to use saveRDS (still with ddply), since you only have one object to save:

ddply(.data = iris,
      .variables = "Species",
      .parallel = TRUE, ## with parallel
      .fun = function(SpeciesData){
        Model <- lm(formula = "Sepal.Length~Sepal.Width+Petal.Length+Petal.Width",
                    data = SpeciesData)
        saveRDS(Model,
                gsub(x = "Species.rds",
                     pattern = "Species",
                     replacement = unique(SpeciesData$Species)))
      })

sapply(dir(pattern="rds"),file.size)
    setosa.rds versicolor.rds  virginica.rds 
          6389           6300           6277 
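
To isolate how much of that difference comes from the file format itself rather than from what each worker's environment drags along, you can write the same object both ways and compare; a quick sketch (file names are arbitrary):

Model <- lm(Sepal.Length ~ Sepal.Width, data = iris)
save(Model, file = "model.RData", compress = FALSE) # workspace image: name + object
saveRDS(Model, "model.rds", compress = FALSE)       # single unnamed object
sapply(c("model.RData", "model.rds"), file.size)

For a single object the format overhead is small, so most of the gap above likely comes from the environments serialized alongside the models, not from save() versus saveRDS() as such.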

You then use readRDS instead of load to read the model back:

m1 = readRDS("setosa.rds")
m1
Call:
lm(formula = "Sepal.Length~Sepal.Width+Petal.Length+Petal.Width", 
    data = SpeciesData)

Coefficients:
 (Intercept)   Sepal.Width  Petal.Length   Petal.Width  
      2.3519        0.6548        0.2376        0.2521  

We can compare the coefficients with those from the RData object:

m2 = get(load("setosa.RData"))
m2

Call:
lm(formula = "Sepal.Length~Sepal.Width+Petal.Length+Petal.Width", 
    data = SpeciesData)

Coefficients:
 (Intercept)   Sepal.Width  Petal.Length   Petal.Width  
      2.3519        0.6548        0.2376        0.2521  

The objects are not identical because of their environment components, but for prediction and the other things we normally use a model for, they behave the same:

identical(predict(m1,data.frame(iris[1:10,])),predict(m2,data.frame(iris[1:10,])))
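
If you want to pinpoint where the two loaded objects differ, comparing their components directly helps; a sketch:

identical(coef(m1), coef(m2)) # TRUE: the fitted coefficients match exactly
identical(environment(m1$terms),
          environment(m2$terms)) # FALSE: each load reconstructs its own environment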

Answered Oct 21 '22 by StupidWolf

As others mentioned, there may be a small amount of information about the environment being saved with the files, which you probably wouldn't notice if the files weren't so small to begin with.

If you're mainly interested in file size, try collecting the models into a single list and then saving that list to one file. ddply can only handle a data.frame as the result of the function, so we have to use dlply instead, which stores the results in a list. Doing this saved everything to a single file of about 60 KB.

Here's an example of what I'm talking about:

library("plyr")
doSNOW::registerDoSNOW(cl<-snow::makeCluster(3))
models<-dlply(.data = iris,
      .variables = "Species",
      .parallel=TRUE,##With parallel
      .fun = function(SpeciesData){

        #Create Simple Model -------------------------------------------------------------  
        lm(formula = Sepal.Length~Sepal.Width+Petal.Length+Petal.Width, data = SpeciesData)
      })
snow::stopCluster(cl)

save(models, compress= FALSE, file= 'combined_models')
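
To read the combined file back, load() restores the models list; dlply names the list elements after the grouping values, so each model can be pulled out by species. A sketch:

load("combined_models")  # restores the `models` list
names(models)            # "setosa" "versicolor" "virginica"
predict(models[["setosa"]], iris[1:5, ])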

Answered Oct 21 '22 by Roger-123