I have 2 scripts that do exactly the same thing.
But one script produces 3 RData files that weigh 82.7 KB in total, and the other script creates 3 RData files that weigh 120 KB in total.
The first one is without parallel:
library("plyr")
ddply(.data = iris,
.variables = "Species",
##.parallel=TRUE,##Without parallel
.fun = function(SpeciesData){
#Create Simple Model -------------------------------------------------------------
Model <- lm(formula = "Sepal.Length~Sepal.Width+Petal.Length+Petal.Width",data = SpeciesData)
#Save The Model -------------------------------------------------------------
save(Model,
compress = FALSE,
file = gsub(x = "Species.RData",
pattern = "Species",
replacement = unique(SpeciesData$Species)))
})
The second is with parallel:
library("plyr")
doSNOW::registerDoSNOW(cl<-snow::makeCluster(3))
ddply(.data = iris,
.variables = "Species",
.parallel=TRUE,##With parallel
.fun = function(SpeciesData){
#Create Simple Model -------------------------------------------------------------
Model <- lm(formula = "Sepal.Length~Sepal.Width+Petal.Length+Petal.Width",data = SpeciesData)
#Save The Model -------------------------------------------------------------
save(Model,
compress = FALSE,
file = gsub(x = "Species.RData",
pattern = "Species",
replacement = unique(SpeciesData$Species)))
})
snow::stopCluster(cl)
The second script creates files that weigh about 45% more (120 KB vs. 82.7 KB).
How can I save the files in parallel without this increase in file size?
I have not used ddply to parallelize saving objects, but I guess the files get much larger because when you save a model object, it also carries some information about the environment from which it was saved.
So using your ddply code above, the sizes I have are:
sapply(dir(pattern="RData"),file.size)
setosa.RData versicolor.RData virginica.RData
36002 36002 36002
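One way to see what is riding along is to load one of the files and inspect the environment attached to the model's terms. This is a minimal sketch, assuming setosa.RData is one of the files produced by the ddply code above; lm() keeps a reference to the calling frame in the terms object, so save() has to serialize that environment too:

load("setosa.RData")             # restores `Model`
env <- environment(Model$terms)  # environment serialized along with the fit
ls(env)                          # e.g. SpeciesData (whatever .fun had in scope)
file.size("setosa.RData")        # compare across backends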
There are two options. One is to use purrr / furrr:
library(furrr)
library(purrr)

func <- function(SpeciesData) {
  Model <- lm(formula = "Sepal.Length~Sepal.Width+Petal.Length+Petal.Width",
              data = SpeciesData)
  save(Model,
       compress = FALSE,
       file = gsub(x = "Species.RData",
                   pattern = "Species",
                   replacement = unique(SpeciesData$Species)))
}

split(iris, iris$Species) %>% future_map(func)
sapply(dir(pattern="RData"),file.size)
setosa.RData versicolor.RData virginica.RData
25426 27156 27156
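Note that future_map() only actually runs in parallel once you set a plan; otherwise it falls back to sequential execution. A minimal sketch (workers = 3 is an assumption, chosen to mirror the three-node SNOW cluster above):

library(future)
plan(multisession, workers = 3)                # parallel workers for future_map()
split(iris, iris$Species) %>% future_map(func)
plan(sequential)                               # restore the default backend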
The other is to use saveRDS (still with ddply), since you only have one object to save:
ddply(.data = iris,
      .variables = "Species",
      .parallel = TRUE, ## with parallel
      .fun = function(SpeciesData) {
        Model <- lm(formula = "Sepal.Length~Sepal.Width+Petal.Length+Petal.Width",
                    data = SpeciesData)
        saveRDS(Model,
                gsub(x = "Species.rds",
                     pattern = "Species",
                     replacement = unique(SpeciesData$Species)))
      })
sapply(dir(pattern="rds"),file.size)
setosa.rds versicolor.rds virginica.rds
6389 6300 6277
You then use readRDS instead of load to read the file back:
m1 = readRDS("setosa.rds")
m1
Call:
lm(formula = "Sepal.Length~Sepal.Width+Petal.Length+Petal.Width",
data = SpeciesData)
Coefficients:
(Intercept) Sepal.Width Petal.Length Petal.Width
2.3519 0.6548 0.2376 0.2521
We can compare the coefficients with those from the rda object:
m2 = get(load("setosa.RData"))
m2
Call:
lm(formula = "Sepal.Length~Sepal.Width+Petal.Length+Petal.Width",
data = SpeciesData)
Coefficients:
(Intercept) Sepal.Width Petal.Length Petal.Width
2.3519 0.6548 0.2376 0.2521
The objects are not identical because of the environment parts, but for prediction and the other things we normally use a model for, they behave the same:
identical(predict(m1,data.frame(iris[1:10,])),predict(m2,data.frame(iris[1:10,])))
As others mentioned, some small amount of information about the environment is probably being saved along with the models; you normally wouldn't notice it, but here the files are so small that it stands out.
If you're just interested in file size, try saving the models into a single list and then save that into one file. ddply
can only handle a data.frame as a result from the function, so we have to use dlply
instead to tell it to store the results in a list. Doing this saved to just one file that was 60k.
Here's an example of what I'm talking about:
library("plyr")
doSNOW::registerDoSNOW(cl<-snow::makeCluster(3))
models<-dlply(.data = iris,
.variables = "Species",
.parallel=TRUE,##With parallel
.fun = function(SpeciesData){
#Create Simple Model -------------------------------------------------------------
lm(formula = Sepal.Length~Sepal.Width+Petal.Length+Petal.Width, data = SpeciesData)
})
snow::stopCluster(cl)
save(models, compress= FALSE, file= 'combined_models')
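To use the models later, load the file and index the list by species; dlply names the list elements after the levels of the split variable. A minimal sketch:

load('combined_models')   # restores the list as `models`
names(models)             # "setosa" "versicolor" "virginica"
summary(models$setosa)    # each element is an ordinary lm fit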