
RDS file size difference between ggplot2 objects created inside vs. outside function

Tags: r, ggplot2

I am trying to build an R project that generates multiple ggplot2 objects using functions. However, when I save these objects as RDS files, the files are much larger than I expected. A plot built inside a function and the same plot built in the global environment produce very different RDS file sizes, even though object.size() reports the same size for both in the R session. For example:

library(ggplot2)
data <- data.frame(x = rnorm(1e6))

p1 <- ggplot(data) + 
  geom_histogram(aes(x = x))

plot_fun <- function(y) {
  p <- ggplot(y) +
    geom_histogram(aes(x = x))
  return(p)
}

p2 <- plot_fun(data)

object.size(p1) # 8 Mb
object.size(p2) # 8 Mb

saveRDS(p1, "plot1.rds")
saveRDS(p2, "plot2.rds")

file.info("plot1.rds", "plot2.rds")

Does anyone know why this happens? Am I returning the object incorrectly from the function?

bc_thaliana asked Jun 15 '18 23:06


2 Answers

This one is tricky. My initial advice was to use pryr::object_size(), which is more thorough about including the size of objects stored in the environment of an object, but that shows only a tiny difference between the two ggplot objects.

However, ggplot objects contain an environment, the $plot_env component, the contents of which will get stored along with the object.

p2$plot_env is the environment created by the call to your function, so it holds the function's local variables:

ls(p2$plot_env)
# [1] "p" "y"

while p1$plot_env is the global environment, which contains a copy of the data as well as the other plot object ...

ls(p1$plot_env)
# [1] "data"     "p1"       "p2"       "plot_fun"
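The way a function call's environment ends up holding its local variables can be sketched in base R, without ggplot2. This is only an illustration: frame_of_call() is a hypothetical stand-in for plot_fun(), and it returns its own call frame directly, whereas ggplot() (as far as I can tell) records the caller's environment through its environment argument.

```r
# Hypothetical stand-in for plot_fun(): it returns the environment
# created for this particular call instead of a plot.
frame_of_call <- function(y) {
  p <- "a plot would be built here"
  environment()
}

env <- frame_of_call(1:3)

# The call frame holds the argument and the local variable.
ls(env)

# Its parent is the environment in which the function was defined.
identical(parent.env(env), environment(frame_of_call))
```

Anything that keeps a reference to such a frame also keeps y and p alive, which is why they show up in ls(p2$plot_env).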

But this still seems a bit mysterious to me. p1 (with more objects in its environment) creates the smaller file size (7.4M), while p2 (with fewer objects) creates the larger file size (22M), and p1 naively seems to have more stuff stored:

sapply(p1$plot_env,object.size)
## plot_fun       p1       p2     data 
##     6568  8004632  8004632  8000728 
sapply(p2$plot_env,object.size)
##       p       y 
## 8004632 8000728 

Is this some kind of recursive nightmare where environments are referencing other environments, which all have to get stored? As @Chris says:

p2's environment has a parent environment of the global environment, while p1's environment is the global environment...I imag[in]e what is happening is that, when R needs to serialize an environment that inherits from another env (i.e., a parent env), it saves the parent env along with the child. That would explain why saving p1 would result in a smaller file size as compared to p2
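One way to probe this hypothesis is to serialize environments directly and count the bytes. In R's serialization format the global environment is written as a short reference marker, while an ordinary environment such as a function's call frame is written together with its contents. The sketch below uses only base R; big and frame_like are illustrative names, not part of the original example.

```r
set.seed(1)
big <- rnorm(1e5)  # roughly 800 KB of doubles

# The global environment is written as a marker, not by content,
# so this stays tiny no matter what the workspace holds.
global_bytes <- length(serialize(globalenv(), NULL))

# An ordinary environment (comparable to a function's call frame)
# is written in full, including everything bound in it.
frame_like <- new.env(parent = globalenv())
assign("y", big, envir = frame_like)
frame_bytes <- length(serialize(frame_like, NULL))

c(global = global_bytes, frame_like = frame_bytes)
```

If that is what happens here, the extra size of plot2.rds would come from the function's own frame (y and p) rather than from its global parent: the file would hold the data about three times (in the plot, in y, and in p), which is in line with 22M against 7.4M.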

If I replace the plotting environment of p2 with the global environment, the file size does get smaller ... and I think I didn't break the plotting object.

p2$plot_env <- p1$plot_env
saveRDS(p2, "plot2.rds")
system("ls -lht plot?.rds")
## -rw-r--r--  1 bolker  staff   7.4M 15 Jun 20:15 plot2.rds
## -rw-r--r--  1 bolker  staff   7.4M 15 Jun 20:14 plot1.rds
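The same swap can be wrapped in a small helper and its effect measured without touching the disk. This is a minimal sketch under stated assumptions: strip_plot_env() and make_stand_in() are hypothetical names, and the stand-in is a plain list with a plot_env element rather than a real ggplot object.

```r
# Hypothetical helper: point plot_env at the global environment, which
# is serialized as a reference rather than by content.
strip_plot_env <- function(p) {
  p$plot_env <- globalenv()
  p
}

# Stand-in for a plot built inside a function: a list that carries the
# data plus the call frame it was created in.
make_stand_in <- function() {
  big <- rnorm(1e5)
  list(data = big, plot_env = environment())
}

obj <- make_stand_in()
size_before <- length(serialize(obj, NULL))
size_after  <- length(serialize(strip_plot_env(obj), NULL))

c(before = size_before, after = size_after)
```

Two caveats: anything in the plot that is still looked up in the function's environment at render time will no longer be found there, and in newer ggplot2 versions other components (for example the aesthetic mappings) may keep their own reference to that environment, so it is worth measuring the result on a real plot.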

If your workflow allows it, you might consider storing rendered versions of these plots (as PDF/SVG/whatever) rather than the plot objects themselves ... although the plot objects are certainly more flexible.

Ben Bolker answered Sep 28 '22 19:09


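If you want the size your object will have once serialized (before saveRDS() compresses it), use length(serialize(p1, NULL)). Unlike object.size(), this includes whatever is stored in the environments the object references, which, as stated above, is where the difference comes from.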
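A short base-R comparison shows how far the two measures can diverge when an object carries an environment. The names serialized_size(), make_holder() and holder are illustrative only.

```r
# Size in bytes of the uncompressed serialized form of an object.
serialized_size <- function(x) length(serialize(x, NULL))

# A small list that drags along the call frame it was created in,
# including a large local vector.
make_holder <- function() {
  payload <- rnorm(1e5)  # roughly 800 KB, only reachable via the environment
  list(label = "small", env = environment())
}

holder <- make_holder()

as.numeric(object.size(holder))  # does not count the environment's contents
serialized_size(holder)          # includes payload
```

The file written by saveRDS() will usually be somewhat smaller than serialized_size() reports, because saveRDS() compresses by default.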

T.Gulea answered Sep 28 '22 17:09