I have a complicated list object, the output of a modelling function (asreml). The object contains all sorts of data types, including functions and formulas, which have environments attached. I don't want to save the environments to RDS, because they are quite big and I save a lot of models.
I came across the parameter refhook= in the serialize and saveRDS functions. The documentation says:
The refhook functions can be used to customize handling of non-system reference objects (all external pointers and weak references, and all environments other than namespace and package environments and .GlobalEnv). The hook function for serialize should return a character vector for references it wants to handle; otherwise it should return NULL.
Given this example model
e <- new.env()
e$a = rnorm(10)
l <- list(a = e, b = 42)
The refhook function does have an effect. The output gets smaller when I define a function that returns a character string, which suggests that the environment is not saved:
length(serialize(l, connection = NULL))
[1] 338
s <- serialize(l,
connection = NULL,
refhook = function(x) "")
length(s)
[1] 109
However, I cannot read in the resulting object:
unserialize(s)
Error in unserialize(s) :
no restore method available
I also tried a raw vector output, suspecting that maybe refhook is expected to provide an alternative serialized output, but that won't work:
s2 <- serialize(l,
                connection = NULL,
                refhook = function(x)
                  serialize("env", connection = NULL))
Error in serialize(l, con = NULL, refhook = function(x) serialize("env", :
assertion 'TYPEOF(t) == STRSXP && LENGTH(t) > 0' failed: file 'serialize.c', line 982
How do I use refhook=? What character output is expected from this function?
Ah, I figured it out myself. The error "no restore method available" means that no refhook was supplied to the unserialize function. You need both: a refhook for serialize and a matching refhook for unserialize.
The refhook of serialize is free to return any non-empty character vector. The only thing that needs to understand the result is the refhook of unserialize.
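For the small example from the question, a minimal round trip might look like the sketch below. The hook names drop_env and restore_env and the placeholder string "dropped" are made up for illustration, and the environment is given an empty parent only to keep the example self-contained.

```r
# Example object from the question (environment created with an empty parent).
e <- new.env(parent = emptyenv())
e$a <- rnorm(10)
l <- list(a = e, b = 42)

# Hypothetical serialize hook: replace every environment it is offered
# with a placeholder string; decline anything else by returning NULL.
drop_env <- function(x) if (is.environment(x)) "dropped" else NULL

# Hypothetical unserialize hook: receives the placeholder string and
# returns a fresh empty environment as a stand-in.
restore_env <- function(s) new.env(parent = emptyenv())

s <- serialize(l, connection = NULL, refhook = drop_env)
u <- unserialize(s, refhook = restore_env)

is.environment(u$a)  # the stand-in environment
u$b                  # the ordinary value survives unchanged
```

The contents of e are gone after the round trip, so this only makes sense when the environment is not needed later or can be rebuilt from somewhere else.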
Generate a repository of environments. Let's pretend that these come from an external source and that their contents don't need to be serialized. To restore them, the external data source just needs to be read again.
repo <- list()
for(i in 1:10){
  repo[[i]] <- new.env()
  repo[[i]]$a <- rnorm(1e6)
}
Each environment holds about 8 MB of data. We don't want all this data in our serialized output because it is already stored permanently in repo.
object.size(repo[[1]]$a)
This is the list we want to serialize. It contains the second environment from the repository. We want to store the numeric value b. For the environment, we only want to record that it is environment 2 from the repository. We don't want to serialize its contents, because the repository already has them.
l <- list(a = repo[[2]], b = 42)
This is the refhook for serialize. It looks up the environment in the repository and stores only its index.
ser <- function(e){
  for(i in seq_along(repo)){
    if(identical(e, repo[[i]])){
      message("Identified environment #", i)
      return(as.character(i)) # Just save the index
    }
  }
  message("Environment not found in the repository")
  return(NULL) # NULL: serialize this reference in the usual way
}
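Returning NULL from the hook means "not handled", so such a reference is serialized as usual. A small sketch of that fallback follows; the name decline is hypothetical and the environment has an empty parent only to keep the example self-contained.

```r
e <- new.env(parent = emptyenv())
e$a <- rnorm(10)
l <- list(a = e, b = 42)

# Hypothetical hook that never handles anything.
decline <- function(x) NULL

s_plain <- serialize(l, connection = NULL)
s_null  <- serialize(l, connection = NULL, refhook = decline)

# Nothing was replaced, so both outputs have the same size ...
length(s_plain) == length(s_null)

# ... and no refhook is needed to read the result back.
u <- unserialize(s_null)
identical(u$a$a, e$a)
```

This is why the hook can safely be applied to objects that mix repository environments with other environments: anything it does not recognize is simply stored in full.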
The corresponding refhook for unserialize takes the index and returns the corresponding environment from repo:
unser <- function(s){
  i <- as.numeric(s)
  return(repo[[i]])
}
This saves a lot of space in the serialized output.
Without a custom refhook, the output also contains the environment:
object.size(serialize(l, con = NULL))
## 8000040 bytes
With the custom refhook, only l$b and the environment index are saved:
s <- serialize(l, con = NULL, refhook = ser)
object.size(s)
## 168 bytes
The environment is looked up in the repository when unserializing:
u <- unserialize(s, refhook = unser)
u
## $a
## <environment: 0x000000001c91a118>
##
## $b
## [1] 42
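Since the question was about saving to RDS: saveRDS and readRDS accept a refhook argument as well, so the same pair of hooks should carry over. The following is a sketch under that assumption, with a smaller stand-in repository and a temporary file path chosen for illustration.

```r
# Small stand-in repository (sizes reduced compared to the example above).
repo <- lapply(1:3, function(i) {
  env <- new.env(parent = emptyenv())
  env$a <- rnorm(1e4)
  env
})

# Same idea as ser/unser above, without the messages.
ser <- function(e) {
  for (i in seq_along(repo)) {
    if (identical(e, repo[[i]])) return(as.character(i))
  }
  NULL  # not in the repository: serialize normally
}
unser <- function(s) repo[[as.integer(s)]]

l <- list(a = repo[[2]], b = 42)

path <- tempfile(fileext = ".rds")
saveRDS(l, path, refhook = ser)
u <- readRDS(path, refhook = unser)

identical(u$a, repo[[2]])  # the very same environment object from repo
file.size(path)            # only the index and l$b end up in the file
```

Note that repo has to exist and be in the same order when the file is read back, because the file stores only the position in the repository.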