Background
I tried to replace some CSV output files with rds files to improve efficiency. These are intermediate files that will serve as inputs to other R scripts.
Question
I started investigating when my scripts failed and found that readRDS() and load() do not return data tables that are identical to the original. Is this supposed to happen? Or did I miss something?
Sample code
library( data.table )
# data.table: round trip through saveRDS() / readRDS()
aDT <- data.table( a=1:10, b=LETTERS[1:10] )
saveRDS( aDT, file = "aDT.rds")
bDT <- readRDS( file = "aDT.rds" )
identical( aDT, bDT, ignore.environment = TRUE ) # Gives FALSE
# data.frame: the same round trip
aDF <- data.frame( a=1:10, b=LETTERS[1:10] )
saveRDS( aDF, file = "aDF.rds")
bDF <- readRDS( file = "aDF.rds" )
identical( aDF, bDF, ignore.environment = TRUE ) # Gives TRUE
# Using save() and load() doesn't help either
aDT2 <- data.table( a=1:10, b=LETTERS[1:10] )
save( aDT2, file = "aDT2.RData")
bDT2 <- aDT2; rm( aDT2 ) # keep the in-memory copy under another name
load( file = "aDT2.RData" ) # restores aDT2 from disk
identical( aDT2, bDT2, ignore.environment = TRUE ) # Gives FALSE
I am running R ver 3.2.0 on Linux Mint and have tested with data.table ver 1.9.4 and 1.9.5 (latest).
Searching in SO and Google returned this and this, but I don't think they answer this issue. I am still trying to figure out why my scripts failed when I switched to rds, but I am starting with this.
Would appreciate it very much if knowledgeable SO members can help. Thanks!
Edit:
Hi everyone, I happened to find a way to resolve the issue and have posted the solution below. I apologise if it's rather inelegant. Now, I have 2 further questions:
(1) Is there a better way?
(2) Can something be done in the R and/or data.table code to resolve this? I mean, this issue causes unpredictable bugs and is not the first thing that comes to mind. My 2 cents worth.
Details. saveRDS and readRDS provide the means to save a single R object to a connection (typically a file) and to restore the object, quite possibly under a different name. This differs from save and load, which save and restore one or more named objects into an environment.
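To make that difference concrete, here is a minimal sketch using only base R and temporary files. The names `x`, `y`, `env` and `restored` are made up for the illustration.

```r
# Illustrative only: object and file names are hypothetical.
tmp_rds   <- tempfile(fileext = ".rds")
tmp_rdata <- tempfile(fileext = ".RData")

x <- data.frame(a = 1:3, b = letters[1:3])

# saveRDS()/readRDS(): a single object, no name stored in the file.
# The caller decides which name the restored object gets.
saveRDS(x, file = tmp_rds)
y <- readRDS(tmp_rds)

# save()/load(): object names are stored in the file.
# load() recreates `x` inside the target environment.
save(x, file = tmp_rdata)
env <- new.env()
restored <- load(tmp_rdata, envir = env)  # names of the restored objects

identical(x, y)                             # plain data frame survives the round trip
restored                                    # character vector of restored names
exists("x", envir = env, inherits = FALSE)  # `x` now lives in `env`
```

Loading into a fresh environment, as done here, is one way to avoid load() silently overwriting an object of the same name in the workspace.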
rds files are serialized, so you don't want to edit the file directly. Load the object, change the things you need changed, and then re-save the new object.
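A rough sketch of that load, modify, re-save cycle with base R only. The temporary path and the added column `doubled` are just placeholders for this example.

```r
# Illustrative only: the path and the column `doubled` are hypothetical.
path <- tempfile(fileext = ".rds")
saveRDS(data.frame(a = 1:3), file = path)

obj <- readRDS(path)        # 1. load the object
obj$doubled <- obj$a * 2    # 2. change it in memory
saveRDS(obj, file = path)   # 3. re-save, overwriting the old file

names(readRDS(path))        # the file now holds the modified object
```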
Probably, this has to do with pointers:
> attributes(aDT)
$names
[1] "a" "b"
$row.names
[1] 1 2 3 4 5 6 7 8 9 10
$class
[1] "data.table" "data.frame"
$.internal.selfref
<pointer: 0x0000000000390788>
> attributes(bDT)
$names
[1] "a" "b"
$row.names
[1] 1 2 3 4 5 6 7 8 9 10
$class
[1] "data.table" "data.frame"
$.internal.selfref
<pointer: (nil)>
> attributes(bDF)
$names
[1] "a" "b"
$row.names
[1] 1 2 3 4 5 6 7 8 9 10
$class
[1] "data.frame"
> attributes(aDF)
$names
[1] "a" "b"
$row.names
[1] 1 2 3 4 5 6 7 8 9 10
$class
[1] "data.frame"
You can take a closer look at what's going on using the .Internal(inspect(.)) command:
.Internal(inspect(aDT))
.Internal(inspect(bDT))
The newly loaded data.table doesn't know the pointer value of the already loaded one. You could tell it with
attributes(bDT)$.internal.selfref <- attributes(aDT)$.internal.selfref
identical( aDT, bDT, ignore.environment = TRUE )
# [1] TRUE
A data.frame doesn't keep this attribute, probably because data frames don't do in-place modification.
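As a possible alternative to copying the attribute by hand, here is a sketch under two assumptions: that a reasonably recent data.table is installed, and that its setDT() re-creates the self-reference for a table read from disk (I have not checked this against 1.9.4). It also compares contents with all.equal() instead of identical(), since the data.table method, as far as I know, does not look at .internal.selfref. The column name `twice` is made up.

```r
library(data.table)

aDT <- data.table(a = 1:10, b = LETTERS[1:10])
path <- tempfile(fileext = ".rds")
saveRDS(aDT, file = path)
bDT <- readRDS(path)

# Compare contents rather than internals; all.equal() dispatches to the
# data.table method here.
same_content <- isTRUE(all.equal(aDT, bDT))
same_content

# Assumption: setDT() on a freshly loaded data.table rebuilds its internal
# self-reference, so later := calls can add columns by reference.
setDT(bDT)
bDT[, twice := a * 2L]   # hypothetical extra column, added by reference
names(bDT)
```

If the scripts only need to know whether the data survived the round trip, the all.equal() check alone may be enough and the pointer can be left to data.table.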