
R - readRDS() & load() fail to give identical data.tables as the original

Background

I tried to replace some CSV output files with rds files to improve efficiency. These are intermediate files that will serve as inputs to other R scripts.

Question

I started investigating when my scripts failed and found that readRDS() and load() do not return data.tables identical to the original. Is this supposed to happen, or did I miss something?

Sample code

library( data.table )

aDT <- data.table( a=1:10, b=LETTERS[1:10] )
saveRDS( aDT, file = "aDT.rds")
bDT <- readRDS( file = "aDT.rds" )
identical( aDT, bDT, ignore.environment = TRUE )  # Gives FALSE

aDF <- data.frame( a=1:10, b=LETTERS[1:10] )
saveRDS( aDF, file = "aDF.rds")
bDF <- readRDS( file = "aDF.rds" )
identical( aDF, bDF, ignore.environment = TRUE )  # Gives TRUE

# Using 'save' & 'load' doesn't help either
aDT2 <- data.table( a=1:10, b=LETTERS[1:10] )
save( aDT2, file = "aDT2.RData")
bDT2 <- aDT2; rm( aDT2 )
load( file = "aDT2.RData" )
identical( aDT2, bDT2, ignore.environment = TRUE )  # Gives FALSE

I am running R ver 3.2.0 on Linux Mint and have tested with data.table ver 1.9.4 and 1.9.5 (latest).

Searching SO and Google returned this and this, but I don't think they address this issue. I am still trying to figure out why my scripts failed when I switched to rds, but I am starting with this.

Would appreciate it very much if knowledgeable SO members can help. Thanks!

Edit:

Hi everyone, I happened to find a way to resolve the issue - have posted the solution below. I apologise if it's rather inelegant. Now, I have 2 further questions:

(1) Is there a better way?

(2) Can something be done in the R and/or data.table code to resolve this? I mean, this issue causes unpredictable bugs and is not the first thing that comes to mind. My 2 cents worth.

NoviceProg asked Jul 06 '15 16:07



2 Answers

Probably, this has to do with pointers:

> attributes(aDT)
$names
[1] "a" "b"

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10

$class
[1] "data.table" "data.frame"

$.internal.selfref
<pointer: 0x0000000000390788>

> attributes(bDT)
$names
[1] "a" "b"

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10

$class
[1] "data.table" "data.frame"

$.internal.selfref
<pointer: (nil)>

> attributes(bDF)
$names
[1] "a" "b"

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10

$class
[1] "data.frame"

> attributes(aDF)
$names
[1] "a" "b"

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10

$class
[1] "data.frame"

You can look closely at what's going on using the .Internal(inspect(.)) command:

.Internal(inspect(aDT))

.Internal(inspect(bDT))
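
To check that the self-reference pointer really is the only difference, one can compare every other attribute directly. This is only a sketch: the temporary file and the names attrs_a, attrs_b, keep and same_other_attrs are made up for illustration and are not from the question.

```r
library(data.table)

# Rebuild the example from the question, using a temporary file (an assumption)
aDT <- data.table(a = 1:10, b = LETTERS[1:10])
rds_file <- tempfile(fileext = ".rds")
saveRDS(aDT, file = rds_file)
bDT <- readRDS(file = rds_file)

# Collect the attributes of both tables
attrs_a <- attributes(aDT)
attrs_b <- attributes(bDT)

# Compare everything except the self-reference pointer
keep <- setdiff(names(attrs_a), ".internal.selfref")
same_other_attrs <- identical(attrs_a[keep], attrs_b[keep])
print(same_other_attrs)

# The column data can be compared without any attributes of the table itself
same_columns <- identical(aDT$a, bDT$a) && identical(aDT$b, bDT$b)
print(same_columns)
```

If both checks come back TRUE, the mismatch reported by identical() comes down to $.internal.selfref alone.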
user227710 answered Oct 27 '22 08:10


The newly loaded data.table doesn't know the pointer value of the already loaded one. You could tell it with

attributes(bDT)$.internal.selfref <- attributes(aDT)$.internal.selfref
identical( aDT, bDT, ignore.environment = TRUE )
# [1] TRUE

data.frames don't keep this attribute, probably because they are not modified in place.
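
Instead of copying the pointer by hand, two alternatives may be worth trying. This is a rough sketch, assuming a reasonably recent data.table in which all.equal() has a data.table method that skips .internal.selfref, and in which setDT() can be called on a table that has just been read back. The temporary file and the column name a2 are made up for illustration.

```r
library(data.table)

aDT <- data.table(a = 1:10, b = LETTERS[1:10])
rds_file <- tempfile(fileext = ".rds")
saveRDS(aDT, file = rds_file)
bDT <- readRDS(file = rds_file)

# Option 1: compare contents with all.equal() rather than identical(),
# on the assumption that the data.table method ignores the self-reference
contents_match <- isTRUE(all.equal(aDT, bDT))
print(contents_match)

# Option 2: let data.table re-establish the self-reference after loading,
# so that later by-reference updates work on bDT as usual
setDT(bDT)
bDT[, a2 := a * 2L]
print(names(bDT))
```

Option 1 only changes how the comparison is made; option 2 is about making the reloaded table safe for := afterwards. Neither touches the pointer attribute directly.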

Rorschach answered Oct 27 '22 09:10