
R: serialize objects to text file and back again

I have a process in R that creates a bunch of objects, serializes them, and puts them into plain text files. This seemed like a really good way to handle things since I am working with Hadoop and all output needs to stream through stdin and stdout.

The problem I am left with is how to read these objects out of the text file and back into R on my desktop machine. Here's a working example that illustrates the challenge:

Let's create a tmp file and write a single object into it. This object is just a vector:

# Serialize a vector in ASCII format and write the resulting string to a text file
outCon <- file("c:/tmp", "w")
mychars <- rawToChar(serialize(1:10, NULL, ascii = TRUE))
cat(mychars, file = outCon)
close(outCon)

The mychars object looks like this:

> mychars
[1] "A\n2\n133633\n131840\n13\n10\n1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n"

When written to the text file, it looks like this:

A
2
133633
131840
13
10
1
2
3
4
5
6
7
8
9
10

I'm probably overlooking something terribly obvious, but how do I read this file back into R and unserialize the object? When I try scan() or readLines(), both treat the newline characters as record delimiters, and I end up with a vector where each element is a row from the text file. What I really want is a single text string with the whole contents of the file. Then I can unserialize the string.

Perl will read line breaks back into a string, but I can't figure out how to override the way R treats line breaks.

JD Long asked Feb 13 '10 18:02



1 Answer

JD, we do that in the digest package via serialize() to/from raw. That is nice as you can store serialized objects in SQL and other places. I would actually store this as RData as well, which is way quicker to load() (no parsing!) and save().
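As a minimal sketch of the two storage routes mentioned here (a raw vector from serialize() and an RData file), the following round-trips the vector from the question. The temporary file path and the variable names are chosen only for illustration:

```r
# Sketch: round-trip an object through a raw vector and through an .RData file.
x <- 1:10

# serialize() with connection = NULL returns a raw vector,
# which unserialize() turns back into the original object
raw_bytes  <- serialize(x, connection = NULL)
x_from_raw <- unserialize(raw_bytes)

# save() writes the object under its own name to a binary file;
# load() restores it into the given environment
rdata_path <- tempfile(fileext = ".RData")  # hypothetical location
save(x, file = rdata_path)
restored <- new.env()
load(rdata_path, envir = restored)

stopifnot(identical(x, x_from_raw), identical(x, restored$x))
```

The raw vector is convenient when the bytes go into a database column or similar; the RData file is the simpler choice when a binary file on disk is acceptable.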

Or, if it has to be rawToChar() and ascii, then use something like this (taken straight from help(digest), where we compare serialization of the file COPYING):

 library(digest)  # provides digest()

 # test 'length' parameter and file input
 fname <- file.path(R.home(), "COPYING")
 x <- readChar(fname, file.info(fname)$size) # read the whole file into one string
 for (alg in c("sha1", "md5", "crc32")) {
   # partial file
   h1 <- digest(x    , length=18000, algo=alg, serialize=FALSE)
   h2 <- digest(fname, length=18000, algo=alg, serialize=FALSE, file=TRUE)
   h3 <- digest( substr(x,1,18000) , algo=alg, serialize=FALSE)
   stopifnot( identical(h1,h2), identical(h1,h3) )
   # whole file
   h1 <- digest(x    , algo=alg, serialize=FALSE)
   h2 <- digest(fname, algo=alg, serialize=FALSE, file=TRUE)
   stopifnot( identical(h1,h2) )
 }

So with that, your example becomes this:

R> outCon <- file("/tmp/jd.txt", "w")
R> mychars <- rawToChar(serialize(1:10, NULL, ascii=T))
R> cat(mychars, file=outCon); close(outCon)
R> fname <- "/tmp/jd.txt"
R> readChar(fname, file.info(fname)$size)
[1] "A\n2\n133633\n131840\n13\n10\n1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n"
R> unserialize(charToRaw(readChar(fname, file.info(fname)$size)))
[1]  1  2  3  4  5  6  7  8  9 10
R> 
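The transcript above reads the whole file with readChar(). As a hedged sketch of two alternatives, the code below first rejoins the output of readLines() into one string, then lets unserialize() read from an open connection instead. The temporary file name is hypothetical, and the second approach assumes the file holds nothing but the serialized object:

```r
# Sketch: two other ways to get the serialized object back from the text file.
path <- tempfile(fileext = ".txt")  # hypothetical file name
mychars <- rawToChar(serialize(1:10, NULL, ascii = TRUE))
cat(mychars, file = path)

# 1. readLines() drops the newlines, so glue the lines back together
#    and restore the trailing newline before converting to raw
lines <- readLines(path)
text  <- paste0(paste(lines, collapse = "\n"), "\n")
obj_lines <- unserialize(charToRaw(text))

# 2. unserialize() also accepts an open connection
con <- file(path, "rb")
obj_con <- unserialize(con)
close(con)

stopifnot(identical(obj_lines, 1:10), identical(obj_con, 1:10))
```

Reading from the connection avoids building the intermediate string at all, which may matter for large objects.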
Dirk Eddelbuettel answered Nov 29 '22 22:11