
Reading an RDS file within a zip file without extracting to disk

Tags: import, r, zip

Is there a reason I can't read an RDS file from within a zip file directly, without having to unzip it to a temp file on disk first?

Let's say this is the zip file:

saveRDS(cars, "cars.rds")
saveRDS(iris, "iris.rds")
write.csv(iris, "iris.csv")
zip("datasets.zip", c("cars.rds", "iris.rds", "iris.csv"))
file.remove("cars.rds", "iris.rds", "iris.csv")

For the csv file, I could read it directly like this:

iris2 <- read.csv(unz("datasets.zip", "iris.csv"))

However, I don't understand why I can't use unz() directly with readRDS():

iris3 <- readRDS(unz("datasets.zip", "iris.rds"))

This gives me the error:

Error: unknown input format

I'd also like to understand why this happens. I'm aware that I could do the following, as in this question:

path <- unzip("datasets.zip", "iris.rds")
iris4 <- readRDS(path)
file.remove(path)

This doesn't seem as efficient, though, and I need to do it frequently for a really large number of files, so I/O inefficiencies matter. Is there any workaround to read the rds file without extracting it to disk?

cocquemas asked Oct 22 '15 20:10



1 Answer

This was a little tricky to track down until I read the body of readRDS(). What it seems you need to do is:

  1. Open a connection to the .zip archive and the file inside it with unz()
  2. Apply GZIP decompression to this connection using gzcon()
  3. And finally pass this decompressed connection to readRDS().

Here's an example to illustrate, using the following serialised matrix mat stored inside a zip archive matrix.zip:

mat <- matrix(1:9, ncol = 3)
saveRDS(mat, "matrix.rds")
zip("matrix.zip", "matrix.rds")

Open a connection to matrix.zip

con <- unz("matrix.zip", filename = "matrix.rds")

Now, using gzcon(), apply GZIP decompression to this connection

con2 <- gzcon(con)

Finally, read from the connection

mat2 <- readRDS(con2)

In full we have

con <- unz("matrix.zip", filename = "matrix.rds")
con2 <- gzcon(con)
mat2 <- readRDS(con2)
close(con2)

This gives

> con <- unz("matrix.zip", filename = "matrix.rds")
> con2 <- gzcon(con)
> mat2 <- readRDS(con2)
> close(con2)
> mat2
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> all.equal(mat, mat2)
[1] TRUE
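
If this has to be done for many files, the three steps above can be wrapped in a small function. The following is only a minimal sketch: the name read_rds_from_zip is hypothetical (it is not part of base R), the temporary-directory setup is just there to make the example runnable, and it assumes the .rds members were written with saveRDS()'s default gzip compression and that a zip utility is available for zip().

```r
# Hypothetical helper (not a base R function): read one .rds member
# from a zip archive without extracting it to disk.
read_rds_from_zip <- function(zipfile, member) {
  con <- gzcon(unz(zipfile, filename = member))
  # Closing the gzcon wrapper also closes the underlying unz connection.
  on.exit(close(con), add = TRUE)
  readRDS(con)
}

# Build a small archive in a temporary directory to try it out.
old_wd <- setwd(tempdir())
mat <- matrix(1:9, ncol = 3)
saveRDS(mat, "matrix.rds")
saveRDS(cars, "cars.rds")
zip("rds_many.zip", c("matrix.rds", "cars.rds"))
file.remove("matrix.rds", "cars.rds")

# Read a single member.
mat2 <- read_rds_from_zip("rds_many.zip", "matrix.rds")

# Read every .rds member listed in the archive into a named list.
members <- unzip("rds_many.zip", list = TRUE)$Name
members <- members[grepl("\\.rds$", members)]
objs <- lapply(members, function(m) read_rds_from_zip("rds_many.zip", m))
names(objs) <- members

setwd(old_wd)
```

Using on.exit() means the connection is closed even if readRDS() fails partway through, which matters when looping over a large number of files.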

Why?

Why you have to go through this convoluted extra step is (I think) described in ?readRDS:

Compression is handled by the connection opened when file is a file name, so is only possible when file is a connection if handled by the connection. So e.g. url connections will need to be wrapped in a call to gzcon.

And if you look at the internals of readRDS() we see:

> readRDS
function (file, refhook = NULL) 
{
    if (is.character(file)) {
        con <- gzfile(file, "rb")
        on.exit(close(con))
    }
    else if (inherits(file, "connection")) 
        con <- file
    else stop("bad 'file' argument")
    .Internal(unserializeFromConn(con, refhook))
}
<bytecode: 0x2841998>
<environment: namespace:base>

If file is a character string giving the file name, readRDS() opens it with gzfile(), which creates a decompressing connection to the .rds we want to read. Notice that if you pass a connection as file, as you want to do, at no point does R apply decompression to that connection. file is just assigned to con and then passed to the internal function unserializeFromConn. Hence wrapping gzcon() around the connection created by unz() works.

Basically, when unserializeFromConn reads from a connection it expects the data to be already decompressed, but that decompression only happens automagically when you pass readRDS() a filename, not a connection.
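
One way to see this for yourself is to peek at the first bytes that unz() hands back. The sketch below is an illustration under assumptions, not anything taken from readRDS() itself: the helper peek and the file names are made up for the example, and it assumes a zip utility is available for zip(). A member written with the default saveRDS() starts with the gzip magic bytes, whereas one written with compress = FALSE starts with the serialization header and can be read from the unz() connection directly.

```r
old_wd <- setwd(tempdir())
saveRDS(iris, "iris.rds")                        # gzip-compressed by default
saveRDS(iris, "iris_raw.rds", compress = FALSE)  # plain serialized bytes
zip("rds_demo.zip", c("iris.rds", "iris_raw.rds"))
file.remove("iris.rds", "iris_raw.rds")

# Hypothetical helper: return the first two bytes of an archive member.
peek <- function(member) {
  con <- unz("rds_demo.zip", member, open = "rb")
  on.exit(close(con))
  readBin(con, what = "raw", n = 2)
}

magic_gz  <- peek("iris.rds")      # 1f 8b: gzip header, not serialized data
magic_raw <- peek("iris_raw.rds")  # 58 0a: "X\n", the XDR serialization header

# The uncompressed member needs no gzcon() wrapper.
con <- unz("rds_demo.zip", "iris_raw.rds", open = "rb")
iris_raw <- readRDS(con)
close(con)

setwd(old_wd)
```

The zip layer and the .rds layer are compressed independently: unz() only undoes the zip compression, so the gzip layer added by saveRDS() is still there when readRDS() receives the bytes.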

Gavin Simpson answered Oct 13 '22 13:10