Reading an RDS file within a zip file without extracting to disk

Is there a reason I can't read an RDS file from within a zip file directly, without having to unzip it to a temp file on disk first?

Let's say this is the zip file:

saveRDS(cars, "cars.rds")
saveRDS(iris, "iris.rds")
write.csv(iris, "iris.csv")
zip("datasets.zip", c("cars.rds", "iris.rds", "iris.csv"))
file.remove("cars.rds", "iris.rds", "iris.csv")

For the csv file, I could read it directly like this:

iris2 <- read.csv(unz("datasets.zip", "iris.csv"))

However, I don't understand why I can't use unz() directly with readRDS():

iris3 <- readRDS(unz("datasets.zip", "iris.rds"))

This gives me the error:

Error: unknown input format

I'd also like to understand why this happens. I'm aware that I could do the following, as in this question:

path <- unzip("datasets.zip", "iris.rds")
iris4 <- readRDS(path)
file.remove(path)

This doesn't seem as efficient, though, and I need to do it frequently for a really large number of files, so I/O inefficiencies matter. Is there any workaround to read the rds file without extracting it to disk?

This was a little tricky to track down until I read the body of readRDS(). What it seems you need to do is:

1. Open a connection to the .zip archive and the file inside it with unz().
2. Apply GZIP decompression to this connection using gzcon().
3. Finally, pass this decompressing connection to readRDS().

Here's an example using a serialised matrix, mat, stored inside a zip archive, matrix.zip:

mat <- matrix(1:9, ncol = 3)
saveRDS(mat, "matrix.rds")
zip("matrix.zip", "matrix.rds")

Open a connection to matrix.zip

con <- unz("matrix.zip", filename = "matrix.rds")

Now, using gzcon(), apply GZIP decompression to this connection

con2 <- gzcon(con)

Finally, read from the connection

mat2 <- readRDS(con2)

In full we have

con <- unz("matrix.zip", filename = "matrix.rds")
con2 <- gzcon(con)
mat2 <- readRDS(con2)
close(con2)

This gives

> con <- unz("matrix.zip", filename = "matrix.rds")
> con2 <- gzcon(con)
> mat2 <- readRDS(con2)
> close(con2)
> mat2
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> all.equal(mat, mat2)
[1] TRUE
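
If you need to do this for many files, the three steps above can be wrapped in a small function. This is only a sketch: read_rds_from_zip() and read_all_rds() are names made up for illustration (they are not part of base R), and the demonstration assumes a zip utility is available for zip() to call.

```r
# Hypothetical helper: read one .rds entry from a zip archive
# without extracting it to disk.
read_rds_from_zip <- function(zipfile, entry) {
  con <- gzcon(unz(zipfile, entry))  # unz() strips the zip layer, gzcon() the gzip layer
  on.exit(close(con))                # closing the wrapper also closes the unz() connection
  readRDS(con)
}

# Hypothetical helper: read every entry whose name ends in .rds.
read_all_rds <- function(zipfile) {
  entries <- unzip(zipfile, list = TRUE)$Name      # list entries, nothing is extracted
  rds <- grep("\\.rds$", entries, value = TRUE)
  out <- lapply(rds, read_rds_from_zip, zipfile = zipfile)
  names(out) <- rds
  out
}

# Demonstration in a temporary directory, rebuilding the archive from the question.
old <- setwd(tempdir())
saveRDS(cars, "cars.rds")
saveRDS(iris, "iris.rds")
write.csv(iris, "iris.csv")
zip("datasets.zip", c("cars.rds", "iris.rds", "iris.csv"))
file.remove("cars.rds", "iris.rds", "iris.csv")
res <- read_all_rds("datasets.zip")
setwd(old)
```

Using on.exit() means the connection is closed even if readRDS() fails part-way through, which matters when looping over a large number of entries.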

Why?

Why you have to go through this convoluted extra step is (I think) described in ?readRDS:

Compression is handled by the connection opened when file is a file name, so is only possible when file is a connection if handled by the connection. So e.g. url connections will need to be wrapped in a call to gzcon.

And if you look at the internals of readRDS() we see:

> readRDS
function (file, refhook = NULL) 
{
    if (is.character(file)) {
        con <- gzfile(file, "rb")
        on.exit(close(con))
    }
    else if (inherits(file, "connection")) 
        con <- file
    else stop("bad 'file' argument")
    .Internal(unserializeFromConn(con, refhook))
}
<bytecode: 0x2841998>
<environment: namespace:base>

If file is a character string naming a file, readRDS() opens it with gzfile(), which decompresses transparently as the .rds is read. Notice that if you pass a connection as file, as you want to do, at no point does R add a decompression layer: file is just assigned to con and then passed to the internal function unserializeFromConn. saveRDS() compresses with gzip by default, and unz() only undoes the zip layer, so what arrives is still a gzip stream; that is where "unknown input format" comes from. Hence wrapping gzcon() around the connection created by unz() works.

Basically, when unserializeFromConn reads from a connection it expects the data to be already decompressed, but that decompression only happens automagically when you pass readRDS() a filename, not a connection.
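
One way to check this for yourself is to look at the first two bytes that unz() hands back. The sketch below assumes saveRDS() is left at its default compression and that a zip utility is available for zip(); the file names magic.rds and magic.zip are arbitrary.

```r
old <- setwd(tempdir())
saveRDS(matrix(1:9, ncol = 3), "magic.rds")   # default compress = TRUE, i.e. gzip
zip("magic.zip", "magic.rds")

# Open the zip entry in binary mode and read its first two bytes.
con <- unz("magic.zip", "magic.rds", open = "rb")
magic <- readBin(con, what = "raw", n = 2)
close(con)
setwd(old)

print(magic)   # a gzip stream starts with the bytes 1f 8b
```

If those bytes are the gzip signature, the zip entry still holds a gzip-compressed .rds, which is what gzcon() is needed to unwrap.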

Tags: import, r, zip

Asked by cocquemas; answered by Gavin Simpson