Objects can be saved and read back like so:
# Save as file
saveRDS(iris, "mydata.RDS")
# Read back in
readRDS("mydata.RDS")
But this doesn't seem to work for objects made with xml2::read_html():
library(rvest)
someobject <- read_html("https://stackoverflow.com/")
saveRDS(someobject, "someobject.RDS")
This creates a file, but reading it back does not work as expected:
readRDS("someobject.RDS")
Error in doc_is_html(x$doc) : external pointer is not valid
What's going on and what's the simplest way of saving an html object so that it can be loaded back in with minimal code/fuss?
To read an R data file with the .rds extension, call readRDS(). As with a CSV file, you can load an RDS file straight from a website; however, the connection must first be passed through a decompressor before it is handed to readRDS(). The built-in gzcon() function can be used for this purpose.
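A minimal sketch of the gzcon() idiom described above. To stay runnable offline it wraps a local file connection; the temporary path and the use of head(iris) are assumptions made for illustration, and for a remote file a url() connection opened in "rb" mode would take the place of file().

```r
# Write a small object to a temporary .rds file (gzip-compressed by default).
path <- tempfile(fileext = ".rds")
saveRDS(head(iris), path)

# Open a raw binary connection and wrap it in gzcon() so the stream is
# decompressed on the fly before readRDS() sees it.
con <- gzcon(file(path, "rb"))
restored <- readRDS(con)
close(con)

# Check that the round trip preserved the object.
identical(restored, head(iris))
```

When readRDS() is given a file name instead of a connection, it handles the decompression itself, so the wrapper is only needed for connections.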
To save data as an RData file, use the save() function. To save data as an RDS file, use the saveRDS() function. In each case, the first argument is the R object you wish to save. Then supply a file argument giving the file name or file path to save to.
RDS files store a single R object. According to the R documentation: "These functions provide the means to save a single R object to a connection (typically a file) and to restore the object, quite possibly under a different name."
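The difference between the two formats can be sketched as follows. The variable names and temporary paths are illustrative assumptions; the point is only that load() restores objects under their saved names, while readRDS() returns a value that the caller names.

```r
x <- 1:5
y <- letters[1:3]
rdata_path <- tempfile(fileext = ".RData")
rds_path <- tempfile(fileext = ".rds")

# save() stores one or more objects together with their names.
save(x, y, file = rdata_path)
# saveRDS() stores a single object without its name.
saveRDS(x, file = rds_path)

rm(x, y)
load(rdata_path)              # recreates x and y under their original names
renamed <- readRDS(rds_path)  # the caller chooses the name
```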
To answer "what's going on": saveRDS is trying to serialize the object being saved. Here, the object someobject is a list with elements someobject$doc and someobject$node. The type of these elements is externalptr (external pointer), which means they reference a C data structure held in memory. When external pointers are serialized, the reference is lost. Hence the error "external pointer is not valid".
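This can be checked on a small document. The sketch below parses an inline HTML string (an assumption, to avoid needing a network connection) and shows the element types, then catches whatever error the restored object raises when used. The exact error text may vary between xml2 versions, so it is captured and displayed without being asserted.

```r
library(xml2)

page <- read_html("<html><body><p>Hello</p></body></html>")

# The document is a list whose elements are external pointers.
typeof(page$doc)   # "externalptr"
typeof(page$node)  # "externalptr"

path <- tempfile(fileext = ".rds")
saveRDS(page, path)
restored <- readRDS(path)

# Using the restored object is expected to fail, because its pointers
# no longer reference a live C structure; capture the message if so.
result <- tryCatch(as.character(restored),
                   error = function(e) conditionMessage(e))
result
```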
You could convert someobject to a string using as.character() and pass that to saveRDS:
saveRDS(as.character(someobject), "someobject.RDS")
Then recreate the object using readRDS and read_html:
someobject <- read_html(readRDS("someobject.RDS"))
But it's easier to use write_html(), as others suggested.
There is some discussion in this GitHub issue thread.
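For reference, a minimal sketch of the write_html() route, assuming an inline HTML string and a temporary file in place of the real page. Reading the file back builds a fresh document, so its pointers are valid.

```r
library(xml2)

page <- read_html("<html><body><p>Hello</p></body></html>")

# Write the document to disk as HTML text.
path <- tempfile(fileext = ".html")
write_html(page, path)

# Reading the file parses it again and yields a usable document.
restored <- read_html(path)
xml_text(xml_find_first(restored, "//p"))
```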
We can use write_xml and read_html from the xml2 package:
before <- read_html("https://stackoverflow.com/")
xml2::write_xml(before, "someobject1.xml")
after <- xml2::read_html("someobject1.xml")
However, identical() returns FALSE, presumably because each object holds its own external pointers:
identical(before, after)
#[1] FALSE
but querying both objects seems to return the same result:
library(rvest)
before %>% html_nodes("div")
after %>% html_nodes("div")
As far as I know, the methods using XML and RDS files seem to be off by the same number of characters. I did a comparison, and it seems that the differences between the original and the loaded versions are in the body nodes.
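library(rvest)  # provides read_html(), html_node(), html_text() and %>%
url <- "https://stackoverflow.com/"
html <- read_html(url)
# Body text of the original, freshly parsed page
html_node(html, "body") %>% html_text() %>% unlist() -> OBT
nchar(OBT)
28879
# Round trip through an XML file
xml2::write_xml(html, "someobject1.xml")
after1 <- xml2::read_html("someobject1.xml")
html_node(after1, "body") %>% html_text() %>% unlist() -> BT1
nchar(BT1)
28893
# Round trip through an RDS file holding the page as a string
html %>% toString %>% saveRDS(., "someobject.RDS")
after2 <- readRDS("someobject.RDS") %>% read_html
html_node(after2, "body") %>% html_text() %>% unlist() -> BT2
nchar(BT2)
28893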
This shows that the two loaded objects have the same number of characters as each other, but not as the original. If we remove all "\n" characters from each text object, the counts match:
library(stringr)  # provides str_remove_all()
BT1 %>% str_remove_all(., "\n") %>% nchar(.)
27733
BT2 %>% str_remove_all(., "\n") %>% nchar(.)
27733
OBT %>% str_remove_all(., "\n") %>% nchar(.)
27733
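Equal character counts do not by themselves prove that the text is equal, so it may be worth comparing the stripped strings directly. The sketch below uses base R only; the two strings and the helper name strip_newlines are hypothetical stand-ins for OBT and a reloaded body text, not output from the real page.

```r
# Hypothetical strings standing in for the body text before and after a round trip.
original <- "Top Questions\nHot Network Questions"
reloaded <- "Top Questions\n\n\nHot Network Questions\n"

nchar(original) == nchar(reloaded)  # FALSE: the newline counts differ

# Remove every newline, then compare the remaining text itself.
strip_newlines <- function(x) gsub("\n", "", x, fixed = TRUE)
identical(strip_newlines(original), strip_newlines(reloaded))  # TRUE
```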