
How to save and read output of read_html as an RDS file?

Tags: r, rvest, xml2

Objects can be saved and read back like so:

# Save as file
saveRDS(iris, "mydata.RDS")

# Read back in 
readRDS("mydata.RDS")

But this doesn't seem to work for objects made with xml2::read_html()

Example

library(rvest)
someobject <- read_html("https://stackoverflow.com/")
saveRDS(someobject, "someobject.RDS")

This creates a file, but reading it back does not work as expected, i.e.

readRDS("someobject.RDS")
Error in doc_is_html(x$doc) : external pointer is not valid

What's going on and what's the simplest way of saving an html object so that it can be loaded back in with minimal code/fuss?

asked Sep 03 '19 03:09 by stevec






3 Answers

To answer "what's going on": saveRDS is trying to serialize the object being saved. Here, the object someobject is a list with elements someobject$doc and someobject$node. The type of the elements is externalptr (external pointer), which means they reference a C data structure held in memory. When external pointers are serialized, the reference is lost. Hence the error "external pointer is not valid".
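To see the external pointers for yourself, you can inspect the object's internals. This is a minimal sketch; the small inline HTML string is an assumption that stands in for the downloaded page so that no network call is needed.

```r
library(xml2)

# A small inline document stands in for the downloaded page (assumption).
doc <- read_html("<html><body><p>Hello</p></body></html>")

# Strip the S3 class so we look at the underlying list directly.
parts <- unclass(doc)

names(parts)         # names of the list elements
typeof(parts$doc)    # "externalptr"
typeof(parts$node)   # "externalptr"
```

Because both elements are pointers into memory owned by the current R session, writing them to disk stores nothing that a later session can follow back to the parsed document.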

You could serialize someobject using as.character() and pass that to saveRDS:

saveRDS(as.character(someobject), "someobject.RDS")

Then recreate the object using readRDS and read_html:

someobject <- read_html(readRDS("someobject.RDS"))

But it's easier to write the document to a file with xml2's write_html() or write_xml(), as others suggested.

There is some discussion in this GitHub issue thread.
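If you do want to stay with RDS files, the as.character() round trip can be wrapped in two small helpers. This is only a sketch: save_html_rds() and load_html_rds() are hypothetical names, not functions from xml2 or rvest, and the inline HTML string is an assumption used in place of a real page.

```r
library(xml2)

# Hypothetical helper: store the document as plain text inside an RDS file.
save_html_rds <- function(doc, path) {
  saveRDS(as.character(doc), path)
  invisible(path)
}

# Hypothetical helper: read the text back and re-parse it into a new document.
load_html_rds <- function(path) {
  read_html(readRDS(path))
}

# Inline document used instead of a downloaded page (assumption).
doc  <- read_html("<html><body><p>Hello</p></body></html>")
path <- tempfile(fileext = ".RDS")

save_html_rds(doc, path)
restored <- load_html_rds(path)

# The restored object is a freshly parsed document, so queries work again.
xml_text(xml_find_first(restored, "//p"))
```

The trade-off is that the page is parsed a second time on load, which is usually negligible for a single document.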

answered Oct 17 '22 16:10 by neilfws


We can use write_xml and read_html from the xml2 package:

library(xml2)
before <- read_html("https://stackoverflow.com/")
xml2::write_xml(before, "someobject1.xml")
after <- xml2::read_html("someobject1.xml")

However, identical() returns FALSE:

identical(before, after)
#[1] FALSE

but queries on both of them seem to return the same result:

library(rvest)
before %>%  html_nodes("div")
after %>% html_nodes("div")
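
A likely reason identical() reports FALSE is that it compares the two objects themselves, and each one wraps its own external pointer, rather than comparing what the documents contain. A content-level check can be sketched as below; the inline HTML string and the temporary file are assumptions that replace the live page and someobject1.xml.

```r
library(xml2)

# Inline document used instead of a downloaded page (assumption).
before <- read_html("<html><body><div>one</div><div>two</div></body></html>")

# Write to a temporary file and parse it again.
path <- tempfile(fileext = ".html")
write_xml(before, path)
after <- read_html(path)

# Compare the extracted content rather than the objects themselves.
divs_before <- xml_text(xml_find_all(before, "//div"))
divs_after  <- xml_text(xml_find_all(after, "//div"))
identical(divs_before, divs_after)
```

Comparing extracted text or attributes in this way checks the parts of the page you actually use, independent of how the object is held in memory.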

answered Oct 17 '22 17:10 by Ronak Shah


As far as I can tell, the XML and RDS methods both come back with the same number of extra characters compared to the original. I ran a comparison, and the differences between the original and the reloaded versions appear to be in the text of the body node.

library(rvest)
library(stringr)

url <- "https://stackoverflow.com/"
html <- read_html(url)
# Text of the body node in the original, in-memory document
html_node(html, "body") %>% html_text() %>% unlist() -> OBT
nchar(OBT)

28879

xml2::write_xml(html, "someobject1.xml")
after1 <- xml2::read_html("someobject1.xml")  # read the saved file back in
html_node(after1, "body") %>% html_text() %>% unlist() -> BT1
nchar(BT1)

28893

html %>% toString %>% saveRDS(., "someobject.RDS")
after2 <- readRDS("someobject.RDS") %>% read_html
# Measure the reloaded document, not the original
html_node(after2, "body") %>% html_text() %>% unlist() -> BT2
nchar(BT2)

28893

This shows that the two reloaded objects have the same number of characters as each other, 14 more than the original. If we remove the "\n" characters from all three text objects, the counts match:

BT1 %>% str_remove_all(.,"\n") %>% nchar(.)

27733

BT2 %>% str_remove_all(.,"\n") %>% nchar(.) 

27733

OBT %>% str_remove_all(.,"\n") %>% nchar(.) 

27733
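
The same idea can be tried without downloading anything. The sketch below uses two made-up strings (an assumption, not output from the page above) that differ only in their newlines, and a hypothetical base R helper, strip_newlines(), in place of str_remove_all().

```r
# Two made-up strings that differ only in where newlines appear (assumption).
original <- "Title\nFirst paragraph\nSecond paragraph"
reloaded <- "Title\n\nFirst paragraph\n\nSecond paragraph\n"

# Hypothetical helper: drop every newline before comparing.
strip_newlines <- function(x) gsub("\n", "", x, fixed = TRUE)

nchar(original) == nchar(reloaded)                             # FALSE
identical(strip_newlines(original), strip_newlines(reloaded))  # TRUE
```

A comparison like this ignores layout whitespace added during serialization while still catching any change to the visible text.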

answered Oct 17 '22 17:10 by SignorCasa