The read_html function generates an xml_document, which I would like to save and later load again in order to parse it.
The problem is that after loading the xml_document there is no HTML in it.
library(rvest)
library(magrittr)
doc <- read_html("http://www.example.com/")
doc %>% html_node("h1") %>% html_text
I get: [1] "Example Domain"
But when I first save the xml_document object doc and then load it again, it seems that everything has been cleared.
save(doc, file=paste0(getwd(), "/example.RData"))
rm(doc)
load(file=paste0(getwd(), "/example.RData"))
doc %>% html_node("h1") %>% html_text
I get: Error: No matches
Or when I run doc, I get {xml_document}, i.e. an empty xml_document.
In addition, when I run doc after having loaded it, I get a message that RStudio has stopped working.
I have tried it on two different Windows machines and got the same problem on both.
sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] magrittr_1.5 rvest_0.3.1.9000 xml2_0.1.2
loaded via a namespace (and not attached):
[1] httr_1.1.0 R6_2.1.2 tools_3.3.0 Rcpp_0.12.5
I have found a workaround; it is not very efficient, but it does the job.
The idea is to save the xml_document as a string and read it back in with read_html.
library(rvest)
library(magrittr)
doc <- read_html("http://www.example.com/")
# convert it to character
doc %<>% as("character")
save(doc, file=paste0(getwd(), "/example.RData"))
rm(doc)
load(file=paste0(getwd(), "/example.RData"))
doc %>% read_html %>% html_node("h1") %>% html_text
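A likely explanation for the behaviour above is that an xml_document only holds an external pointer to a structure managed by libxml2, and save() stores the pointer but not the data behind it. The same workaround can be sketched so that it runs offline, as a minimal example under a few assumptions: the inline HTML string stands in for the downloaded page, the temporary file path is hypothetical, and as.character() and saveRDS() are swapped in for as("character") and save().

```r
library(xml2)
library(rvest)

# Inline HTML stands in for the downloaded page so the sketch runs offline (assumption).
html <- "<html><body><h1>Example Domain</h1></body></html>"
doc <- read_html(html)

# Serialise the document to plain text; a character vector survives saveRDS()/save().
doc_text <- as.character(doc)

# Hypothetical location for the saved file.
path <- tempfile(fileext = ".rds")
saveRDS(doc_text, path)
rm(doc, doc_text)

# Re-parse the stored text to rebuild a live xml_document.
doc <- read_html(readRDS(path))
h1_text <- html_text(html_node(doc, "h1"))
print(h1_text)
```

The cost is that the page is parsed again on every load. If your xml2 version provides write_html(), writing the document straight to an .html file and reading that file back with read_html() should be an alternative, but I have not checked it against the versions listed in the question.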