Using rvest in R to scrape a web page, I'd like to extract the equivalent of innerHTML from a node, in particular to change line breaks into newlines before applying html_text.
Example of desired functionality:
library(rvest)
doc <- read_html('<html><p class="pp">First Line<br />Second Line</p>')
innerHTML(doc, ".pp")
This should produce the following output:
[1] "<p class=\"pp\">First Line<br>Second Line</p>"
With rvest 0.2 this can be achieved through toString.XMLNode:
# run under rvest 0.2
library(XML)
html('<html><p class="pp">First Line<br />Second Line</p>') %>%
html_node(".pp") %>%
toString.XMLNode
[1] "<p class=\"pp\">First Line<br>Second Line</p>"
With the newer rvest 0.2.0.900 this no longer works:
# run under rvest 0.2.0.900
library(XML)
html_node(doc,".pp") %>%
toString.XMLNode
[1] "{xml_node}\n<p>\n[1] <br/>"
The desired functionality is generally available in the write_xml function of the xml2 package, on which rvest now depends. If only write_xml could return its output to a variable instead of insisting on writing to a file (a textConnection is not accepted either).
As a workaround I can temporarily write to a file:
# extract innerHTML, workaround: write/read to/from temp file
html_innerHTML <- function(x, css, xpath) {
  file <- tempfile()
  html_node(x, css) %>% write_xml(file)
  txt <- readLines(file, warn = FALSE)
  unlink(file)
  txt
}
html_innerHTML(doc, ".pp")
[1] "<p class=\"pp\">First Line<br>Second Line</p>"
With this I can then, for example, transform the line-break tags into newline characters:
html_innerHTML(doc, ".pp") %>%
gsub("<br\\s*/?\\s*>","\n", .) %>%
read_html %>%
html_text
[1] "First Line\nSecond Line"
Is there a better way to do this with existing functions from, e.g., rvest, xml2, XML or other packages? In particular, I'd like to avoid writing to the hard disk.
As @r2evans noted, as.character(doc) is the solution.
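To connect this to the question's last pipeline, here is a minimal sketch that swaps the temp-file helper for as.character while keeping the same gsub pattern. The variable names node_html and node_text are only illustrative, and the sketch assumes an rvest version in which read_html and html_node are available.

```r
library(rvest)

doc <- read_html('<html><p class="pp">First Line<br />Second Line</p>')

# Serialize the node in memory; no temporary file is involved
node_html <- as.character(html_node(doc, ".pp"))

# Swap each <br> tag for a newline, then re-parse and extract the text
node_text <- html_text(read_html(gsub("<br\\s*/?\\s*>", "\n", node_html)))
node_text
```

This should give the same text as the temp-file version in the question, without touching the hard disk.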
Regarding your last code snippet, which extracts the <br>-separated text from the node while converting <br> to newlines, there is a workaround in the currently unresolved rvest issue #175, comment #2. A simplified version for this problem:
doc <- read_html('<html><p class="pp">First Line<br />Second Line</p>')
# r2evans's solution:
as.character(rvest::html_node(doc, xpath="//p"))
##[1] "<p class=\"pp\">First Line<br>Second Line</p>"
# rentrop@github's solution, simplified:
innerHTML <- function(x, trim = FALSE, collapse = "\n"){
  paste(xml2::xml_find_all(x, ".//text()"), collapse = collapse)
}
innerHTML(doc)
## [1] "First Line\nSecond Line"
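For readers on a more recent rvest than the versions discussed above, a possible alternative is html_text2, which was added in rvest 1.0.0 and converts <br> to a newline on its own. The sketch below assumes rvest >= 1.0.0 is installed; the variable name text_with_breaks is only illustrative.

```r
library(rvest)

doc <- read_html('<html><p class="pp">First Line<br />Second Line</p>')

# html_text2() (assumed available, rvest >= 1.0.0) turns <br> into "\n"
text_with_breaks <- html_text2(html_element(doc, ".pp"))
text_with_breaks
```

Unlike the text()-based helper, this also normalizes whitespace the way a browser would, which may or may not be what you want for a given page.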