R: rvest extracting innerHTML

Question

Using rvest in R to scrape a web-page, I'd like to extract the equivalent of innerHTML from a node, in particular to change line-breaks into newlines before applying html_text.

Example of desired functionality:

library(rvest)
doc <- read_html('<html><p class="pp">First Line<br />Second Line</p>')
innerHTML(doc, ".pp")

This should produce the following output:

[1] "<p class=\"pp\">First Line<br>Second Line</p>"

With rvest 0.2 this can be achieved through toString.XMLNode:

# run under rvest 0.2
library(XML)
html('<html><p class="pp">First Line<br />Second Line</p>') %>% 
  html_node(".pp") %>% 
  toString.XMLNode
[1] "<p class=\"pp\">First Line<br>Second Line</p>"

With the newer rvest 0.2.0.900 this does not work anymore.

# run under rvest 0.2.0.900
library(XML)
html_node(doc,".pp") %>% 
  toString.XMLNode
[1] "{xml_node}
<p>
[1] <br/>"

The desired functionality is generally available in the write_xml function of the xml2 package, on which rvest now depends. If only write_xml could return its output as a string instead of insisting on writing to a file (a textConnection is not accepted either).

As a workaround I can temporarily write to a file:

# extract innerHTML, workaround: write/read to/from temp file
html_innerHTML <- function(x, css) {
  file <- tempfile()
  # serialize the first node matching `css` to the temp file
  html_node(x, css) %>% xml2::write_xml(file)
  txt <- readLines(file, warn = FALSE)
  unlink(file)
  txt
}
html_innerHTML(doc, ".pp") 
[1] "<p class=\"pp\">First Line<br>Second Line</p>"

With this I can then, for example, transform the line-break tags into newline characters:

html_innerHTML(doc, ".pp") %>% 
  gsub("<br\\s*/?\\s*>", "\n", .) %>%
  read_html %>%
  html_text
[1] "First Line\nSecond Line"

Is there a better way to do this with existing functions from e.g. rvest, xml2, XML or other packages? In particular, I'd like to avoid writing to the hard disk.

akraf · Accepted Answer

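As @r2evans noted, as.character() is the solution: applied to the document or to a single node, it returns the serialized markup as a string, with no file involved.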
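A minimal sketch of how this could replace the temp-file workaround from the question. It assumes an xml2-based rvest in which as.character() on a node returns its markup. The wrapper name html_innerHTML is borrowed from the question, and the regular expression is the one used there.

```r
library(rvest)

doc <- read_html('<html><p class="pp">First Line<br />Second Line</p>')

# Serialize the first node matching `css` to a string, entirely in memory.
html_innerHTML <- function(x, css) {
  as.character(html_node(x, css))
}

markup <- html_innerHTML(doc, ".pp")

# Replace <br> tags with newlines, then re-parse and extract the text.
text <- markup %>%
  gsub("<br\\s*/?\\s*>", "\n", .) %>%
  read_html() %>%
  html_text()
```

Here `markup` holds the serialized `<p>` element and `text` holds the two lines separated by a newline, without anything being written to disk.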

Regarding your last code snippet, which extracts the <br>-separated text from the node while converting each <br> to a newline: there is a workaround in the currently unresolved rvest issue #175, comment #2.

The simplified version for this problem:

doc <- read_html('<html><p class="pp">First Line<br />Second Line</p>')

# r2evans's solution:
as.character(rvest::html_node(doc, xpath="//p"))
##[1] "<p class=\"pp\">First Line<br>Second Line</p>"

# rentrop@github's solution, simplified:
innerHTML <- function(x, trim = FALSE, collapse = "\n"){
    paste(xml2::xml_find_all(x, ".//text()"), collapse = collapse)
}
innerHTML(doc)
## [1] "First Line\nSecond Line"
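If a newer rvest is available, the regular-expression step may not be needed at all. The sketch below is not part of the workaround above; it assumes rvest 1.0.0 or later, where html_text2() is documented to convert <br> into a newline and html_element() is the current name for html_node().

```r
library(rvest)

doc <- read_html('<html><p class="pp">First Line<br />Second Line</p>')

# html_text2() approximates how a browser renders text, turning <br> into "\n".
# Assumption: rvest >= 1.0.0 is installed.
text <- html_text2(html_element(doc, ".pp"))
```

Unlike html_text(), html_text2() also normalizes other whitespace, so the result can differ from the raw text nodes on more complex pages.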

Tags: r, tostring, innerhtml, web-scraping, rvest

javrucebo
