
R: rvest extracting innerHTML

Using rvest in R to scrape a web-page, I'd like to extract the equivalent of innerHTML from a node, in particular to change line-breaks into newlines before applying html_text.

Example of desired functionality:

library(rvest)
doc <- read_html('<html><p class="pp">First Line<br />Second Line</p>')
innerHTML(doc, ".pp")

This should produce the following output:

[1] "<p class=\"pp\">First Line<br>Second Line</p>"

With rvest 0.2 this can be achieved through toString.XMLNode:

# run under rvest 0.2
library(XML)
html('<html><p class="pp">First Line<br />Second Line</p>') %>% 
  html_node(".pp") %>% 
  toString.XMLNode
[1] "<p class=\"pp\">First Line<br>Second Line</p>"

With the newer rvest 0.2.0.900 this no longer works:

# run under rvest 0.2.0.900
library(XML)
html_node(doc,".pp") %>% 
  toString.XMLNode
[1] "{xml_node}\n<p>\n[1] <br/>"

The desired functionality is essentially available in the write_xml function of package xml2, on which rvest now depends. If only write_xml could return its output as a string instead of insisting on writing to a file (a textConnection is not accepted either).

As a workaround I can temporarily write to a file:

# extract innerHTML, workaround: write/read to/from temp file
html_innerHTML <- function(x, css) {
  file <- tempfile()
  on.exit(unlink(file))  # remove the temp file even if an error occurs
  html_node(x, css) %>% write_xml(file)
  readLines(file, warn = FALSE)
}
html_innerHTML(doc, ".pp") 
[1] "<p class=\"pp\">First Line<br>Second Line</p>"

With this I can then, for example, transform the line-break tags into newline characters:

html_innerHTML(doc, ".pp") %>% 
  gsub("<br\\s*/?\\s*>","\n", .) %>%
  read_html %>%
  html_text
[1] "First Line\nSecond Line"

Is there a better way to do this with existing functions from e.g. rvest, xml2, XML or other packages? In particular, I'd like to avoid writing to the hard disk.

asked May 08 '15 17:05 by javrucebo


1 Answer

As @r2evans noted, as.character(doc) is the solution.
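As a minimal sketch of how this replaces the temp-file workaround from the question: the helper name html_outer_html below is hypothetical (it is not part of rvest or xml2), and the sketch assumes an rvest version built on xml2, where as.character() on a node returns its serialized markup.

```r
library(rvest)  # provides html_node(), html_text() and the %>% pipe
library(xml2)   # provides read_html()

doc <- read_html('<html><p class="pp">First Line<br />Second Line</p>')

# Hypothetical helper: serialize the first node matching `css` to a string,
# entirely in memory (no temporary file involved).
html_outer_html <- function(x, css) {
  as.character(html_node(x, css))
}

markup <- html_outer_html(doc, ".pp")

# Same post-processing as in the question: turn <br> tags into newlines,
# re-parse the modified markup, and extract the text.
txt <- markup %>%
  gsub("<br\\s*/?\\s*>", "\n", .) %>%
  read_html() %>%
  html_text()
```

The regular expression is the one from the question; it only targets <br> tags without attributes.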

Regarding your last code snippet, which extracts the text of the node while converting <br> to newlines: there is a workaround in the currently unresolved rvest issue #175, comment #2.

A simplified version for this problem:

doc <- read_html('<html><p class="pp">First Line<br />Second Line</p>')

# r2evans's solution:
as.character(rvest::html_node(doc, xpath="//p"))
## [1] "<p class=\"pp\">First Line<br>Second Line</p>"

# rentrop@github's solution, simplified:
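innerHTML <- function(x, trim = FALSE, collapse = "\n"){
    # collect all text nodes below x and join them with `collapse`
    texts <- xml2::xml_find_all(x, ".//text()")
    paste(xml2::xml_text(texts, trim = trim), collapse = collapse)
}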
innerHTML(doc)
## [1] "First Line\nSecond Line"
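Note that as.character() on the node includes the enclosing <p> tag, so strictly speaking it returns the outer HTML. If only the markup between the tags is wanted, one possible sketch (an assumption on my part, not taken from the linked issue) is to serialize the node's children with xml2::xml_contents() and concatenate them:

```r
library(xml2)

doc <- read_html('<html><p class="pp">First Line<br />Second Line</p>')
node <- xml_find_first(doc, "//p[@class='pp']")

# xml_contents() returns all child nodes, text nodes included.
# Serializing each child and gluing the pieces together yields the markup
# inside the node, without the enclosing <p> tag.
inner <- paste(as.character(xml_contents(node)), collapse = "")
```

Unlike the text-node approach above, this keeps child elements such as <br> as markup, so it can be fed into the same gsub step as before.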
answered Oct 26 '22 18:10 by akraf