
R {xml_node} to plain text while preserving the tags?

Tags: r, rvest, xml2

I'd like to do exactly what xml2::xml_text() or rvest::html_text() do, but preserve the tags instead of, for example, replacing <br> with \n. The goal is to scrape a web page, extract the nodes I want, and store the raw HTML in a variable, much like write_html() would store it in a file.

How can I do this?

Harold Cavendish asked Sep 14 '18 19:09


1 Answer

Ironically, it turns out that as.character() works just fine: called on a node, it returns the node's markup as a string, tags included.

For example:

library(rvest)
html <- read_html("http://stackoverflow.com")

res <- html %>%
         html_node("h1") %>%
         as.character()

> res

[1] "<h1 class=\"-title\">Learn, Share, Build</h1>"

This is the desired output in my current use case.

For comparison, if you need to strip the tags instead, use html_text():

res <- html %>%
         html_node("h1") %>%
         html_text()

> res
[1] "Learn, Share, Build"
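
The same comparison can be tried offline and on several nodes at once. The following is only a minimal sketch: the inline HTML string and the variable names (page, paragraphs, with_tags, without_tags) are made up for illustration and are not part of the original example. It assumes that as.character() also accepts the node set returned by html_nodes() and serializes each node separately.

```r
library(xml2)
library(rvest)

# Hypothetical inline snippet, used instead of a live page so the result is reproducible
page <- read_html("<div><p class='note'>first<br>second</p><p>third</p></div>")

# html_nodes() returns a node set rather than a single node
paragraphs <- html_nodes(page, "p")

# as.character() serializes each node to a string, keeping tags such as <br>
with_tags <- as.character(paragraphs)

# html_text() keeps only the text content and drops the markup
without_tags <- html_text(paragraphs)

print(with_tags)
print(without_tags)
```

The result is an ordinary character vector with one element per matched node, so it can be stored in a variable or a data frame column rather than written to a file.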
Harold Cavendish answered Sep 20 '22 16:09