
Convert XMLInternalDocument to character vector

Tags: r, xml

What is the best way to cast an object from the {XML} package back to a "normal" R character vector?

For example:

require(XML)
doc <- htmlParse("http://cran.r-project.org/web/packages/XML/index.html")
class(doc)
# [1] "HTMLInternalDocument" "HTMLInternalDocument" 
# "XMLInternalDocument"  "XMLAbstractDocument" 

Similar to this suggestion, I could do this:

doc.char <- capture.output(doc)

But this seems like a roundabout route. I haven't found a more appropriate method, and this has bugged me a few times already.

lukeA asked Feb 18 '14 10:02



2 Answers

If you just want a character vector then use readLines() instead of htmlParse(). But likely you have a more specific need and then the answer is to use XPath to query doc; see ?getNodeSet (and the syntax doc["//path"]) and the examples on that help page.
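As a rough sketch of the XPath route: the inline HTML string below is made up and only stands in for the CRAN page so the snippet runs without network access. xpathSApply() with xmlValue should return an ordinary character vector holding the text of the matched nodes.

```r
library(XML)

# Hypothetical inline page, used instead of the CRAN URL (assumption).
html <- "<html><body><h2>XML</h2><p>First</p><p>Second</p></body></html>"
doc <- htmlParse(html, asText = TRUE)

# Text of every <p> node as a plain character vector.
paragraphs <- xpathSApply(doc, "//p", xmlValue)

# getNodeSet() returns the nodes themselves; sapply() extracts their text.
nodes <- getNodeSet(doc, "//p")
from_nodes <- sapply(nodes, xmlValue)
```

Which XPath expression to use depends on what you actually need from the page; "//p" is only an illustration.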

For your specific question I did

library(XML)
doc <- htmlParse("http://cran.r-project.org/web/packages/XML/index.html")
showMethods(class=class(doc), where=search())

and arrived at

as(doc, "character")
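A minimal sketch of what that looks like in practice, again with a made-up inline page rather than the CRAN URL. As far as I can tell the coercion gives back the serialised document as text; saveXML() is another function from the package that serialises a document, and strsplit() is only needed if you want one element per line.

```r
library(XML)

# Hypothetical inline page, used instead of the CRAN URL (assumption).
html <- "<html><body><p>Hello</p></body></html>"
doc <- htmlParse(html, asText = TRUE)

# Coerce the parsed document back to text.
txt <- as(doc, "character")

# saveXML() also serialises the document to text.
txt2 <- saveXML(doc)

# Optional: split on newlines to get one element per line.
doc_lines <- unlist(strsplit(txt, "\n", fixed = TRUE))
```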
Martin Morgan answered Oct 18 '22 03:10



I think you can achieve this with do.call(paste, as.list(capture.output(doc))), which joins the captured lines into a single string.

(I had similar issues, and I think you can also do it with sapply on the nodes, as @flodel suggested to me in "NodeSet as character".)
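To make the difference concrete, here is a small sketch with a made-up inline page instead of the CRAN URL. capture.output() yields one element per printed line; do.call(paste, ...) then joins those lines with spaces, while paste(..., collapse = "\n") is a variant (my suggestion, not part of the original answer) that keeps the line breaks.

```r
library(XML)

# Hypothetical inline page, used instead of the CRAN URL (assumption).
doc <- htmlParse("<html><body><p>Hello</p></body></html>", asText = TRUE)

# One character element per printed line of the document.
out <- capture.output(doc)

# Joins all lines into one string, separated by single spaces.
joined_space <- do.call(paste, as.list(out))

# Alternative: one string with the original line breaks preserved.
joined_newline <- paste(out, collapse = "\n")
```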

Julien Navarre answered Oct 18 '22 05:10
