
Parse an XML file and return an R character vector

Tags: r, xml

I've parsed an XML document with R, e.g.:

library(XML)
f = system.file("exampleData", "mtcars.xml", package="XML")
doc = xmlParse(f)

Using XPath expressions, I can select specific nodes in the document:

> getNodeSet(doc, "//record[@id='Mazda RX4']/text()")
[[1]]
   21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4 

attr(,"class")
[1] "XMLNodeSet"

But I can't figure out how to turn the result into an R character vector:

> as.character(getNodeSet(doc, "//record[@id='Mazda RX4']/text()"))
[1] "<pointer: 0x000000000e6a7fe0>"

How do I get text from an internal pointer to a C object?

Zach asked Jul 12 '12 15:07



2 Answers

Use xmlValue. Here's an extension of your example to help you see what the classes are:

v <- getNodeSet(doc, "//record[@id='Mazda RX4']/text()")
str(v)
#List of 1
#$ :Classes 'XMLInternalTextNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
#- attr(*, "class")= chr "XMLNodeSet"
v2 <- sapply(v, xmlValue)  #this is the code chunk of interest to you
v2
#[1] "   21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4"
str(v2)
#chr "   21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4"
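
As a sketch of the same idea in a single call: xpathSApply() from the XML package applies a function to each matched node and simplifies the result, so with xmlValue it should hand back a plain character vector rather than a list. The variable name txt is only illustrative.

```r
library(XML)

# Example file shipped with the XML package (same as in the question)
f <- system.file("exampleData", "mtcars.xml", package = "XML")
doc <- xmlParse(f)

# Apply xmlValue to every matched text node and simplify to a vector
txt <- xpathSApply(doc, "//record[@id='Mazda RX4']/text()", xmlValue)

is.character(txt)  # TRUE
length(txt)        # 1, since only one record matches
```

If you already have the node set from getNodeSet(), vapply(v, xmlValue, character(1)) is a stricter alternative to sapply(), because it fails loudly if a node does not yield exactly one string.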
Tyler Rinker answered Sep 19 '22 23:09


The following also works: instead of calling getNodeSet() and then sapply(v, xmlValue), you can call xpathApply() and pass xmlValue as the function to apply to each matched node.

doc = xmlParse(f)
xpathApply(doc,"//record[@id='Mazda RX4']/text()")

[[1]]
   21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4 

attr(,"class")
[1] "XMLNodeSet"

xpathApply(doc,"//record[@id='Mazda RX4']/text()",xmlValue)

[[1]]
[1] "   21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4"

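This is a list holding a single character string. To turn it into a numeric vector, unlist it, split the string on runs of one or more spaces, unlist again, and call as.numeric(). Here v holds the result of the xpathApply() call with xmlValue above. The leading NA in the output comes from the empty string that strsplit() returns for the whitespace at the start of the record; passing the string through trimws() before splitting avoids it.

 v <- xpathApply(doc,"//record[@id='Mazda RX4']/text()",xmlValue)
 as.numeric(unlist(strsplit(unlist(v)," +")))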
 [1]     NA  21.00   6.00 160.00 110.00   3.90   2.62  16.46   0.00   1.00   4.00   4.00
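
The NA at the start of that output comes from the leading whitespace in the record. A minimal sketch of one way to avoid it, assuming base R's trimws() is available (R 3.2.0 or later); the names txt and vals are illustrative:

```r
library(XML)

f <- system.file("exampleData", "mtcars.xml", package = "XML")
doc <- xmlParse(f)

# Text of the matching record as a character vector of length 1
txt <- xpathSApply(doc, "//record[@id='Mazda RX4']/text()", xmlValue)

# Strip leading/trailing whitespace first, then split on runs of whitespace,
# so no empty string (and hence no NA) appears at the start
vals <- as.numeric(strsplit(trimws(txt), "\\s+")[[1]])

anyNA(vals)   # FALSE
length(vals)  # 11
```

scan(text = txt, quiet = TRUE) is another base R option that reads whitespace-separated numbers directly and skips the manual trimming and splitting.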
Matthew MacLennan answered Sep 20 '22 23:09