I have a big XML file that I need to parse with xmlEventParse in R. Unfortunately, the online examples are more complex than I need. I just want to flag a matching node tag and store the matched node's text (not its attributes), with the text for each tag in a separate list; see the comments in the code below:
library(XML)
z <- xmlEventParse(
    "my.xml",
    handlers = list(
        startDocument = function()
        {
            cat("Starting document\n")
        },
        startElement = function(name, attr)
        {
            if ( name == "myNodeToMatch1" ){
                cat("FLAG Matched element 1\n")
            }
            if ( name == "myNodeToMatch2" ){
                cat("FLAG Matched element 2\n")
            }
        },
        text = function(text) {
            if ( # Matched element 1 .... )
                # Store text in element 1 list
            if ( # Matched element 2 .... )
                # Store text in element 2 list
        },
        endDocument = function()
        {
            cat("ending document\n")
        }
    ),
    addContext = FALSE,
    useTagName = FALSE,
    ignoreBlanks = TRUE,
    trim = TRUE)
z$ ... # show lists ??
My question is: how do I implement this flag in R (in a professional way :)? Plus: what's the best way to match N arbitrary nodes (if name == "myNodeToMatchN" ...) without writing a separate case for each one?
my.xml could be just a naive XML file like:
<A>
  <myNodeToMatch1>Text in NodeToMatch1</myNodeToMatch1>
  <B>
    <myNodeToMatch2>Text in NodeToMatch2</myNodeToMatch2>
    ...
  </B>
</A>
I'll use `fileName` from `example(xmlEventParse)` as a reproducible example. It has `record` tags that have an attribute `id` and text that we'd like to extract. Rather than use the `handlers` argument, I'll go after the `branches` argument. This is like a handler, but one has access to the full node rather than just the element. The idea is to write a closure that has a place to keep the data we accumulate, and a function to process each branch of the XML document we are interested in. So let's start by defining the closure; for our purposes, that is a function that returns a list of functions:
ourBranches <- function() {
We need a place to store the results we accumulate. We choose an environment so that insertion times are constant (not a list, which we would have to append to and which would be memory inefficient):
    store <- new.env()
The event parser expects a list of functions to be invoked when a matching tag is discovered. We're interested in the `record` tag. The function we write will receive a node of the XML document. We want to extract the attribute `id`, which we'll use as the key under which to store the (text) value of the node. We add these to our store.
    record <- function(x, ...) {
        key <- xmlAttrs(x)[["id"]]
        value <- xmlValue(x)
        store[[key]] <- value
    }
Once the document is processed, we'd like a convenient way to retrieve our results, so we add a function for our own purposes, independent of nodes in the document
    getStore <- function() as.list(store)
and then finish the closure by returning a list of functions
    list(record=record, getStore=getStore)
}
A tricky concept here is that the environment in which a function is defined is part of the function, so each time we call `ourBranches()` we get a list of functions and a new environment `store` to keep our results. To use it, invoke `xmlEventParse` on our file, with an empty set of event handlers, and access our accumulated store.
> branches <- ourBranches()
> xmlEventParse(fileName, list(), branches=branches)
list()
> head(branches$getStore(), 2)
$`Hornet Sportabout`
[1] "18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 "
$`Toyota Corolla`
[1] "33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 "
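To match N arbitrary node names without one `if` per name, the same closure idea can be generalized by building one branch function per tag. The following is only a minimal sketch under stated assumptions: `makeBranches` and `collector` are hypothetical names that do not appear above, and the input is the small XML from the question written to a temporary file.

```r
library(XML)

# Hypothetical helper (not part of the XML package): builds one branch
# function per tag name, all sharing a single environment as the store.
makeBranches <- function(tags) {
    store <- new.env()
    # Start every tag with an empty character vector so that tags that
    # never occur in the document still show up in the result.
    for (tag in tags) store[[tag]] <- character()
    collect <- function(tag) {
        force(tag)  # fix the tag name for this particular branch function
        function(x, ...) {
            # Append the node's text to the values collected for this tag.
            store[[tag]] <- c(store[[tag]], xmlValue(x))
        }
    }
    branches <- lapply(setNames(tags, tags), collect)
    list(branches = branches,
         getStore = function() mget(tags, envir = store))
}

# The example document from the question, written to a temporary file.
xmlFile <- tempfile(fileext = ".xml")
writeLines(c(
    "<A>",
    "  <myNodeToMatch1>Text in NodeToMatch1</myNodeToMatch1>",
    "  <B>",
    "    <myNodeToMatch2>Text in NodeToMatch2</myNodeToMatch2>",
    "  </B>",
    "</A>"), xmlFile)

collector <- makeBranches(c("myNodeToMatch1", "myNodeToMatch2"))
invisible(xmlEventParse(xmlFile, handlers = list(),
                        branches = collector$branches))
result <- collector$getStore()
```

Here `result` is a named list with one character vector per tag. Growing a vector with `c()` copies it on every append, so for tags that occur very many times a different accumulation strategy (for example, one environment key per occurrence) may be preferable.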
For others who may try to learn from M. Morgan's answer, here is the complete code:
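library(XML)

# Example file shipped with the XML package
fileName = system.file("exampleData", "mtcars.xml", package = "XML")

ourBranches <- function() {
    # Environment that accumulates the results
    store <- new.env()
    # Called for every <record> node: store its text under its id attribute
    record <- function(x, ...) {
        key <- xmlAttrs(x)[["id"]]
        value <- xmlValue(x)
        store[[key]] <- value
    }
    # Accessor for the accumulated results
    getStore <- function() as.list(store)
    list(record=record, getStore=getStore)
}

branches <- ourBranches()
xmlEventParse(fileName, list(), branches=branches)
head(branches$getStore(), 2)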
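For comparison, the flag approach the question asks about can also be sketched with plain handlers, keeping the flag in the closure's environment. This is a rough sketch, not the method used above: `makeHandlers` and `h` are hypothetical names, and it assumes the matched elements contain only text (no nested matched elements). A SAX parser may deliver one element's text in several pieces, so the text handler appends rather than overwrites.

```r
library(XML)

# Hypothetical helper: handlers that share a "current tag" flag and a store.
makeHandlers <- function(tags) {
    store <- new.env()
    current <- NULL  # the flag: name of the matched element we are inside
    startElement <- function(name, attrs, ...) {
        if (name %in% tags) current <<- name
    }
    text <- function(content, ...) {
        # Only keep text seen while inside a matched element.
        if (!is.null(current))
            store[[current]] <- c(store[[current]], content)
    }
    endElement <- function(name, ...) {
        if (identical(name, current)) current <<- NULL
    }
    list(handlers = list(startElement = startElement,
                         text = text,
                         endElement = endElement),
         getStore = function() as.list(store))
}

# The example document from the question, written to a temporary file.
xmlFile <- tempfile(fileext = ".xml")
writeLines(c(
    "<A>",
    "  <myNodeToMatch1>Text in NodeToMatch1</myNodeToMatch1>",
    "  <B>",
    "    <myNodeToMatch2>Text in NodeToMatch2</myNodeToMatch2>",
    "  </B>",
    "</A>"), xmlFile)

h <- makeHandlers(c("myNodeToMatch1", "myNodeToMatch2"))
invisible(xmlEventParse(xmlFile, handlers = h$handlers,
                        addContext = FALSE, useTagName = FALSE,
                        ignoreBlanks = TRUE, trim = TRUE))
matched <- h$getStore()
```

The `branches` version is usually shorter because each function receives the whole node, while the handler version avoids building nodes at all and only tracks a flag.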