
Storing XML node values with R's xmlEventParse for filtered output

Tags: r, xml, xml-parsing

I have a huge XML file (260 MB) containing a large amount of data that looks like this:

Example:

<mydocument>
<POSITIONS EventTime="2012-09-29T20:31:21" InternalMatchId="0000T0">
<FrameSet GameSection="1sthalf" Match="0000T0" Club="REFEREE" Object="00011D">
<Frame N="0" T="2012-09-29T18:31:21" X="-0.1158" Y="0.2347" S="1.27" />
<Frame N="1" T="2012-09-29T18:31:21" X="-0.1146" Y="0.2351" S="1.3" />
<Frame N="2" T="2012-09-29T18:31:21" X="-0.1134" Y="0.2356" S="1.33" />
</FrameSet>
<FrameSet GameSection="2ndhalf" Match="0000T0" Club="REFEREE" Object="00011D">
<Frame N="0" T="2012-09-29T18:31:21" X="-0.1158" Y="0.2347" S="1.27" />
<Frame N="1" T="2012-09-29T18:31:21.196" X="-0.1146" Y="0.2351" S="1.3" />
<Frame N="2" T="2012-09-29T18:31:21.243" X="-0.1134" Y="0.2356" S="1.33" />
</FrameSet>
</POSITIONS>
</mydocument>

There are around 40 different FrameSet nodes, each with a different combination of GameSection="..." and Object="...".

I would love to extract the information in the <Frame> nodes into a list object, but I cannot load the whole XML file because it is too large. Is there a way to use the xmlEventParse function to filter for a specific GameSection and a specific Object and get all the information from the corresponding <Frame> elements?

user2406692 asked May 21 '13 18:05




1 Answer

It might be that the 'internal' representation of the document is small enough to fit in memory:

library(XML)
xml <- xmlTreeParse("file.xml", useInternalNodes=TRUE)

If so, XPath will definitely be your best bet. If that doesn't work, you'll need to get your head around closures. I'm going to aim for the branches argument of xmlEventParse, which allows a hybrid approach: event parsing to iterate through the file, coupled with DOM parsing of each node of interest. Here's a function that returns a list of functions.

branchFactory <-
    function()
{
    env <- new.env(parent=emptyenv())   # safety

    FrameSet <- function(elt) {
        id <- paste(xmlAttrs(elt), collapse=":")
        env[[id]] <- xpathSApply(elt, "//Frame", xmlAttrs)
    }

    get <- function() env

    list(get=get, FrameSet=FrameSet)
}

Inside this function we're going to create a place to store our results as we iterate through the file. This could be a list, but it'll be better to use an environment. This will allow us to insert new results without copying all the results that we've already inserted. So here's our environment:

    env <- new.env(parent=emptyenv())

We use the parent argument as a measure of safety, even if it's not relevant in our present case. Now we define a function that will be invoked whenever a "FrameSet" node is encountered:

    FrameSet <- function(elt) {
        id <- paste(xmlAttrs(elt), collapse=":")
        env[[id]] <- xpathSApply(elt, "//Frame", xmlAttrs)
    }

It turns out that, when we use the branches argument, xmlEventParse will have arranged to parse the entire node into an object that we can manipulate via the DOM, e.g., using xmlAttrs and xpathSApply. The first line of this function creates a unique identifier for this frame set (maybe that's not the case for the full data set; you'll need a unique identifier). We then extract the attributes of the "//Frame" nodes of the element, and store them in our environment. Storing the result is trickier than it looks -- we're assigning to a variable called env. env isn't defined in the body of the FrameSet function, so R uses its lexical scoping rules to search for a variable named env in the environment in which the FrameSet function was defined. And lo, it finds the env that we have already created. This is where we add the result of xpathSApply. That's it for our FrameSet node parser.
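The scoping behaviour described above can be seen in isolation with a minimal base-R sketch. The names makeCollector, store and add are hypothetical and only illustrate the pattern; they are not part of the XML package.

```r
# Hypothetical helper: a factory whose inner function writes into an
# environment created in the enclosing function.
makeCollector <- function() {
    store <- new.env(parent = emptyenv())

    add <- function(key, value) {
        # `store` is not defined inside add(); R finds it in the
        # environment where add() was created (lexical scoping).
        store[[key]] <- value
        invisible(NULL)
    }

    get <- function() store

    list(add = add, get = get)
}

collector <- makeCollector()
collector$add("a", 1:3)
collector$add("b", letters[1:2])
keys <- sort(ls(collector$get()))
```

Because environments are modified in place, each call to add() inserts a new entry without copying the entries stored earlier.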

We'd also like a convenience function that we can use to retrieve env, like this:

    get <- function() env

Again, this is going to use lexical scoping to find the env variable created at the top of branchFactory. We end branchFactory by returning a list of the functions that we've defined:

    list(get=get, FrameSet=FrameSet)

This too is surprisingly tricky -- we're returning a list of functions. The functions are defined in the environment created when we invoke branchFactory and, for lexical scope to work, the environment has to persist. So actually we're returning not only the list of functions, but also, implicitly, the variable env.
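One consequence of the environment persisting is that every call to the factory gets its own private state. A small base-R sketch, using a hypothetical makeCounter function rather than anything from the answer, shows two instances that do not interfere with each other:

```r
# Hypothetical factory: each call creates a fresh environment that
# lives on for as long as the returned functions do.
makeCounter <- function() {
    env <- new.env(parent = emptyenv())
    env$n <- 0L

    bump <- function() {
        env$n <- env$n + 1L   # updates the shared environment in place
        invisible(env$n)
    }

    get <- function() env$n

    list(bump = bump, get = get)
}

first <- makeCounter()
second <- makeCounter()

first$bump()
first$bump()
second$bump()

firstCount <- first$get()    # state of the first instance
secondCount <- second$get()  # state of the second instance
```

The same holds for branchFactory: each call returns handlers bound to a separate env, so parsing a second file with a new instance does not touch the results of the first.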

We're now ready to parse our file. Do this by creating an instance of the branch parser, with its own unique versions of the get and FrameSet functions and of the env variable created to store results. Then parse the file:

b <- branchFactory()
xx <- xmlEventParse("file.xml", handlers=list(), branches=b)

We can retrieve the results using b$get(), and can cast this to a list if that's convenient.

> as.list(b$get())
$`1sthalf:0000T0:REFEREE:00011D`
  [,1]                  [,2]                  [,3]                 
N "0"                   "1"                   "2"                  
T "2012-09-29T18:31:21" "2012-09-29T18:31:21" "2012-09-29T18:31:21"
X "-0.1158"             "-0.1146"             "-0.1134"            
Y "0.2347"              "0.2351"              "0.2356"             
S "1.27"                "1.3"                 "1.33"               

$`2ndhalf:0000T0:REFEREE:00011D`
  [,1]                  [,2]                      [,3]                     
N "0"                   "1"                       "2"                      
T "2012-09-29T18:31:21" "2012-09-29T18:31:21.196" "2012-09-29T18:31:21.243"
X "-0.1158"             "-0.1146"                 "-0.1134"                
Y "0.2347"              "0.2351"                  "0.2356"                 
S "1.27"                "1.3"                     "1.33"                   
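
The question asked for a specific GameSection and Object, whereas the code above keeps every frame set. One possible adaptation, offered as a sketch rather than a tested solution for the full file, is to check the attributes inside the handler and store only matching nodes. The name filteredBranchFactory and its two arguments are hypothetical, and the temporary file below is a small stand-in for the real data.

```r
library(XML)

# Hypothetical helper, not part of the XML package: like branchFactory(),
# but it stores only FrameSets matching the requested attribute values.
filteredBranchFactory <- function(gameSection, object) {
    env <- new.env(parent = emptyenv())

    FrameSet <- function(elt) {
        attrs <- xmlAttrs(elt)
        # unname() so identical() compares values only; a missing
        # attribute yields NA and therefore no match.
        keep <- identical(unname(attrs["GameSection"]), gameSection) &&
            identical(unname(attrs["Object"]), object)
        if (keep) {
            id <- paste(attrs, collapse = ":")
            env[[id]] <- xpathSApply(elt, "//Frame", xmlAttrs)
        }
    }

    get <- function() env

    list(get = get, FrameSet = FrameSet)
}

# A small stand-in for the real file, written to a temporary location.
path <- tempfile(fileext = ".xml")
writeLines(c(
    '<mydocument>',
    '<POSITIONS EventTime="2012-09-29T20:31:21" InternalMatchId="0000T0">',
    '<FrameSet GameSection="1sthalf" Match="0000T0" Club="REFEREE" Object="00011D">',
    '<Frame N="0" T="2012-09-29T18:31:21" X="-0.1158" Y="0.2347" S="1.27" />',
    '<Frame N="1" T="2012-09-29T18:31:21" X="-0.1146" Y="0.2351" S="1.3" />',
    '</FrameSet>',
    '<FrameSet GameSection="2ndhalf" Match="0000T0" Club="REFEREE" Object="00011D">',
    '<Frame N="0" T="2012-09-29T18:31:21" X="-0.1158" Y="0.2347" S="1.27" />',
    '<Frame N="1" T="2012-09-29T18:31:21.196" X="-0.1146" Y="0.2351" S="1.3" />',
    '<Frame N="2" T="2012-09-29T18:31:21.243" X="-0.1134" Y="0.2356" S="1.33" />',
    '</FrameSet>',
    '</POSITIONS>',
    '</mydocument>'
), path)

b <- filteredBranchFactory("2ndhalf", "00011D")
# Only the FrameSet handler is passed as a branch; get() stays outside.
invisible(xmlEventParse(path, handlers = list(), branches = b["FrameSet"]))
frames <- as.list(b$get())
```

With this sample, frames should contain a single entry for the second-half frame set. Non-matching frame sets are still parsed into nodes before being discarded, so the saving is in what gets stored, not in parsing time.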

Martin Morgan answered Nov 15 '22 03:11