I have a huge XML file (260 MB) containing a large amount of data that looks like this:
<mydocument>
<POSITIONS EventTime="2012-09-29T20:31:21" InternalMatchId="0000T0">
<FrameSet GameSection="1sthalf" Match="0000T0" Club="REFEREE" Object="00011D">
<Frame N="0" T="2012-09-29T18:31:21" X="-0.1158" Y="0.2347" S="1.27" />
<Frame N="1" T="2012-09-29T18:31:21" X="-0.1146" Y="0.2351" S="1.3" />
<Frame N="2" T="2012-09-29T18:31:21" X="-0.1134" Y="0.2356" S="1.33" />
</FrameSet>
<FrameSet GameSection="2ndhalf" Match="0000T0" Club="REFEREE" Object="00011D">
<Frame N="0" T="2012-09-29T18:31:21" X="-0.1158" Y="0.2347" S="1.27" />
<Frame N="1" T="2012-09-29T18:31:21.196" X="-0.1146" Y="0.2351" S="1.3" />
<Frame N="2" T="2012-09-29T18:31:21.243" X="-0.1134" Y="0.2356" S="1.33" />
</FrameSet>
</POSITIONS>
</mydocument>
There are around 40 different FrameSet nodes, each with a different GameSection="..." and Object="...".
I would like to extract the information from the <Frame> nodes into a list object, but I cannot load the whole XML file because it is too large. Is there any way I can use the xmlEventParse function to filter for a specific GameSection and a specific Object and get all the information from the corresponding <Frame> elements?
It might be that the 'internal' representation is not that large
xml = xmlTreeParse("file.xml", useInternalNodes=TRUE)
and then XPath will definitely be your best bet. If that doesn't work, you'll need to get your head around closures. I'm going to aim for the branches argument of xmlEventParse, which allows hybrid event parsing to iterate through the file, coupled with DOM parsing on each node. Here's a function that returns a list of functions.
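library(XML)

branchFactory <- function()
{
    env <- new.env(parent=emptyenv())   # storage for results; no parent, for safety
    # branch handler: called with each complete <FrameSet> node
    FrameSet <- function(elt) {
        id <- paste(xmlAttrs(elt), collapse=":")            # key built from the node's attributes
        env[[id]] <- xpathSApply(elt, "//Frame", xmlAttrs)  # attributes of each <Frame>
    }
    get <- function() env               # accessor for the stored results
    list(get=get, FrameSet=FrameSet)
}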
Inside this function we're going to create a place to store our results as we iterate through the file. This could be a list, but it'll be better to use an environment. This will allow us to insert new results without copying all the results that we've already inserted. So here's our environment:
env <- new.env(parent=emptyenv())
We use the parent argument as a measure of safety, even if it's not relevant in our present case. Now we define a function that will be invoked whenever a "FrameSet" node is encountered:
FrameSet <- function(elt) {
    id <- paste(xmlAttrs(elt), collapse=":")
    env[[id]] <- xpathSApply(elt, "//Frame", xmlAttrs)
}
It turns out that, when we use the branches argument, xmlEventParse will have arranged to parse the entire node into an object that we can manipulate via the DOM, e.g., using xmlAttrs and xpathSApply. The first line of this function creates a unique identifier for this frame set (maybe that's not the case for the full data set? You'll need a unique identifier). We then parse the "//Frame" part of the element and store that in our environment. Storing the result is trickier than it looks: we're assigning to a variable called env. env doesn't exist in the body of the FrameSet function, so R uses its lexical scoping rules to search for a variable named env in the environment in which the FrameSet function was defined. And lo, it finds the env that we have already created. This is where we add the result of xpathSApply. That's it for our FrameSet node parser.
We'd also like a convenience function that we can use to retrieve env, like this:
get <- function() env
Again, this is going to use lexical scoping to find the env variable created at the top of branchFactory. We end branchFactory by returning a list of the functions that we've defined:
list(get=get, FrameSet=FrameSet)
This too is surprisingly tricky: we're returning a list of functions. The functions are defined in the environment created when we invoke branchFactory and, for lexical scope to work, that environment has to persist. So actually we're returning not only the list of functions but also, implicitly, the variable env. In brief, what we return is a closure.
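If the closure idea is new, it may help to see it without any XML involved. The following is only a small sketch: makeStore, put, s1 and s2 are made-up names for illustration and are not part of the XML package. It mirrors the shape of branchFactory and shows that the environment outlives the call that created it, and that each call gets its own.

```r
# Hypothetical stand-in for branchFactory, with no XML involved.
makeStore <- function() {
    env <- new.env(parent=emptyenv())
    # 'env' is not defined inside put(); R finds it by lexical scoping
    # in the environment where put() was created.
    put <- function(key, value) env[[key]] <- value
    get <- function() env
    list(put=put, get=get)
}

s1 <- makeStore()
s2 <- makeStore()

s1$put("a", 1)
s1$put("b", 2)

stored_in_s1 <- sort(ls(s1$get()))   # keys added through s1
stored_in_s2 <- ls(s2$get())         # s2 has its own, untouched env

# Both functions of one instance share the same enclosing environment.
shared <- identical(environment(s1$put), environment(s1$get))
```

Because environments are reference objects, put() modifies the one env that get() later returns, which is exactly what the FrameSet handler relies on.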
We're now ready to parse our file. Do this by creating an instance of the branch parser, with its own unique versions of the get and FrameSet functions and of the env variable created to store results. Then parse the file:
b <- branchFactory()
xx <- xmlEventParse("file.xml", handlers=list(), branches=b)
We can retrieve the results using b$get(), and can cast this to a list if that's convenient.
> as.list(b$get())
$`1sthalf:0000T0:REFEREE:00011D`
  [,1]                  [,2]                  [,3]
N "0"                   "1"                   "2"
T "2012-09-29T18:31:21" "2012-09-29T18:31:21" "2012-09-29T18:31:21"
X "-0.1158"             "-0.1146"             "-0.1134"
Y "0.2347"              "0.2351"              "0.2356"
S "1.27"                "1.3"                 "1.33"

$`2ndhalf:0000T0:REFEREE:00011D`
  [,1]                  [,2]                      [,3]
N "0"                   "1"                       "2"
T "2012-09-29T18:31:21" "2012-09-29T18:31:21.196" "2012-09-29T18:31:21.243"
X "-0.1158"             "-0.1146"                 "-0.1134"
Y "0.2347"              "0.2351"                  "0.2356"
S "1.27"                "1.3"                     "1.33"
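The question asked for a single GameSection and Object rather than everything. One way to get that, sketched here under the assumption that the attribute names are exactly those in the sample, is to test the attributes inside the handler and store only the matching frame set. filteredBranchFactory, section, object and demo_file are names invented for this example, and the tiny temporary file stands in for the real one. The parser still reads the whole file, but only the matching frames are kept in memory.

```r
library(XML)

# Hypothetical variant of branchFactory: keeps only the FrameSet whose
# GameSection and Object attributes equal the requested values.
filteredBranchFactory <- function(section, object) {
    env <- new.env(parent=emptyenv())
    FrameSet <- function(elt) {
        attrs <- xmlAttrs(elt)
        keep <- isTRUE(unname(attrs["GameSection"]) == section) &&
            isTRUE(unname(attrs["Object"]) == object)
        if (keep) {
            id <- paste(attrs, collapse=":")
            env[[id]] <- xpathSApply(elt, "//Frame", xmlAttrs)
        }
        invisible(NULL)
    }
    get <- function() env
    list(get=get, FrameSet=FrameSet)
}

# Small stand-in for the real file, written to a temporary location.
demo_file <- tempfile(fileext=".xml")
writeLines(c(
    '<mydocument>',
    '<POSITIONS EventTime="2012-09-29T20:31:21" InternalMatchId="0000T0">',
    '<FrameSet GameSection="1sthalf" Match="0000T0" Club="REFEREE" Object="00011D">',
    '<Frame N="0" T="2012-09-29T18:31:21" X="-0.1158" Y="0.2347" S="1.27" />',
    '</FrameSet>',
    '<FrameSet GameSection="2ndhalf" Match="0000T0" Club="REFEREE" Object="00011D">',
    '<Frame N="0" T="2012-09-29T18:31:21" X="-0.1158" Y="0.2347" S="1.27" />',
    '<Frame N="1" T="2012-09-29T18:31:21.196" X="-0.1146" Y="0.2351" S="1.3" />',
    '<Frame N="2" T="2012-09-29T18:31:21.243" X="-0.1134" Y="0.2356" S="1.33" />',
    '</FrameSet>',
    '</POSITIONS>',
    '</mydocument>'), demo_file)

fb <- filteredBranchFactory("2ndhalf", "00011D")
# Register only the FrameSet handler as a branch.
invisible(xmlEventParse(demo_file, handlers=list(), branches=fb["FrameSet"]))
res <- as.list(fb$get())

# One row per Frame; attributes arrive as character, so convert as needed.
frames <- as.data.frame(t(res[[1]]), stringsAsFactors=FALSE)
frames$X <- as.numeric(frames$X)
frames$Y <- as.numeric(frames$Y)
```

Passing fb["FrameSet"] rather than the whole list means get is not registered as a handler for a (non-existent) element called "get". The original call with branches=b works too, since no such element appears in the file.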