Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to properly parse XML with SAX?

I am receiving an XML document from a REST service which shall be parsed using SAX. Please see the following example which was generated out of the XSD.

Setting up the parser is not a problem. My main problem is the actual processing in the startElement(), endElement() methods etc. I don't understand how to extract the items I need and store them as they are somewhat "nested".

Example

The ConnectionList can occur once or twice and may contain any number of Connection elements which -in turn- have details about a connection. Basically, I need a list of all connections with their Date, Transfers and Time. Do I have to create one class per element?

As far as I got it I somehow need to do the following: If the parser comes across a...

  • ConnectionList: Create new ConnectionList object and put it into a list of ConnectionLists
  • Connection: Create a new Connection object and put it into a list of Connections
  • Date, Transfers, Time (only if parent is Duration): Store the node value in the current Connection object

I'd really appreciate any help, hint, idea, snippet how I can achieve this.

Thanks :-)

Robert

<?xml version="1.0" encoding="UTF-8"?>
<ResC xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <Err code="r5E5a1Wm" text="tk-gWYbw" level="E"/>
    <Err code="takVDd34" text="XtvyjmjPuscK" level="E"/>
    <Err code="hQ1-:aDQ" text="YWc5qtY.gkwCeJW2S" level="E"/>
    <ConRes dir="R">
        <Err code="ZfwPC:tj" text="RKKFuLXoM0oOfp3a" level="E"/>
        <Err code="bhDjSJPa" text="BJoHuOMdwzhcddW" level="E"/>
        <Err code="CX-NhK9r" text="j55qy-WiNPXu" level="E"/>
        <ConResCtxt b="1" f="1">0815</ConResCtxt>
        <ConnectionList type="IV">
            <Err code="WI3WX.jo" text="rK3H5jwa-Zfen3" level="E"/>
            <Connection id="ID000">
                <Overview>
                    <Date>b3lcM_Yiyq7dqL9</Date>
                    <Departure>
                        <BasicStop type="NORMAL" index="-1086549314">
                            <Address externalId="t.EdKe93xkqFqLwPzgd-4vHSJemy8"
                                externalStationNr="1332105793" name="fdREYJPu83WV503V8szdCX"
                                x="951177990" y="-1579782776" z="1807457957" type="WGS84"/>
                        </BasicStop>
                    </Departure>
                    <Arrival>
                        <BasicStop type="NORMAL" index="1897526979">
                            <Address externalId="l7h_GTUit6fv" externalStationNr="-1670310329"
                                name="WJznDTzkTvyET51pfr7X" x="-1738098662" y="-170353174"
                                z="-475585957" type="WGS84"/>
                        </BasicStop>
                    </Arrival>
                    <Transfers>dZbgZfDH8j1hb1i</Transfers>
                    <Duration>
                        <Time>00d00:18:00</Time>
                    </Duration>
                    <ServiceDays> </ServiceDays>
                    <Products>
                        <Product cat="qmrN2dShHJp"/>
                        <Product cat="Hg"/>
                        <Product cat="nurxhdl3w.P0x7FRv2J3UoF"/>
                    </Products>
                    <ContextURL url="http://FzgEqiVC/"/>
                </Overview>
            </Connection>
            <Connection id="ID004">
                <Overview>
                    <Date>W5a47DRkc7XDZjhwq_s5Un.</Date>
                    <Departure>
                        <BasicStop type="NORMAL" index="-1014429844">
                            <Address externalId="RMnzjEFOTTdM1oaAUw" externalStationNr="1429101638"
                                name="HF-1" x="1005198487" y="570832676" z="975615566" type="WGS84"
                            />
                        </BasicStop>
                    </Departure>
                    <Arrival>
                        <BasicStop type="NORMAL" index="-58308182">
                            <Address externalId="rVdwdQvAukfj2QcA7b3OSdGOyW"
                                externalStationNr="1142334006" name="g" x="-1791416159"
                                y="-541300941" z="478129823" type="WGS84"/>
                        </BasicStop>
                    </Arrival>
                    <Transfers>GG56XN6zgiJF804mE_N4o</Transfers>
                    <Duration> </Duration>
                    <ServiceDays> </ServiceDays>
                    <Products>
                        <Product cat="fs_Oyoy9NYBai-qaxbty6j9Y7r1St"/>
                        <Product cat="P2CbaSGpC"/>
                        <Product cat="CGZrqSIDM6M4kUlb8_xZ8jRlH4c"/>
                    </Products>
                    <ContextURL url="http://JkRhuXtu/"/>
                </Overview>
            </Connection>
        </ConnectionList>
        <ConnectionList type="IV">
            <Err code="0lFWRY2X" text="KLmdczFRhV" level="E"/>
            <Connection id="ID012">
                <Overview>
                    <Date>t8mn634zjCZsRPyxj_e_-UYMH</Date>
                    <Departure>
                        <BasicStop type="NORMAL" index="-2095085423">
                            <Address externalId="ftKAFG-Uk7x" externalStationNr="1390920810"
                                name="JQrQXOQbm.FLaCMeSiTYjT" x="1970142849" y="-655980297"
                                z="2102464970" type="WGS84"/>
                        </BasicStop>
                    </Departure>
                    <Arrival>
                        <BasicStop type="NORMAL" index="1552118247">
                            <Address externalId="qcBpeuPDRzvSt1o" externalStationNr="-1133118359"
                                name="AJiJOB1t" x="-1422533132" y="-1158953133" z="484831466"
                                type="WGS84"/>
                        </BasicStop>
                    </Arrival>
                    <Transfers>D0MiUwW9nuuM_uykvawg2C07pwHL</Transfers>
                    <Duration> </Duration>
                    <ServiceDays> </ServiceDays>
                    <Products>
                        <Product cat="LpGOZbLDbJm"/>
                        <Product cat="JIv-szQVX2icPb"/>
                        <Product cat="Q7-pthWoOT"/>
                    </Products>
                    <ContextURL url="http://zGWgivvi/"/>
                </Overview>
                <IList>
                    <I header="ze4Wt3hVD-DvjujY6QKae" text="lVwB4RxAHcYq3.F"
                        uriCustom="iVjQJCoU1MVOv2Z9lwarP"/>
                    <I header="z-i.au59soMzXLZCbV" text="PoTP" uriCustom="ksrbwEH6scNR"/>
                    <I header="N" text="jHDA4" uriCustom="ub95811lMIa_495ZbPOuNWL0rRWh"/>
                </IList>
                <CommentList>
                    <Comment id="ID013">
                        <Text lang="EN"> </Text>
                        <Text lang="FR"> </Text>
                        <Text lang="PL"> </Text>
                    </Comment>
                    <Comment id="ID014">
                        <Text lang="DK"> </Text>
                        <Text lang="IT"> </Text>
                        <Text lang="IT"> </Text>
                    </Comment>
                    <Comment id="ID015">
                        <Text lang="MACRO"> </Text>
                        <Text lang="IT"> </Text>
                        <Text lang="EN"> </Text>
                    </Comment>
                </CommentList>
            </Connection>
        </ConnectionList>
    </ConRes>
</ResC>
like image 828
Robert Strauch Avatar asked Nov 16 '10 23:11

Robert Strauch


People also ask

How can parsing the XML data using DOM and SAX?

The two common ways to parse an XML document are given below: DOM Parser: Parsing the document by loading all the content of the document and creating its hierarchical tree structure. SAX Parser: Parsing based on event-based triggers. It does not require the complete loading of content.

Which method does SAX use for processing XML documents?

SAX. The Simple API for XML (SAX) is an event-based API that uses callback routines or event handlers to process different parts of an XML documents. To use SAX, one needs to register handlers for different events and then parse the document.


2 Answers

The best way I've found (so far) of parsing XML using SAX is to use a stack and conditional statements in the relevant callbacks. Here's an article describing it, and my summary of it:

The basic premise is that as you parse the document, you create objects to store the parsed data, pushing them onto the stack as you go, peeking at the top of the stack to add data to the current element, and at the end of each element popping it off the stack and storing it in the parent.

The effect is that you parse the tree of elements depth first, and at the end of each branch you roll it back into the parent until you're left with a single object (such as your ConnectionList) that contains all of the parsed data ready to be used. Essentially, you end up with a series of objects that mirror the structure of the original XML

That means you need some data objects that can store the data in the same structure as the XML. Complex elements will normally become classes, while simple elements will normally be attributes within classes. The root element is often represented by a list of some kind.

To start with, you create a stack object to hold the data as you parse it.

Then, at the start of each element you identify what type it is using localName.equals() method, create an instance of the appropriate class, and push it into the Stack. If the element is a simple element, you will probably model that as an attribute in the class representing the parent element, and you will need a series of flags that tells the parser if such an element is encountered and what element it is so it can be processed in the characters() method.

The actual data is read using the characters() method, and again you use conditional logic to determine what to do with the data, based on the value of the flag. Essentially, you peek at the top of the stack and use the appropriate method to write the data into the object, converting from text where necessary.

At the end of each element, you pop the top of the stack and use localName.equals() again to determine how to store it in the object before it (e.g. which setter method needs to be called)

When you reach the end of the document you should have captured all the data in the document.

like image 162
chrisbunney Avatar answered Oct 02 '22 16:10

chrisbunney


Your SAX event handler should act as a state machine. Your structure is pretty deep, so the state machine will be a bit complex; but this is the basic approach:

All variables are member variables.

When you encounter a startElement event, you instantiate an object representing that element then put the object on a stack (or set a flag indicating what value you are working with).

When you encounter a text event, read the text and set the appropriate value based on the flag you set in the previous step.

When you encounter a endElement event, you pull the current object off the stack and call the setter on the object that is now on the top of the stack.

When you exhaust the document, you should only have one object left on the stack which represents everything you've read.

like image 35
Jeff Knecht Avatar answered Oct 02 '22 15:10

Jeff Knecht