
How does the SAX parser process characters?

Tags: python, xml

I wrote a little code to parse an XML file and want to print its character data, but each piece of text seems to invoke the characters() callback three times.

code:

def characters(self, chrs):
    if self.flag == 1:
        self.outfile.write(chrs + '\n')

xml file:

<e1>9308</e1>
<e2>865</e2>

The output looks like this, with many blank lines:


9308


865

I think it should look like this:

9308

865

Why are there blank lines? I read this in the docs:

characters(self, content)

Receive notification of character data. The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity so that the Locator provides useful information.

So will SAX process one run of character data as several fragments, and invoke the callback several times?

dongjk asked Mar 22 '11 08:03





1 Answer

The example XML you posted is obviously not the full XML, because that would be malformed (and the SAX parser would tell you that instead of producing your output). So I'll assume that there's more to the XML than you showed us.

You need to be aware that all whitespace between XML elements is character data. So if you have something like this:

<foo>
  <bar>123</bar>
</foo>

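Then you have at least 3 text nodes: one containing "\n  " (i.e. one newline, two space characters), one containing "123" and last but not least another one with "\n" (i.e. just a newline). Each of these is passed to characters(), so a handler that writes every chunk followed by '\n' also writes the whitespace-only ones, which show up as blank lines.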
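One possible way to deal with this is sketched below. It is only an illustration, not the asker's actual handler: the class name TextCollector, the wrapping <root> element and the two result lists are assumptions made up for the example. The idea is to buffer whatever characters() delivers, join the buffer when the element ends, and skip anything that is only whitespace. That also covers the case where the parser splits one run of text into several chunks.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler: buffers character data per element."""

    def __init__(self):
        super().__init__()
        self.raw_chunks = []  # every characters() call, unmodified
        self.values = []      # one stripped string per element with real text
        self._buffer = []

    def startElement(self, name, attrs):
        # Drop whitespace collected before this start tag.
        self._buffer = []

    def characters(self, content):
        # May be called several times for one run of text.
        self.raw_chunks.append(content)
        self._buffer.append(content)

    def endElement(self, name):
        text = "".join(self._buffer).strip()
        if text:  # ignore whitespace-only character data
            self.values.append(text)
        self._buffer = []


# Assumed input: the asker's two elements inside a made-up <root> wrapper.
doc = b"<root>\n  <e1>9308</e1>\n  <e2>865</e2>\n</root>"

handler = TextCollector()
xml.sax.parseString(doc, handler)

print([repr(chunk) for chunk in handler.raw_chunks])  # includes whitespace chunks
print(handler.values)
```

How the whitespace is divided into chunks depends on the parser, so the first print may vary, but handler.values should only hold the element text.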

Joachim Sauer answered Sep 21 '22 12:09
