 

How to let the SAX parser determine the encoding from the XML declaration?

I'm trying to parse XML files from different sources (over which I have little control). Most of them are encoded in UTF-8 and don't cause any problems with the following snippet:

SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
FeedHandler handler = new FeedHandler();
InputSource is = new InputSource(getInputStream());
parser.parse(is, handler);

Since SAX defaults to UTF-8, this is fine. However, some of the documents declare:

<?xml version="1.0" encoding="ISO-8859-1"?> 

Even though ISO-8859-1 is declared, SAX still defaults to UTF-8. It uses the correct encoding only if I add:

is.setEncoding("ISO-8859-1");

How can I let SAX automatically detect the correct encoding from the XML declaration without setting it explicitly? I need this because I don't know beforehand what the encoding of the file will be.

Thanks in advance, Allan

Allan asked Aug 14 '10 07:08





1 Answer

Pass an InputStream (a byte stream) to InputSource when you want SAX to autodetect the encoding.

If you want to force a specific encoding, pass a Reader constructed with that encoding, or call the setEncoding method on the InputSource.

Why? Because encoding autodetection needs the raw bytes, not data that has already been decoded into characters.
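To make the difference concrete, here is a minimal sketch, assuming a small in-memory ISO-8859-1 document instead of a real feed. The class EncodingDetectDemo, the TextCollector handler and the helper methods are made up for illustration; only the javax.xml.parsers and org.xml.sax calls are standard JDK API. It parses the same bytes once through an InputStream, where the parser reads the declaration itself, and once through a Reader, where the caller has already chosen the charset.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EncodingDetectDemo {

    /** Hypothetical handler: collects character data so the decoded text can be inspected. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Parses the given source and returns all text content of the document. */
    static String parse(InputSource source) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        TextCollector handler = new TextCollector();
        parser.parse(source, handler);
        return handler.text.toString();
    }

    /** Sample document (an assumption for this sketch): declared and encoded as ISO-8859-1. */
    static byte[] latin1Document() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>caf\u00e9</name>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Byte stream: the parser sees the raw bytes and can honour the declared encoding. */
    static String parseFromBytes(byte[] doc) throws Exception {
        return parse(new InputSource(new ByteArrayInputStream(doc)));
    }

    /** Character stream: decoding happens here, before the parser sees any data. */
    static String parseFromReader(byte[] doc, Charset charset) throws Exception {
        Reader reader = new InputStreamReader(new ByteArrayInputStream(doc), charset);
        return parse(new InputSource(reader));
    }

    public static void main(String[] args) throws Exception {
        byte[] doc = latin1Document();
        // Autodetected from the XML declaration.
        System.out.println(parseFromBytes(doc));
        // Decoded with a charset that does not match the bytes; the text is likely garbled.
        System.out.println(parseFromReader(doc, StandardCharsets.UTF_8));
    }
}
```

If autodetection still appears to fail with an InputStream, it is worth checking whether the stream was wrapped in a Reader somewhere upstream, or whether the bytes really are in the encoding the declaration claims.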

The question in the subject is how to let the SAX parser determine the encoding from the XML declaration. I found Allan's answer to that question misleading, so I have provided an alternative based on Jörn Horstmann's comment and my own later experience.

Jarekczek answered Sep 21 '22 18:09
