Dynamic SAX Parser for UTF-8 or ISO-8859-1 encoded XML

Question

I am developing an app for Android where I have to parse different XML files. Most of them are encoded in UTF-8, but a few may be encoded in ISO-8859-1.

  HttpURLConnection con = (HttpURLConnection) url.openConnection();
  ...
  in = con.getInputStream();
  InputSource is = new InputSource(in);
  ...
  parser.parse(is, handler);

My code for handling the input is shown above. The Java documentation for InputSource says:

If there is no character stream, but there is a byte stream, the parser will use that byte stream, using the encoding specified in the InputSource or else (if no encoding is specified) autodetecting the character encoding using an algorithm such as the one in the XML specification.

I am passing in a byte stream and I don't specify an encoding, so according to the documentation the encoding should be autodetected. But it isn't. All files that are encoded in UTF-8 are fine, but the ISO-8859-1 ones are not (I am getting a Parser Expat... Exception for some invalid characters). If I manually set the encoding of the InputSource to "ISO-8859-1", it behaves the other way round.

How can I solve this? I searched Google and Stack Overflow for hours without finding a solution. I also tried passing a character stream to the InputSource, but some characters (äöüÄÖÜß) in the ISO-8859-1 files are still displayed as "?" in my app.

Thanks in advance!

jarnbjo · Accepted Answer

The best solution depends on the exact cause of your problem. If you retrieve an XML document over HTTP, the encoding may be specified in the Content-Type response header rather than in the XML document itself. If that is the case and the XML libraries in Android are correctly implemented (I have no way to check here whether the Content-Type header is evaluated), you should be able to create the InputSource directly from the URL instead: new InputSource("http://...").
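If it turns out that the header is not evaluated for you, the charset can be read from the Content-Type value by hand. The following is only a minimal sketch: the class name ContentTypeCharset and the method charsetOf are made up for illustration, and the parsing is deliberately simple (it splits on ";" and looks for a charset parameter).

```java
import java.util.Locale;

final class ContentTypeCharset {

    /**
     * Returns the value of the charset parameter of a Content-Type header
     * value such as "text/xml; charset=ISO-8859-1", or null if there is none.
     */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip optional double quotes around the value.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }
}
```

With the connection from the question this could be used as follows, assuming con and is are the variables shown there:

  String enc = ContentTypeCharset.charsetOf(con.getContentType());
  if (enc != null) {
      is.setEncoding(enc);
  }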

If the encoding is neither set in the HTTP header nor specified in the XML prolog, the parser is operating correctly when it assumes UTF-8 (as mandated by the XML specification). The autodetection mentioned in the documentation does not mean that the parser inspects the document content to guess the encoding; it means that it checks the encoding attribute of the XML declaration. If the encoding attribute is missing, it defaults to UTF-8.
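If the parser at hand does not pick up the declared encoding on its own, the same check can be done before parsing. This is a rough sketch under several assumptions: XmlDeclarationSniffer, declaredEncoding and PEEK_LIMIT are hypothetical names, the stream must support mark/reset (for example a BufferedInputStream), and the approach only works for ASCII-compatible encodings such as UTF-8 and ISO-8859-1, not for UTF-16.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XmlDeclarationSniffer {

    // Number of leading bytes to inspect; the XML declaration is normally far shorter.
    private static final int PEEK_LIMIT = 256;

    // Matches the encoding pseudo-attribute inside a leading <?xml ... ?> declaration.
    private static final Pattern ENCODING = Pattern.compile(
            "^<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /**
     * Peeks at the start of the stream and returns the encoding named in the
     * XML declaration, or "UTF-8" if none is declared. The stream is reset,
     * so the parser still sees the document from the first byte.
     */
    static String declaredEncoding(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(PEEK_LIMIT);
        byte[] buf = new byte[PEEK_LIMIT];
        int total = 0;
        int n;
        while (total < buf.length && (n = in.read(buf, total, buf.length - total)) > 0) {
            total += n;
        }
        in.reset();

        // Skip a UTF-8 byte order mark if one is present.
        int offset = 0;
        if (total >= 3 && buf[0] == (byte) 0xEF && buf[1] == (byte) 0xBB && buf[2] == (byte) 0xBF) {
            offset = 3;
        }
        // The declaration itself is pure ASCII, so a single-byte decoding is safe here.
        String head = new String(buf, offset, total - offset, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(head);
        return m.find() ? m.group(1) : "UTF-8";
    }
}
```

Applied to the code from the question, the stream would be wrapped and the result passed to the InputSource:

  InputStream buffered = new BufferedInputStream(con.getInputStream());
  InputSource is = new InputSource(buffered);
  is.setEncoding(XmlDeclarationSniffer.declaredEncoding(buffered));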

rekire · Answer

I would suggest checking whether the string contains characters outside the plain ASCII set and re-decoding it if they look like misread UTF-8 sequences:

  String output = new String(input.getBytes("ISO-8859-1"), "UTF-8");

That line turns the string back into its ISO-8859-1 bytes and decodes those bytes as UTF-8.
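To make the effect of that line concrete, here is a small self-contained sketch. The class name MojibakeRepair and the method redecodeAsUtf8 are invented for this example. It only helps when UTF-8 bytes were wrongly decoded as ISO-8859-1; applied to text that was decoded correctly and contains non-ASCII characters, it will most likely damage those characters instead.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeRepair {

    /**
     * Recovers text whose UTF-8 bytes were mistakenly decoded as ISO-8859-1.
     * ISO-8859-1 maps every byte to exactly one character, so getBytes()
     * restores the original byte sequence, which is then decoded as UTF-8.
     */
    static String redecodeAsUtf8(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }
}
```

For example, "ä" is stored in UTF-8 as the two bytes 0xC3 0xA4. Read as ISO-8859-1 they show up as "Ã¤", and redecodeAsUtf8("\u00C3\u00A4") gives back "ä". Pure ASCII strings pass through unchanged.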

Tags: java, android, xml, encoding, sax

Marius5000
