
Dynamic SAX Parser for UTF-8 or ISO-8859-1 encoded XML

I am developing an app for Android where I have to parse different XML files. Most of them are encoded in UTF-8, but a few may be encoded in ISO-8859-1.

  HttpURLConnection con = (HttpURLConnection) url.openConnection();
  ...
  in = con.getInputStream();
  InputSource is = new InputSource(in);
  ...
  parser.parse(is, handler);

My code for handling the input is shown above. The Java documentation for InputSource says:

If there is no character stream, but there is a byte stream, the parser will use that byte stream, using the encoding specified in the InputSource or else (if no encoding is specified) autodetecting the character encoding using an algorithm such as the one in the XML specification.

I am passing in a byte stream and I don't specify an encoding, so according to the documentation the encoding should be autodetected. But it isn't. All files encoded in UTF-8 are fine, but the ISO-8859-1 ones are not (I get a Parser Expat... Exception for some invalid characters). If I manually set the encoding of the InputSource to "ISO-8859-1", it behaves the other way round.

How can I solve this? I searched Google and Stack Overflow for hours without finding a solution. I also tried passing a character stream to the InputSource, but some characters (äöüÄÖÜß) in the ISO-8859-1 files are still displayed as "?" in my app.

Thanks in advance!

Marius5000 asked Nov 12 '22 15:11


2 Answers

The best solution depends on the exact cause of your problem. If you retrieve an XML document over HTTP, the encoding may be specified in the Content-Type response header and not necessarily in the XML document itself. If that is the case and the XML libraries in Android are correctly implemented (I have no way to check here whether the Content-Type header is evaluated), you should be able to create the InputSource directly from the URL instead, with new InputSource("http://...");.
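If the parser does not evaluate the header itself, one option is to read the charset from Content-Type and pass it to InputSource.setEncoding. The following is only a minimal sketch under that assumption; the class ContentTypeCharset and its methods charsetOf and toInputSource are hypothetical names, not part of any library, and the header parsing is deliberately simple.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.util.Locale;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Android or the JDK.
class ContentTypeCharset {

    /** Returns the charset parameter of a Content-Type value, or null if there is none. */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // The parameter value may be quoted, e.g. charset="utf-8".
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    /** Wraps the response body and sets the encoding only if the header names one. */
    static InputSource toInputSource(HttpURLConnection con) throws IOException {
        InputSource is = new InputSource(con.getInputStream());
        String charset = charsetOf(con.getContentType());
        if (charset != null) {
            is.setEncoding(charset);
        }
        return is;
    }
}
```

With this, parser.parse(ContentTypeCharset.toInputSource(con), handler) would use the header's charset when the server sends one and otherwise leave the decision to the parser.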

If the encoding is not set in the HTTP header and not specified in the XML prologue, the parser operates correctly if it assumes UTF-8 encoding (as mandated by the XML specification). The autodetection mentioned in the documentation does not mean that the parser guesses the encoding from the document's text; it means that the parser checks for a byte order mark and reads the encoding declaration at the start of the XML stream. If there is no byte order mark and no encoding declaration, it defaults to UTF-8.
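To illustrate the point about the prologue, here is a small sketch that parses ISO-8859-1 bytes whose XML declaration names the encoding. It assumes a desktop JDK with the standard javax.xml.parsers API; the behaviour of Android's Expat-based parser may differ. The class PrologueEncodingDemo and its methods are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not from the question.
class PrologueEncodingDemo {

    /** Parses the bytes without setting an encoding and returns all character data. */
    static String parseText(byte[] xmlBytes) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            // No setEncoding call: the parser has to rely on the XML declaration.
            InputSource source = new InputSource(new ByteArrayInputStream(xmlBytes));
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return text.toString();
    }

    /** A document containing "äöü", encoded as ISO-8859-1 and declaring that encoding. */
    static byte[] latin1Document() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a>\u00e4\u00f6\u00fc</a>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }
}
```

If the ISO-8859-1 files in question lack such a declaration, a conforming parser would be expected to treat them as UTF-8 and reject the umlaut bytes, which would match the exception described in the question.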

jarnbjo answered Nov 15 '22 06:11


I would suggest checking whether the string contains characters outside the ASCII range and re-encoding it if they look like misdecoded UTF-8 characters:

String output=new String(input.getBytes("8859_1"), "utf-8");

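That line turns the string back into its original bytes using ISO-8859-1 and then decodes those bytes as UTF-8. It repairs text that was UTF-8 on the wire but was mistakenly decoded as ISO-8859-1.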
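A small round trip may make the effect clearer. This is only a sketch; MojibakeRepair, misdecode and repair are hypothetical names, and the repair only helps when the input really is UTF-8 text that was decoded as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating the one-liner above.
class MojibakeRepair {

    /** Simulates the bug: UTF-8 bytes decoded as ISO-8859-1. */
    static String misdecode(String original) {
        return new String(original.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    /** Reverses it: recover the raw bytes via ISO-8859-1, then decode them as UTF-8. */
    static String repair(String misdecoded) {
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String broken = misdecode("\u00e4");   // "ä" becomes the two characters U+00C3 U+00A4
        System.out.println(broken.length());   // 2
        System.out.println(repair(broken).equals("\u00e4")); // true
    }
}
```

Applying repair to text that was decoded correctly would damage any non-ASCII characters in it, so it should only be used once the misdecoding has been confirmed.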

rekire answered Nov 15 '22 07:11