
How to parse an XML file containing BOM?

Tags: java, xml, jdom

I want to parse an XML file from a URL using JDOM, but when I try this:

SAXBuilder builder = new SAXBuilder();
builder.build(aUrl);

I get this exception:

Invalid byte 1 of 1-byte UTF-8 sequence.

I thought this might be a BOM issue, so I checked the source and saw a BOM at the beginning of the file. I tried reading from the URL using aUrl.openStream() and removing the BOM with the Commons IO BOMInputStream, but to my surprise it didn't detect any BOM. I then tried reading from the stream, writing to a local file, and parsing that file. I set the encoding for both the InputStreamReader and the OutputStreamWriter to UTF-8, but when I opened the file it contained garbage characters.

I then suspected the encoding of the source URL. But when I open the URL in a browser, save the XML to a file, and run that file through the process described above, everything works fine.

I would appreciate any help identifying the possible cause of this issue.

doctrey asked Dec 12 '11 21:12


1 Answer

That HTTP server is sending the content in GZIPped form (Content-Encoding: gzip; see http://en.wikipedia.org/wiki/HTTP_compression if you don't know what that means), so you need to wrap aUrl.openStream() in a GZIPInputStream that will decompress it for you. For example:

builder.build(new GZIPInputStream(aUrl.openStream()));
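A small in-memory sketch may help show what is going on, without touching the network. It assumes nothing about the actual server: the XML string and the class name GzipRoundTrip are made up for illustration. Gzip data starts with the bytes 0x1f 0x8b rather than the UTF-8 BOM (EF BB BF), which is probably why BOMInputStream found no BOM and the parser rejected the bytes as UTF-8. Wrapping the data in a GZIPInputStream recovers the original text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; not part of the original question or answer.
class GzipRoundTrip {

    // Compresses the given bytes the way a server using Content-Encoding: gzip would.
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(data);
        }
        return buf.toByteArray();
    }

    // Decompresses by wrapping the raw bytes in a GZIPInputStream.
    static byte[] gunzip(byte[] data) throws IOException {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return in.readAllBytes(); // Java 9+
        }
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample document standing in for the remote XML.
        byte[] xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root/>"
                .getBytes(StandardCharsets.UTF_8);
        byte[] compressed = gzip(xml);

        // The first two bytes are the gzip magic number: prints "1f 8b".
        System.out.printf("%02x %02x%n", compressed[0] & 0xff, compressed[1] & 0xff);

        // After decompression the original XML text is back.
        System.out.println(new String(gunzip(compressed), StandardCharsets.UTF_8));
    }
}
```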

Edited to add, based on the follow-up comment: If you don't know in advance whether the URL will be GZIPped, you can write something like this:

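import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.zip.GZIPInputStream;

// Opens the URL and transparently decompresses the body if the server gzipped it.
private InputStream openStream(final URL url) throws IOException
{
    final URLConnection cxn = url.openConnection();
    final String contentEncoding = cxn.getContentEncoding();
    if(contentEncoding == null)
        // No Content-Encoding header: return the body as-is.
        return cxn.getInputStream();
    else if(contentEncoding.equalsIgnoreCase("gzip")
               || contentEncoding.equalsIgnoreCase("x-gzip"))
        return new GZIPInputStream(cxn.getInputStream());
    else
        throw new IOException("Unexpected content-encoding: " + contentEncoding);
}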

(warning: not tested) and then use:

builder.build(openStream(aUrl));

This is basically equivalent to the above (aUrl.openStream() is explicitly documented to be a shorthand for aUrl.openConnection().getInputStream()), except that it examines the Content-Encoding header before deciding whether to wrap the stream in a GZIPInputStream.
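If the header cannot be trusted, or only a raw InputStream is available, one alternative is to peek at the first two bytes and look for the gzip magic number 0x1f 0x8b. The following is a minimal sketch of that idea, not something from the original answer: the class GzipSniffer and the method maybeGunzip are hypothetical names, and the check only recognizes gzip, not other encodings.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, for illustration only.
class GzipSniffer {

    // Returns a decompressing stream if the data starts with the gzip magic
    // bytes 0x1f 0x8b; otherwise returns a stream over the unchanged data.
    static InputStream maybeGunzip(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 2);
        byte[] head = new byte[2];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until 2 or EOF.
        while (n < 2) {
            int r = pb.read(head, n, 2 - n);
            if (r < 0) break;
            n += r;
        }
        // Push the peeked bytes back so the caller sees the full stream.
        if (n > 0) pb.unread(head, 0, n);
        boolean gzip = n == 2
                && (head[0] & 0xff) == 0x1f
                && (head[1] & 0xff) == 0x8b;
        return gzip ? new GZIPInputStream(pb) : pb;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample document standing in for the remote XML.
        byte[] xml = "<root/>".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(xml);
        }
        // Plain and compressed input both come out as the same text.
        for (byte[] input : new byte[][] { xml, buf.toByteArray() }) {
            try (InputStream in = maybeGunzip(new ByteArrayInputStream(input))) {
                System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
            }
        }
    }
}
```

With such a helper, the call from the question would look like builder.build(GzipSniffer.maybeGunzip(aUrl.openStream())). Checking the Content-Encoding header is still the more conventional approach. Sniffing is a fallback for cases where the header is missing or wrong.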

See the documentation for java.net.URLConnection.

ruakh answered Oct 03 '22 08:10