
Java XMLReader not clearing multi-byte UTF-8 encoded attributes

I've got a really strange situation where my SAX ContentHandler is being handed bad Attributes by XMLReader. The document being parsed is UTF-8 with multi-byte characters inside XML attributes. These attribute values appear to accumulate each time my handler is called: instead of each element getting its own value, the value is concatenated onto the previous node's value.

Here is an example which demonstrates this using public data (Wikipedia).

public class MyContentHandler extends org.xml.sax.helpers.DefaultHandler {

    public static void main(String[] args) {
        try {
            // Uses whichever SAX parser the JRE provides by default.
            org.xml.sax.XMLReader reader = org.xml.sax.helpers.XMLReaderFactory.createXMLReader();
            reader.setContentHandler(new MyContentHandler());
            reader.parse("http://en.wikipedia.org/w/api.php?format=xml&action=query&list=allpages&apfilterredir=redirects&apdir=descending");

        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }

    // Prints the title attribute of every <p> element as it is encountered.
    @Override
    public void startElement(String uri, String localName, String qName, org.xml.sax.Attributes attributes) {
        if ("p".equals(qName)) {
            String title = attributes.getValue("title");
            System.out.println(title);
        }
    }
}

Update: This complete example produces (apologies to any Cantonese speakers for the vulgar output):

𩧢
𩧢𨳒
𩧢𨳒🛅
𩧢𨳒🛅🛄
𩧢𨳒🛅🛄🛃
𩧢𨳒🛅🛄🛃🛂
𩧢𨳒🛅🛄🛃🛂🛁
𩧢𨳒🛅🛄🛃🛂🛁🛀
𩧢𨳒🛅🛄🛃🛂🛁🛀🚿
𩧢𨳒🛅🛄🛃🛂🛁🛀🚿🚾

Does anyone have any clue what is happening and how to fix it? The attribute values in the returned document don't match what I see when I debug through this snippet.

mckamey asked Nov 15 '22 00:11


1 Answer

This seems to be a bug in the version of Xerces bundled with the JRE (com.sun.org.apache.xerces.internal.parsers.SAXParser). Below are my notes.

The version bundled with JRE 1.6.0_24, as well as v2.4.0, v2.5.0, and v2.6.0, all accumulate attribute values.

Xerces-J v1.4.4 does not appear to have the bug.

Xerces2-J v2.6.1, v2.6.2, v2.9.0, and v2.11.0 do not appear to have the bug.

You can tell by the versions tested that I was bisecting the version history. It appears to be something that was fixed between v2.6.0 and v2.6.1. I'm kind of surprised the JRE hasn't been updated, as it was fixed in the main Xerces about 7 years ago!
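
To check whether a particular runtime is affected without depending on the Wikipedia API, a small local reproduction along these lines might help. This is only a sketch and not part of my original testing: the class name AttributeCheck, the helper collectTitles, and the inline test document are made up for illustration, and it obtains the reader through SAXParserFactory rather than XMLReaderFactory. It prints the parser implementation class actually in use, then the number of code points in each title attribute. On an unaffected parser I would expect each title to hold a single code point; on an affected one the later titles would presumably grow.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class AttributeCheck {

    // Hypothetical helper: parses the XML and returns each <p> title in document order.
    public static List<String> collectTitles(String xml) throws Exception {
        final List<String> titles = new ArrayList<String>();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // Print the implementation class so it is clear which parser is being exercised.
        System.out.println("Parser: " + reader.getClass().getName());
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                if ("p".equals(qName)) {
                    titles.add(attributes.getValue("title"));
                }
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return titles;
    }

    public static void main(String[] args) throws Exception {
        // Assumed test input: two supplementary-plane characters (U+299E2, U+28CD2),
        // written as surrogate-pair escapes so the source file encoding does not matter.
        String xml = "<r><p title=\"\uD866\uDDE2\"/><p title=\"\uD863\uDCD2\"/></r>";
        for (String title : collectTitles(xml)) {
            System.out.println(title.codePointCount(0, title.length()) + " code point(s)");
        }
    }
}
```

If the printed class is com.sun.org.apache.xerces.internal.parsers.SAXParser (or a nested class of its JAXP wrapper), the JRE-bundled parser is in use. Putting a standalone Xerces2-J jar on the classpath should change the class that gets reported.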

mckamey answered Apr 30 '23 14:04