
SAXParser fails to parse some characters

I am doing some simple SAX parsing with SAXParser on Android (Java).

It parses most files properly, but fails when it encounters certain special characters, for example when parsing the XML below:

<?xml version="1.0" encoding="ISO-8859-1" ?><MTRXML version="1.0">
<GEOCODE key="pohj">
<LOC name1="Pohjantori" number="" city="Espoo" code="995" address="" type="1" category="poi" x="2544225" y="6674893" lon="24.79378" lat="60.18324" />
<LOC name1="Pohjois-Haaga" number="" city="Helsinki" code="41" address="" type="1" category="poi" x="2549164" y="6680186" lon="24.88405" lat="60.23018" />
<LOC name1="Pohjois-Leppävaara" number="" city="Espoo" code="50" address="" type="1" category="poi" x="2545057" y="6679240" lon="24.80974" lat="60.22216" />

It fails when it reaches the ä in Pohjois-Leppävaara on the last line.

The error it gives is:

01-30 18:14:52.039: WARN/System.err(686): org.apache.harmony.xml.ExpatParser$ParseException: At line 5, column 24: not well-formed (invalid token)

I am sure SAXParser can handle those characters, but I believe I need to set an encoding somewhere?

The Java code is as follows:

    SAXParserFactory factory = SAXParserFactory.newInstance();

    SAXParser parser = null;
    try {
        parser = factory.newSAXParser();
    } catch (ParserConfigurationException e) {
        e.printStackTrace();
        return null;
    } catch (SAXException e) {
        e.printStackTrace();
        return null;
    }

    XmlHandler handler = new XmlHandler();
    try {
        parser.parse(urls[0], handler);
    } catch (SAXException e) {
        e.printStackTrace();
        return null;
    } catch (IOException e) {
        e.printStackTrace();
        return null;
    }
Ahmad Mushtaq asked Jan 30 '11 16:01


1 Answer

I expect this is an error in the document encoding. Use a hex editor to verify that Leppävaara is stored as the byte sequence 4c 65 70 70 e4 76 61 61 72 61. If ä is anything other than E4 (in UTF-8, for example, it would be the two bytes C3 A4), then the document was saved in an encoding other than ISO-8859-1 and does not match its XML declaration.
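
If a hex editor is not at hand, the same check can be done in code. The following is only a minimal sketch, assuming desktop Java 7 or later (or a recent Android API level) for `StandardCharsets`. The class name `EncodingCheck` and its helper methods are made up for illustration and are not part of the original question. It prints the byte sequence of the word under both encodings and guesses which one a raw document actually uses.

```java
import java.nio.charset.StandardCharsets;

public class EncodingCheck {

    /** Renders bytes as lowercase, space-separated hex, e.g. "4c 65". */
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Returns true if needle occurs somewhere in haystack. */
    public static boolean contains(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return true;
        }
        return false;
    }

    /**
     * Guesses how a word containing a non-ASCII character was encoded in the
     * raw document bytes. Only ISO-8859-1 and UTF-8 are considered; for a
     * pure ASCII word both match and "ISO-8859-1" is returned.
     */
    public static String guessEncoding(byte[] document, String word) {
        if (contains(document, word.getBytes(StandardCharsets.ISO_8859_1))) {
            return "ISO-8859-1";
        }
        if (contains(document, word.getBytes(StandardCharsets.UTF_8))) {
            return "UTF-8";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        // \u00e4 is ä; the escape keeps this source file encoding-neutral.
        String word = "Lepp\u00e4vaara";
        // prints: 4c 65 70 70 e4 76 61 61 72 61
        System.out.println(toHex(word.getBytes(StandardCharsets.ISO_8859_1)));
        // prints: 4c 65 70 70 c3 a4 76 61 61 72 61
        System.out.println(toHex(word.getBytes(StandardCharsets.UTF_8)));
    }
}
```

To apply it to the real document, read the response bytes before parsing and pass them to `guessEncoding`. If the result is UTF-8, there are two possible fixes. One is to correct the XML declaration at the source. The other is to decode the bytes yourself with an `InputStreamReader` and hand the parser an `InputSource` built from that reader, since SAX documents that a character stream takes precedence over the encoding declaration.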

McDowell answered Sep 21 '22 23:09
