I'm so used to using <?xml version="1.0" encoding="UTF-8"?> that it didn't occur to me until now that there might be some subtleties with other encodings when using the standard Java XML libraries (SAX, DOM, StAX)...
Do these libraries automatically handle the encoding attribute in the header when reading XML documents? If so, where is this documented? (It's not in the Javadoc for DocumentBuilder or DocumentBuilderFactory.) If not, what do I have to do to make it work for different encodings?
DocumentBuilder uses the SAX API to provide the document to the implementation for parsing (though the implementation might not actually use a SAX parser), and the Javadoc for SAX's org.xml.sax.InputSource says what it does with the header.
The SAX parser will use the InputSource object to determine how to read XML input. If there is a character stream available, the parser will read that stream directly, disregarding any text encoding declaration found in that stream. If there is no character stream, but there is a byte stream, the parser will use that byte stream, using the encoding specified in the InputSource or else (if no encoding is specified) autodetecting the character encoding using an algorithm such as the one in the XML specification. If neither a character stream nor a byte stream is available, the parser will attempt to open a URI connection to the resource identified by the system identifier.
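To make the difference between a byte stream and a character stream concrete, here is a minimal sketch using the JDK's default DocumentBuilder. The class name EncodingDemo and its helper methods are made up for illustration, and the middle case assumes the parser supports ISO-8859-1 and simply trusts the declaration, which is what the JDK's bundled parser does as far as I know.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class EncodingDemo {
    // The declaration claims ISO-8859-1; \u00e9 is "e" with an acute accent.
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>caf\u00e9</name>";

    // Parse the source and return the text of the root element.
    static String parse(InputSource source) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(source).getDocumentElement().getTextContent();
    }

    // Byte stream: the parser has to work out the encoding itself,
    // so it relies on the declaration in the header.
    static String fromBytes(Charset actualEncoding) throws Exception {
        byte[] bytes = XML.getBytes(actualEncoding);
        return parse(new InputSource(new ByteArrayInputStream(bytes)));
    }

    // Character stream: the text is already decoded, so the declaration
    // is disregarded.
    static String fromReader() throws Exception {
        return parse(new InputSource(new StringReader(XML)));
    }

    public static void main(String[] args) throws Exception {
        String expected = "caf\u00e9";
        // Bytes really are ISO-8859-1, matching the declaration.
        System.out.println(expected.equals(fromBytes(StandardCharsets.ISO_8859_1)));
        // Bytes are UTF-8 but the declaration says ISO-8859-1: text is mis-decoded.
        System.out.println(expected.equals(fromBytes(StandardCharsets.UTF_8)));
        // Reader input: the declaration does not matter.
        System.out.println(expected.equals(fromReader()));
    }
}
```

So with an InputStream (or a File or URL) the declaration is honoured, which also means it has to be truthful. With a Reader, you are responsible for having decoded the bytes correctly before the parser sees them.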
So interesting cases could include an XML stream supplied via HTTP, with an HTTP Content-Type header that conflicts with the XML's encoding declaration.
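For the HTTP case, one possible approach (a sketch, not the only reasonable policy) is to pull the charset parameter out of the Content-Type header and hand it to the parser through InputSource.setEncoding, which per the Javadoc quoted above is used in preference to autodetection for a byte stream. The class HttpXmlSource and its methods are hypothetical names, and the header parsing is deliberately simplistic.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.Locale;
import org.xml.sax.InputSource;

public class HttpXmlSource {
    // Extract the charset parameter from a Content-Type value such as
    // "text/xml; charset=ISO-8859-1". Returns null if there is none.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters follow it.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // Build an InputSource over the response body. If the header names a
    // charset, pass it on; otherwise leave the encoding unset so the parser
    // falls back to autodetection and the XML declaration.
    static InputSource toInputSource(InputStream body, String contentType) {
        InputSource source = new InputSource(body);
        String charset = charsetOf(contentType);
        if (charset != null) {
            source.setEncoding(charset);
        }
        return source;
    }

    public static void main(String[] args) {
        InputStream body = new ByteArrayInputStream(new byte[0]);
        InputSource source = toInputSource(body, "text/xml; charset=ISO-8859-1");
        System.out.println(source.getEncoding());
    }
}
```

Whether the HTTP header or the XML declaration should win when they disagree is a policy decision for your application; this sketch just shows where the header value would be plugged in.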