I have a String containing a binary 0 inside, in UTF-8 ("A\u0000B"). JAXB happily marshals an XML document containing such a character but then fails to unmarshal it:
final JAXBContext jaxbContext = JAXBContext.newInstance(Root.class);
final Marshaller marshaller = jaxbContext.createMarshaller();
final Unmarshaller unmarshaller = jaxbContext.createUnmarshaller();

Root root = new Root();
root.value = "A\u0000B";

final ByteArrayOutputStream os = new ByteArrayOutputStream();
marshaller.marshal(root, os);   // succeeds, writes the NUL byte as-is
unmarshaller.unmarshal(new ByteArrayInputStream(os.toByteArray()));   // throws
The Root class is as simple as it gets:
@XmlRootElement
class Root { @XmlValue String value; }
The output XML contains the binary 0 between A and B (in hex: 41 00 42), which causes the following error during unmarshalling:
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 63;
An invalid XML character (Unicode: 0x0) was found in the element content of the document.
Interestingly, using the raw DOM API (example) produces an escaped 0 instead: A&#0;B, but trying to read it back yields a similar error. Also, 0 (neither binary nor escaped) is not allowed by any XML parser or by xmllint (see also: Python + Expat: Error on &#0; entities).
Why does the JAXB/DOM API allow creating invalid XML documents which it cannot read back? Shouldn't it fail fast during marshalling?
Is there some elegant and global solution? I have seen people tackling this problem by:
manually ignoring special characters from input
intercepting incoming stream or even
implementing some internal com.sun.xml.internal.bind.marshaller.CharacterEscapeHandler class
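For reference, the first of these options, manually stripping disallowed characters, usually amounts to filtering the String against the XML 1.0 "Char" production (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]) before handing it to JAXB. A minimal sketch; the class and method names here are mine, not part of any library:

```java
// The XML 1.0 "Char" production permits only:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
final class XmlSanitizer {

    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of s with every code point outside the Char production removed. */
    static String stripInvalidXmlChars(String s) {
        final StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            final int cp = s.codePointAt(i);   // handles surrogate pairs
            if (isXmlChar(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripInvalidXmlChars("A\u0000B")); // prints "AB"
    }
}
```

Calling `stripInvalidXmlChars` on each value before marshalling silently drops the bad data, which may or may not be acceptable for your use case.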
But shouldn't a mature XML stack in Java (I'm using 1.7.0_05) handle this either by default or via some simple setting? I'm looking for escaping, ignoring, or failing fast; the default behavior of generating invalid XML is not acceptable. I believe such fundamental functionality should not require any extra coding on the client side.
Why does the JAXB/DOM API allow creating invalid XML documents which it cannot read back? Shouldn't it fail fast during marshalling?
You would need to ask the implementors.
It is possible that they thought the expense of checking every serialised data character was not justified ... especially since the parser is then going to check them all over again.
Having decided to implement the serializer this way (or having just done so by mistake), if they then changed the behaviour to do strict checking by default, they would break existing code that depends on being able to serialise illegal XML.
But shouldn't a mature XML stack in Java (I'm using 1.7.0_05) handle this either by default or via some simple setting?
Not necessarily ... if you accept reason #2 above. Even a simple setting could have a measurable impact on performance.
Also 0 (neither binary nor escaped) is not allowed by any XML parser or xmllint ...
Quite rightly so! It is forbidden by the XML spec.
However, a more interesting test would be to see what happens when you try to generate XML containing an illegal character using other XML stacks.
Is there some elegant and global solution?
If the problem you are trying to solve is how to send a \u0000 or \u000B, then you need to apply some application-specific encoding to the String before you insert it into the DOM. And the other end needs to deploy the equivalent decoding.
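Base64 is one such application-specific encoding: the sender encodes the raw String, the payload travels as plain ASCII that is always legal XML content, and the receiver decodes it after unmarshalling. A sketch using java.util.Base64 (Java 8+; on the asker's Java 7, javax.xml.bind.DatatypeConverter.printBase64Binary does the same job, and mapping the field as byte[] would make JAXB base64-encode it for you):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64Roundtrip {
    public static void main(String[] args) {
        String raw = "A\u0000B"; // contains a NUL, illegal in any XML document

        // Sender: encode before setting the value on the JAXB object.
        String safe = Base64.getEncoder()
                .encodeToString(raw.getBytes(StandardCharsets.UTF_8));
        System.out.println(safe); // plain ASCII, always XML-safe

        // Receiver: decode after unmarshalling.
        String decoded = new String(Base64.getDecoder().decode(safe),
                StandardCharsets.UTF_8);
        System.out.println(raw.equals(decoded)); // prints "true"
    }
}
```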
If the problem you are trying to solve is how to detect the bad data before it is too late, you could do this with an output stream filter between the serializer and the final output stream. But if you detect the badness, there is no good (i.e. transparent to the XML consumer) way to fix it.
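Such a filter could look like the sketch below (the class name is mine). It only catches the illegal C0 control characters, which in UTF-8 are encoded as single bytes, so a byte-level check suffices for them; catching other illegal code points (lone surrogates, U+FFFE) would require actually decoding the stream:

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/**
 * Rejects the C0 control characters that XML 1.0 forbids (everything
 * below 0x20 except tab, LF and CR) before they reach the wrapped stream.
 */
class FailFastXmlOutputStream extends FilterOutputStream {

    FailFastXmlOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        final int v = b & 0xFF;
        if (v < 0x20 && v != 0x09 && v != 0x0A && v != 0x0D) {
            throw new IOException("Illegal XML character 0x"
                    + Integer.toHexString(v));
        }
        out.write(b);
    }
    // FilterOutputStream routes write(byte[], int, int) through write(int),
    // so array writes are checked as well.
}
```

With this in place, marshalling to `new FailFastXmlOutputStream(os)` should fail during marshalling (JAXB would surface the IOException, likely wrapped in a MarshalException) instead of silently emitting invalid XML.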