
Unable to unmarshall \u0000 after successfully marshalling it [closed]

I have a String containing a binary 0 (the NUL character, "A\u0000B"). JAXB happily marshals an XML document containing this character as UTF-8, but then fails to unmarshal it:

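import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;
import javax.xml.bind.Unmarshaller;

final JAXBContext jaxbContext = JAXBContext.newInstance(Root.class);
final Marshaller marshaller = jaxbContext.createMarshaller();
final Unmarshaller unmarshaller = jaxbContext.createUnmarshaller();

Root root = new Root();
root.value = "A\u0000B";

// Marshalling succeeds and writes the raw 0x00 byte.
final ByteArrayOutputStream os = new ByteArrayOutputStream();
marshaller.marshal(root, os);

// Unmarshalling the very same bytes fails.
unmarshaller.unmarshal(new ByteArrayInputStream(os.toByteArray()));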

The Root class is trivial:

import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.bind.annotation.XmlValue;
@XmlRootElement
class Root { @XmlValue String value; }

The output XML also contains the binary 0 between A and B (in hex: 41 00 42), which causes the following error during unmarshalling:

org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 63; 
An invalid XML character (Unicode: 0x0) was found in the element content of the document.

Interestingly, using the raw DOM API (example) produces an escaped 0: A&#0;B, but trying to read it back yields a similar error. Also, 0 (whether binary or escaped) is not allowed by any XML parser or by xmllint (see also: Python + Expat: Error on &#0; entities).

My questions:

  • why do the JAXB and DOM APIs allow creating invalid XML documents which they cannot read back? Shouldn't they fail fast during marshalling?

  • is there some elegant and global solution? I saw people tackling this problem by:

    • manually ignoring special characters from input

    • intercepting incoming stream or even

    • implementing some internal com.sun.xml.internal.bind.marshaller.CharacterEscapeHandler class

But shouldn't a mature XML stack in Java (I'm using 1.7.0_05) handle this either by default or through some simple setting? I'm looking for escaping, ignoring or failing fast, but the default behavior of generating invalid XML is not acceptable. I believe such fundamental functionality should not require any extra coding on the client side.

Tomasz Nurkiewicz asked Oct 08 '12 10:10


1 Answer

why do the JAXB and DOM APIs allow creating invalid XML documents which they cannot read back? Shouldn't they fail fast during marshalling?

  1. You would need to ask the implementors.

  2. It is possible that they thought that the expense of checking every data character serialised was not justified ... especially if the parser is then going to check them all over again.

  3. Having decided to implement the serializer this way (or having just done so by mistake), if they then changed the behaviour to do strict checking by default, they would break existing code that depends on being able to serialise illegal XML.

But shouldn't a mature XML stack in Java (I'm using 1.7.0_05) handle this either by default or through some simple setting?

Not necessarily ... if you accept reason #2 above. Even a simple setting could have a measurable impact on performance.


Also, 0 (whether binary or escaped) is not allowed by any XML parser or by xmllint ...

Quite rightly so! It is forbidden by the XML spec.
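For reference, the rule in question is the Char production in section 2.2 of the XML 1.0 specification. The sketch below restates that production as a Java predicate; the class and method names (XmlChars, isXmlChar, firstIllegalIndex) are made up for illustration and are not part of JAXB or any JDK API.

```java
/** Sketch: checks text against the Char production of XML 1.0 (section 2.2). */
final class XmlChars {
    private XmlChars() {}

    /** True if the code point may appear anywhere in an XML 1.0 document. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Index of the first illegal character in s, or -1 if every character is legal. */
    static int firstIllegalIndex(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i); // an unpaired surrogate comes back as-is and is rejected
            if (!isXmlChar(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstIllegalIndex("A\u0000B")); // prints 1
        System.out.println(firstIllegalIndex("A\tB"));     // prints -1
    }
}
```

Note that a character reference does not get around the rule: XML 1.0 requires the character a reference points to to match Char as well, which is why the escaped form is rejected too.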

However, a more interesting test would be to see what happens when you try to generate XML containing an illegal character using other XML stacks.


is there some elegant and global solution?

If the problem you are trying to solve is how to send a \u0000 or \u000B, then you need to apply some application-specific encoding to the String before you insert it into the DOM. And the other end needs to deploy the equivalent decoding.
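One possible application-specific encoding is Base64 over the UTF-8 bytes of the String. The following is only a sketch of that idea: the OpaqueText class and its method names are hypothetical, and it assumes Java 8 or later for java.util.Base64 (on the Java 7 runtime mentioned in the question, javax.xml.bind.DatatypeConverter offers comparable Base64 helpers).

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Sketch: wraps arbitrary text in Base64 so that it is always legal XML content. */
final class OpaqueText {
    private OpaqueText() {}

    /** Sender side: call before putting the value into the DOM or the JAXB object. */
    static String encode(String raw) {
        return Base64.getEncoder().encodeToString(raw.getBytes(StandardCharsets.UTF_8));
    }

    /** Receiver side: call on the value read back from the parsed document. */
    static String decode(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String wire = encode("A\u0000B");
        System.out.println(wire);                            // prints QQBC
        System.out.println(decode(wire).equals("A\u0000B")); // prints true
    }
}
```

The cost is that the element content is no longer human-readable, and both ends have to agree on the scheme. With JAXB, declaring the field as byte[] instead of String has a similar effect, since byte[] is mapped to xs:base64Binary by default.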

If the problem you are trying to solve is how to detect the bad data before it is too late, you could do this with an output stream filter between the serializer and the final output stream. But if you detect the badness, there is no good (i.e. transparent to the XML consumer) way to fix it.
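Such a filter could look roughly like the sketch below. It is a fail-fast check only, under two assumptions: the serializer writes UTF-8 (or another ASCII-compatible encoding, so that a byte below 0x20 always is a control character), and only the C0 control characters need catching; other illegal code points such as U+FFFE are not detected. The class name XmlControlCharGuard is made up for this example.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Sketch: rejects control bytes that XML 1.0 forbids, assuming UTF-8 output. */
final class XmlControlCharGuard extends FilterOutputStream {

    XmlControlCharGuard(OutputStream out) {
        super(out);
    }

    /** Throws if the byte is a C0 control other than tab, line feed or carriage return. */
    private static void check(int b) throws IOException {
        int v = b & 0xFF;
        if (v < 0x20 && v != 0x09 && v != 0x0A && v != 0x0D) {
            throw new IOException(String.format("Illegal XML control byte 0x%02X", v));
        }
    }

    @Override
    public void write(int b) throws IOException {
        check(b);
        out.write(b);
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        // Validate the whole chunk first so that nothing from a bad chunk is forwarded.
        for (int i = off; i < off + len; i++) {
            check(buf[i]);
        }
        out.write(buf, off, len);
    }
}
```

In the code from the question it would be wrapped around the target stream, for example marshaller.marshal(root, new XmlControlCharGuard(os)), so that the failure surfaces at marshalling time instead of at the consumer.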

Stephen C answered Sep 22 '22 02:09