 

How can I stop XmlSerializer transforming &#234; to &amp;#234; in an attribute?

I have the following DOM

    <row>
        <link href="B&#252;ro.txt" target="_blank">
            my link
        </link>
    </row>

When I serialize it to a file using the Java XmlSerializer it comes out like this:

    <row>
        <link href="B&amp;#252;ro.txt" target="_blank">
            my link
        </link>
    </row>

Is there any way to control the way XmlSerializer handles escaping in attributes? Should I be doing this differently any way?

Update

I should also say that I am using JRE 1.6. I had been using JRE 1.5 until recently, and I am pretty sure that it serialized this 'correctly' (i.e. the '&' was not escaped).

Clarification

The DOM is created programmatically. Here is an example:

        Document doc = createDocument();
        Element root = doc.createElement("root");
        doc.appendChild(root);
        root.setAttribute("test1", "&#234;");
        root.setAttribute("test2", "üöä");
        root.appendChild(doc.createTextNode("&#234;"));

        StringWriter sw = new StringWriter();

        serializeDocument(doc, sw);
        System.out.println(sw.toString());

My solution

I didn't really want to do this because it involved a fair amount of code change and testing, but I decided to move the attribute data into a CDATA section. Problem avoided.

paul asked Jun 09 '10 05:06


2 Answers

The problem is that you are building the DOM with attribute values that have already been "escaped" according to the XML conventions. The DOM (of course) doesn't realize that you have done this and is escaping the ampersand.

You should change

    root.setAttribute("test1", "&#234;");

to

    root.setAttribute("test1", "\u00EA");

In other words, use strings consisting of plain Unicode code points when constructing the DOM. The XMLSerializer should then replace Unicode characters with numeric character references as required, depending on the chosen character encoding for the output document.
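To make the point concrete, here is a minimal sketch (the class name `DomLiteralDemo` and the helper `storedValue` are made up for illustration) showing that the DOM stores whatever string it is given, character for character. It uses only the JDK's standard DOM API and does not involve any serializer:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class: shows what the DOM actually stores for an attribute.
class DomLiteralDemo {

    /** Sets the attribute to {@code raw} and returns the value the DOM holds. */
    static String storedValue(String raw) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("root");
            doc.appendChild(root);
            root.setAttribute("test1", raw);
            return root.getAttribute("test1");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Six literal characters: '&', '#', '2', '3', '4', ';'
        System.out.println(storedValue("&#234;").length());
        // One character: U+00EA
        System.out.println(storedValue("\u00EA").length());
    }
}
```

Because the first value really does contain an ampersand, any conforming serializer has to write it as `&amp;` to keep the output well-formed.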

EDIT - The reason that you may still be seeing raw characters rather than character references in the output XML is that the XMLSerializer is using the default encoding for XML, i.e. UTF-8. The way to address this is to use the XMLSerializer(OutputFormat) constructor, passing an OutputFormat that specifies the required character encoding for the XML. (It sounds like you are using "ASCII".) Be sure to use a compatible character encoding for the OutputStream.
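The effect of the output encoding can be sketched without Xerces on the classpath. Note that this swaps in the JDK's TrAX API (`javax.xml.transform`) in place of `XMLSerializer(OutputFormat)`, so it illustrates the same idea and is not the API named above. The class name `AsciiOutputDemo` and its `serialize` helper are hypothetical:

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class: serializes a one-attribute document with TrAX.
class AsciiOutputDemo {

    /** Builds <root test1="..."/> and serializes it for the given encoding. */
    static String serialize(String attrValue, String encoding) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("root");
            doc.appendChild(root);
            root.setAttribute("test1", attrValue);

            // Identity transform: copies the DOM to the output unchanged.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.ENCODING, encoding);
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter sw = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(sw));
            return sw.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Plain Unicode in the DOM, ASCII-only output encoding.
        System.out.println(serialize("\u00EA", "US-ASCII"));
        // Pre-escaped text in the DOM: the ampersand itself gets escaped.
        System.out.println(serialize("&#234;", "US-ASCII"));
    }
}
```

With the serializer bundled in the JDK, the first call should write the character as a numeric character reference because U+00EA cannot be represented in US-ASCII, while the second shows the double escaping from the question.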

Stephen C answered Oct 25 '22 03:10


How do you obtain the DOM? Could it have something to do with that? I tried your sample XML with the standard DocumentBuilder (just b/c I'm more familiar with it) using Sun Java 6 and the latest Xerces-J (2.9.1) which by the way deprecates XmlSerializer in favor of LSSerializer or TrAX.

Anyway, using this technique, the serialized document does not even contain the character reference anymore and gets converted to "Büro.txt". I used the following code:

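    import java.io.ByteArrayInputStream;
    import java.io.InputStream;
    import java.io.PrintWriter;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.apache.xml.serialize.XMLSerializer;
    import org.w3c.dom.Document;

    String xml = "<row>\n"
        + "        <link href=\"B&#252;ro.txt\" target=\"_blank\">\n"
        + "            my link\n" + "        </link>\n" + "    </row>";

    // The literal is pure ASCII, so the platform default charset is safe here.
    InputStream is = new ByteArrayInputStream(xml.getBytes());
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().parse(is);

    XMLSerializer xs = new XMLSerializer();
    xs.setOutputCharStream(new PrintWriter(System.err));

    xs.serialize(doc);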
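Since XMLSerializer is deprecated in favor of LSSerializer or TrAX, here is a rough equivalent of the snippet above using LSSerializer and JDK classes only. It is a sketch, not a drop-in replacement: the class name `ParseAndSerializeDemo` and its helper methods are made up for illustration, and the sample XML is condensed to one line.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;

// Hypothetical demo class: parse the sample XML, then serialize with DOM L3 LS.
class ParseAndSerializeDemo {

    static final String XML =
            "<row><link href=\"B&#252;ro.txt\" target=\"_blank\">my link</link></row>";

    /** Parses the sample; the parser resolves &#252; to the character U+00FC. */
    static Document parse() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the href attribute of the first link element as stored in the DOM. */
    static String hrefOf(Document doc) {
        Element link = (Element) doc.getElementsByTagName("link").item(0);
        return link.getAttribute("href");
    }

    /** Serializes the document to a String using LSSerializer. */
    static String serialize(Document doc) {
        DOMImplementationLS ls = (DOMImplementationLS)
                doc.getImplementation().getFeature("LS", "3.0");
        LSSerializer serializer = ls.createLSSerializer();
        return serializer.writeToString(doc);
    }

    public static void main(String[] args) {
        Document doc = parse();
        System.out.println(hrefOf(doc));
        System.out.println(serialize(doc));
    }
}
```

The point to take from it is that a parsed DOM holds the real character, not the reference, so the serializer is free to write it raw whenever the output encoding can represent it.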

musiKk answered Oct 25 '22 01:10