transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8") is NOT working

I have the following method to write an XML DOM document to a stream:

public void writeToOutputStream(Document fDoc, OutputStream out) throws Exception {
    fDoc.setXmlStandalone(true);
    DOMSource docSource = new DOMSource(fDoc);
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.setOutputProperty(OutputKeys.METHOD, "xml");
    transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    transformer.setOutputProperty(OutputKeys.INDENT, "no");
    transformer.transform(docSource, new StreamResult(out));
}

I am testing some other XML functionality, and this is just the method that I use to write to a file. My test program generates 33 test cases where files are written out. 28 of them have the following header:

<?xml version="1.0" encoding="UTF-8"?>...

But for some reason, one of the test cases now produces:

<?xml version="1.0" encoding="ISO-8859-1"?>...

And four more produce:

<?xml version="1.0" encoding="Windows-1252"?>...

As you can clearly see, I am setting the ENCODING output key to UTF-8. These tests used to work on an earlier version of Java. I have not run the tests in more than a year, but running them today on "Java(TM) SE Runtime Environment (build 1.6.0_22-b04)" I get this odd behavior.

I have verified that the documents causing the problem were read from files that originally had those encodings. It seems that the new versions of the libraries are attempting to preserve the encoding of the source file that was read. But that is not what I want: I really do want the output to be in UTF-8.

Does anyone know of any other factor that might cause the transformer to ignore the UTF-8 encoding setting? Is there anything else that has to be set on the document to make it forget the encoding of the file that was originally read?
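
One way to see what a parsed document remembers about its source is to ask the DOM directly. The sketch below is not from the original post: the class name EncodingProbe and the sample input are made up for illustration, and it assumes Java 7 or later (for StandardCharsets) with the JDK's default DOM Level 3 parser. Whether the transformer actually consults these values is the hypothesis stated above, not something this snippet proves.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper, not part of the original test program.
class EncodingProbe {

    /** Parses raw XML bytes into a DOM document. */
    static Document parse(byte[] xmlBytes) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));
    }

    /** Returns the encoding named in the XML declaration of the parsed input. */
    static String declaredEncoding(byte[] xmlBytes) throws Exception {
        return parse(xmlBytes).getXmlEncoding();
    }

    public static void main(String[] args) throws Exception {
        // Sample input: a document that declares, and is stored as, ISO-8859-1.
        byte[] latin1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a>\u00e9</a>"
                .getBytes(StandardCharsets.ISO_8859_1);
        Document doc = parse(latin1);
        System.out.println("declared encoding: " + doc.getXmlEncoding());
        System.out.println("input encoding:    " + doc.getInputEncoding());
    }
}
```

Running this against one of the failing test files would show whether the Document still carries the original encoding at the moment it is handed to the transformer.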

UPDATE:

I checked out the same project on another machine, then built and ran the tests there. On that machine all the tests pass! All the files have "UTF-8" in their header. That machine has "Java(TM) SE Runtime Environment (build 1.6.0_29-b11)". Both machines are running Windows 7. On the new machine that works correctly, jdk1.5.0_11 is used to make the build, but on the old machine jdk1.6.0_26 is used to make the build. The libraries used for both builds are exactly the same. Can it be a JDK 1.6 incompatibility with 1.5 at build time?

UPDATE:

After 4.5 years, the Java library is still broken, but due to the suggestion by Vyrx below, I finally have a proper solution!

public void writeToOutputStream(Document fDoc, OutputStream out) throws Exception {
    fDoc.setXmlStandalone(true);
    DOMSource docSource = new DOMSource(fDoc);
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.setOutputProperty(OutputKeys.METHOD, "xml");
    transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
    transformer.setOutputProperty(OutputKeys.INDENT, "no");
    out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>".getBytes("UTF-8"));
    transformer.transform(docSource, new StreamResult(out));
}

The solution is to disable the writing of the header, and to write the correct header just before serializing the XML to the output stream. Lame, but it produces the correct results. Tests broken over 4 years ago are now running again!
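
To check the workaround in isolation, a small harness along the following lines may help. It is only a sketch: the class name Utf8HeaderCheck, the helper methods, and the sample input are invented here, and it assumes Java 7 or later. Only writeToOutputStream mirrors the method above. The harness feeds in a document declared as ISO-8859-1 and confirms that the output starts with the hand-written UTF-8 header and still parses to the same text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical harness; only writeToOutputStream mirrors the method from the post.
class Utf8HeaderCheck {

    static final String HEADER = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";

    static Document parse(byte[] xmlBytes) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));
    }

    /** Suppresses the generated declaration and writes a fixed UTF-8 one instead. */
    static void writeToOutputStream(Document doc, OutputStream out) throws Exception {
        doc.setXmlStandalone(true);
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        transformer.setOutputProperty(OutputKeys.INDENT, "no");
        out.write(HEADER.getBytes(StandardCharsets.UTF_8));
        transformer.transform(new DOMSource(doc), new StreamResult(out));
    }

    /** Parses the input and serializes it again with the hand-written header. */
    static byte[] reencode(byte[] xmlBytes) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeToOutputStream(parse(xmlBytes), out);
        return out.toByteArray();
    }

    /** Text content of the root element, used to confirm the round trip. */
    static String rootText(byte[] xmlBytes) throws Exception {
        return parse(xmlBytes).getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        byte[] latin1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a>\u00e9</a>"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = reencode(latin1);
        System.out.println(new String(utf8, StandardCharsets.UTF_8));
        System.out.println("round trip ok: " + rootText(latin1).equals(rootText(utf8)));
    }
}
```

Re-parsing the output is a stricter check than comparing headers, because it fails if the declared encoding and the actual bytes disagree.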

AgilePro asked Mar 23 '13 21:03


3 Answers

I had the same problem on Android when serializing emoji characters. With UTF-8 encoding set on the transformer, the output contained numeric character references for the individual UTF-16 surrogates instead of the characters themselves, which subsequently broke other parsers that read the data.

This is how I ended up solving it:

StringWriter sw = new StringWriter();
sw.write("<?xml version=\"1.0\" encoding=\"UTF-8\" ?>");
Transformer t = TransformerFactory.newInstance().newTransformer();

// UTF-16 works here because the result is a Java String (via StringWriter), not bytes written to an output stream
t.setOutputProperty(OutputKeys.ENCODING, "UTF-16"); 
t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
t.transform(new DOMSource(elementNode), new StreamResult(sw));

// IOUtils comes from Apache Commons IO; the string is encoded as UTF-8 only at this point
return IOUtils.toInputStream(sw.toString(), Charset.forName("UTF-8"));
Vyrx answered Nov 07 '22 15:11
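
The same idea can be tried without Commons IO. The following is a rough sketch under assumptions, not the answerer's code: the class name StringWriterSerializer, its helper methods, and the sample element are hypothetical, and it targets the JDK's built-in transformer on Java 7 or later rather than Android. It serializes to characters first and converts to UTF-8 bytes as a separate step.

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical class; it follows the StringWriter approach described in this answer.
class StringWriterSerializer {

    static final String XML_HEADER = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";

    /** Serializes a node to characters first, then encodes those characters as UTF-8. */
    static byte[] toUtf8Bytes(Node node) throws Exception {
        StringWriter sw = new StringWriter();
        sw.write(XML_HEADER);
        Transformer t = TransformerFactory.newInstance().newTransformer();
        // The target is a Writer, so no byte encoding happens during the transform.
        t.setOutputProperty(OutputKeys.ENCODING, "UTF-16");
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        t.transform(new DOMSource(node), new StreamResult(sw));
        return sw.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Builds a one-element document holding the given text (demo input only). */
    static Document singleElement(String name, String text) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement(name);
        root.setTextContent(text);
        doc.appendChild(root);
        return doc;
    }

    /** Re-parses serialized bytes and returns the root element's text. */
    static String rootText(byte[] xmlBytes) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes))
                .getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String emoji = "\uD83D\uDE00"; // one code point outside the BMP, stored as a surrogate pair
        byte[] bytes = toUtf8Bytes(singleElement("msg", emoji).getDocumentElement());
        System.out.println("emoji survived: " + emoji.equals(rootText(bytes)));
    }
}
```

Checking the result by re-parsing it, rather than by inspecting the raw text, catches the failure described above, since references to lone surrogates are rejected by a conforming parser.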


To answer the question: the following code works for me. It reads the data in an input encoding and converts it to an output encoding.

// Parse the fragment, decoding its bytes with the input encoding.
ByteArrayInputStream inStreamXMLElement = new ByteArrayInputStream(strXMLElement.getBytes(input_encoding));
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document docRepeat = db.parse(new InputSource(new InputStreamReader(inStreamXMLElement, input_encoding)));
Node elementNode = docRepeat.getElementsByTagName(strRepeat).item(0);

// Serialize the selected element without an XML declaration.
DOMSource domSourceRepeat = new DOMSource(elementNode);
TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer transformer = tFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
transformer.setOutputProperty(OutputKeys.ENCODING, output_encoding);

// Write through a Writer so the output encoding is applied explicitly.
ByteArrayOutputStream bos = new ByteArrayOutputStream();
StreamResult sr = new StreamResult(new OutputStreamWriter(bos, output_encoding));

transformer.transform(domSourceRepeat, sr);
byte[] outputBytes = bos.toByteArray();
strRepeatString = new String(outputBytes, output_encoding);
Ramesh Reddy answered Nov 07 '22 14:11


I spent a significant amount of time debugging this issue because it worked well on my machine (Ubuntu 14 + Java 1.8.0_45) but did not work properly in production (Alpine Linux + Java 1.7).

Contrary to my expectation, the following approach from the answer above didn't help:

ByteArrayOutputStream bos = new ByteArrayOutputStream();
StreamResult sr = new StreamResult(new OutputStreamWriter(bos, "UTF-8"));

But this one (Scala syntax) worked as expected:

val out = new StringWriter()
val result = new StreamResult(out)
expert answered Nov 07 '22 15:11