
Handling change in newlines by XML transformation for CDATA from Java 8 to Java 11

Java 9 changed the way javax.xml.transform.Transformer with OutputKeys.INDENT handles CDATA sections. In short, in Java 8 an element named 'test' containing some character data would result in:

<test><![CDATA[data]]></test>

But with Java 9 the same input results in:

<test>
    <![CDATA[data]]>
</test>

This is not the same XML, because the added whitespace becomes part of the element's text content.

I understood (from a source that is no longer available) that Java 9 had a workaround: using a DocumentBuilderFactory with setIgnoringElementContentWhitespace(true). This no longer works in Java 11.

Does anyone know a way to deal with this in Java 11? I'm looking for either a way to prevent the extra newlines (while still being able to format my XML) or a way to ignore them when parsing the XML (preferably using SAX).

Unfortunately, I don't know what the CDATA section will actually contain in my application. It might begin or end with whitespace or newlines, so I can't just strip them when reading the XML or when setting the value in the resulting object.

Sample program to demonstrate the issue:

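import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class CdataIndentDemo
{
    public static void main(String[] args) throws TransformerException, ParserConfigurationException, IOException, SAXException
    {
        String data = "data";

        StreamSource source = new StreamSource(new StringReader("<foo><bar><![CDATA[" + data + "]]></bar></foo>"));
        StreamResult result = new StreamResult(new StringWriter());

        // Identity transform with indentation switched on.
        Transformer tform = TransformerFactory.newInstance().newTransformer();
        tform.setOutputProperty(OutputKeys.INDENT, "yes");
        tform.transform(source, result);

        String xml = result.getWriter().toString();

        System.out.println(xml); // I expect bar and CDATA to be on the same line. This is true for Java 8, false for Java 11

        // Parse the formatted XML again and read back the content of <bar>.
        Document document = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));

        String resultData = document.getElementsByTagName("bar")
            .item(0)
            .getTextContent();

        System.out.println(data.equals(resultData)); // True for Java 8, false for Java 11
    }
}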

EDIT: For future reference, I've submitted a bug report to Oracle, and this is fixed in Java 14: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8223291

Rick asked Apr 25 '19 15:04


1 Answer

Since your code relies on unspecified behavior, handling this explicitly in code seems better:

  • You want indentation like:

    tform.setOutputProperty(OutputKeys.INDENT, "yes");
    tform.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
    
  • However, not for elements containing a CDATA section:

    String xml = result.getWriter().toString();
    // Remove the indentation whitespace around the CDATA section of elements that contain one.
    xml = xml.replaceAll("(?s)>\\s*(<\\!\\[CDATA\\[.*?]]>)\\s*</", ">$1</");
    

The regex uses:

  • (?s) (the DOTALL flag) to have . match any character, including newline characters.
  • .*? to match the shortest possible sequence, so that a single match does not span "...]]>...]]>".
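
The post-processing step can be tried in isolation on a string that already looks like the Java 11 output. This is a minimal sketch, assuming the (?s) flag is part of the pattern; the class name CdataIndentFix and the method collapseCdataWhitespace are hypothetical names chosen for illustration.

```java
import java.util.regex.Pattern;

public class CdataIndentFix {

    // (?s) lets '.' match newlines inside the CDATA payload;
    // .*? keeps the match as short as possible, ending at the first "]]>"
    // that is followed by optional whitespace and a closing tag.
    private static final Pattern CDATA_IN_ELEMENT =
        Pattern.compile("(?s)>\\s*(<!\\[CDATA\\[.*?]]>)\\s*</");

    /** Drops whitespace between a start tag, a CDATA section and the following end tag. */
    public static String collapseCdataWhitespace(String xml) {
        return CDATA_IN_ELEMENT.matcher(xml).replaceAll(">$1</");
    }

    public static void main(String[] args) {
        // Hand-written stand-in for indented transformer output.
        String indented =
            "<foo>\n    <bar>\n        <![CDATA[ line1\nline2 ]]>\n    </bar>\n</foo>\n";
        System.out.println(collapseCdataWhitespace(indented));
    }
}
```

Whitespace inside the CDATA section is left untouched; only the whitespace between the tags and the section is removed. Elements with mixed content (text or child elements next to the CDATA section) may need a stricter pattern.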

Alternatively, in a DOM tree (which preserves CDATA sections) you can retrieve all CDATA sections via XPath and remove their whitespace-only siblings through the parent element.
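
The DOM approach could look roughly like the sketch below. It walks the tree directly instead of using XPath, because the XPath data model treats CDATA sections as ordinary text, and it assumes the default non-coalescing DocumentBuilderFactory so that CDATA sections show up as separate nodes. The class name CdataSiblingCleaner and its methods are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class CdataSiblingCleaner {

    /** Removes whitespace-only text nodes from every element that has a CDATA child. */
    public static void stripWhitespaceAroundCdata(Node parent) {
        boolean hasCdata = false;
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.CDATA_SECTION_NODE) {
                hasCdata = true;
            }
        }
        // Collect first, remove afterwards, so the sibling iteration stays valid.
        List<Node> toRemove = new ArrayList<>();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (hasCdata && c.getNodeType() == Node.TEXT_NODE
                    && c.getNodeValue().trim().isEmpty()) {
                toRemove.add(c);
            } else if (c.getNodeType() == Node.ELEMENT_NODE) {
                stripWhitespaceAroundCdata(c);
            }
        }
        for (Node n : toRemove) {
            parent.removeChild(n);
        }
    }

    /** Parses the XML, cleans it, and returns the text content of the first element with the given name. */
    public static String readElement(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            stripWhitespaceAroundCdata(doc.getDocumentElement());
            return doc.getElementsByTagName(tagName).item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String indented = "<foo>\n    <bar>\n        <![CDATA[data]]>\n    </bar>\n</foo>";
        System.out.println("data".equals(readElement(indented, "bar")));
    }
}
```

Because only whitespace-only text nodes outside the CDATA section are removed, leading or trailing whitespace inside the section itself survives.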

Joop Eggen answered Nov 07 '22 00:11