With Java 9 there was a change in the way javax.xml.transform.Transformer
with OutputKeys.INDENT
handles CDATA tags. In short, in Java 8 a tag named 'test' containing some character data would result in:
<test><![CDATA[data]]></test>
But with Java 9 the same results in
<test>
<![CDATA[data]]>
</test>
Which is not the same XML.
I understood (from a source no longer available) that for Java 9 there was a workaround using a DocumentBuilderFactory
with setIgnoringElementContentWhitespace=true
but this no longer works for Java 11.
Does anyone know a way to deal with this in Java 11? I'm either looking for a way to prevent the extra newlines (but still be able to format my XML), or be able to ignore them when parsing the XML (preferably using SAX).
Unfortunately I don't know what the CDATA tag will actually contain in my application. It might begin or end with white space or newlines so I can't just strip them when reading the XML or actually setting the value in the resulting object.
Sample program to demonstrate the issue:
public static void main(String[] args) throws TransformerException, ParserConfigurationException, IOException, SAXException
{
String data = "data";
StreamSource source = new StreamSource(new StringReader("<foo><bar><![CDATA[" + data + "]]></bar></foo>"));
StreamResult result = new StreamResult(new StringWriter());
Transformer tform = TransformerFactory.newInstance().newTransformer();
tform.setOutputProperty(OutputKeys.INDENT, "yes");
tform.transform(source, result);
String xml = result.getWriter().toString();
System.out.println(xml); // I expect bar and CDATA to be on same line. This is true for Java 8, false for Java 11
Document document = DocumentBuilderFactory.newInstance()
.newDocumentBuilder()
.parse(new InputSource(new StringReader(xml)));
String resultData = document.getElementsByTagName("bar")
.item(0)
.getTextContent();
System.out.println(data.equals(resultData)); // True for Java 8, false for Java 11
}
EDIT: For future reference, I've submitted a bug report to Oracle, and this is fixed in Java 14: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8223291
As your code relies on unspecified behavior, extra explicit code seems better:
You want indentation like:
tform.setOutputProperty(OutputKeys.INDENT, "yes");
tform.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
However not for elements containing a CDATA.
String xml = result.getWriter().toString();
// No indentation (whitespace) for elements with a CDATA section.
xml = xml.replaceAll(">\\s*(<\\!\\[CDATA\\[.*?]]>)\\s*</", ">$1</");
The regex uses:
(?s)
DOT_ALL to have .
match any character, also newline characters..*?
the shortest matching sequence, to not match "...]]>...]]>".Alternatively: In a DOM tree (preserving CDATA) you can retrieve all CDATA sections per XPath, and remove whitespace siblings using the parent element.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With