The XML spec defines a subset of Unicode characters that are allowed in XML documents: http://www.w3.org/TR/REC-xml/#charsets.
How do I filter out the characters that fall outside this subset from a String in Java?
A simple test case:
Assert.assertEquals("", filterIllegalXML("" + Character.valueOf((char) 2)));
It's not trivial to find all the characters that are invalid in XML. You need to call or reimplement XMLChar.isInvalid() from Xerces:
http://kickjava.com/src/org/apache/xerces/util/XMLChar.java.htm
This page includes a Java method for stripping out invalid XML characters by testing whether each character is within the range the spec allows, though it doesn't check for the characters the spec marks as highly discouraged.
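As a rough illustration of that per-character test, here is a minimal sketch that assumes XML 1.0 and checks each code point against the Char production from the spec (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]). The class name XmlCharFilter and the helper isLegalXml10 are made up for this example; they are not the Xerces API and not the method from the linked page. Like that method, it does not remove the discouraged characters.

```java
final class XmlCharFilter {

    private XmlCharFilter() {}

    // Hypothetical helper: true if the code point matches the XML 1.0 Char production.
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the input with every code point outside the
    // XML 1.0 Char production dropped. A null input is returned as null.
    static String filterIllegalXML(String in) {
        if (in == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            // Iterate by code point so that valid surrogate pairs are kept
            // together; a lone surrogate comes back as its own value and is
            // rejected by the range check.
            int cp = in.codePointAt(i);
            if (isLegalXml10(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Walking the string by code point rather than by char matters here: a char-by-char loop would see each half of a supplementary character as a surrogate and strip perfectly legal text.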
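Incidentally, escaping the characters is not a general solution. XML 1.0 does not allow the invalid characters in escaped form either, because a character reference must still match the Char production. XML 1.1 does permit most control characters as character references, but still not U+0000.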