Filtering illegal XML characters in Java

Tags: java, xml, unicode

The XML spec defines a subset of Unicode characters that are allowed in XML documents: http://www.w3.org/TR/REC-xml/#charsets.

How do I filter out these characters from a String in Java?

A simple test case:

  Assert.assertEquals("", filterIllegalXML("" + Character.valueOf((char) 2)));
Grzegorz Oledzki asked May 24 '10 12:05


2 Answers

It's not trivial to enumerate all the characters that are invalid in XML. You need to call, or reimplement, XMLChar.isInvalid() from Xerces:

http://kickjava.com/src/org/apache/xerces/util/XMLChar.java.htm

ZZ Coder answered Oct 29 '22 06:10


This page includes a Java method for stripping out invalid XML characters by testing whether each character is within spec, though it doesn't check for highly discouraged characters.
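A filter of that kind, one that tests each character against the spec, might look like the minimal sketch below. It assumes XML 1.0 and follows that spec's Char production (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF). The class name XmlCharFilter and the helper isLegalXml10 are hypothetical names chosen for illustration; filterIllegalXML is the name used in the question's test case. It is not the code from the linked page, and like that method it does not check for discouraged characters.

```java
final class XmlCharFilter {

    private XmlCharFilter() {}

    /** Returns true if the code point matches the Char production of XML 1.0. */
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the input with every character that is illegal in XML 1.0 removed. */
    static String filterIllegalXML(String input) {
        if (input == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(input.length());
        // Walk by code point rather than by char, so that valid surrogate
        // pairs are kept together and unpaired surrogates are dropped.
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (isLegalXml10(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Iterating by code point instead of by char is a deliberate choice here: a char-by-char range check would either reject every supplementary character or let unpaired surrogates through.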

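Incidentally, escaping the characters is not a solution under XML 1.0, since that spec does not allow the invalid characters in escaped form either. (XML 1.1 is more permissive: it allows most control characters when they are written as character references, though never U+0000.)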

Stephen C answered Oct 29 '22 06:10