Regarding this question: removing invalid XML characters from a string in java, in @McDowell's response he/she said that one way to remove invalid XML characters is:
String xml10pattern = "[^"
+ "\u0009\r\n" // #x9 | #xA | #xD
+ "\u0020-\uD7FF" // [#x20-#xD7FF]
+ "\uE000-\uFFFD" // [#xE000-#xFFFD]
+ "\ud800\udc00-\udbff\udfff" // [#x10000-#x10FFFF]
+ "]";
and then:
replaceAll(xml10pattern, "");
Well, I have two questions:
1) Should the pattern be written as \\u0009\\u000A\\u000D..., instead of \u0009\r\n, like I've seen in @ogrisel's response to "Stripping Invalid XML characters in Java"?
2) Why is the range (U+10000–U+10FFFF) converted into "\ud800\udc00-\udbff\udfff"? Couldn't it be "\u10000-\u10FFFF"? I really have to detect or filter these kinds of characters, and I'm not completely sure how to do it.
By the way, this has to work on JDK 1.5 (so expressions like \x{h...h} are not allowed).
Thanks a lot.
======UPDATE======
The way I was thinking of detecting whether a String str contains such invalid characters is:
if (!str.replaceAll(pattern, "").equals(str)) {
// Contains characters that are not valid in XML.
}
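A possible alternative to the check above, as a rough sketch only: compile the pattern once and ask the matcher whether it finds any invalid character, which avoids building a second string just to compare it. The names XmlCharCheck, INVALID_XML10 and containsInvalidXmlChars are made up for illustration; the character class is the one quoted from @McDowell's response.

```java
class XmlCharCheck {
    // Same negated character class as in the question, compiled once and reused.
    // The last range is written as surrogate pairs, i.e. U+10000 to U+10FFFF.
    private static final java.util.regex.Pattern INVALID_XML10 =
        java.util.regex.Pattern.compile("[^"
            + "\u0009\r\n"                // #x9 | #xA | #xD
            + "\u0020-\uD7FF"             // [#x20-#xD7FF]
            + "\uE000-\uFFFD"             // [#xE000-#xFFFD]
            + "\ud800\udc00-\udbff\udfff" // [#x10000-#x10FFFF]
            + "]");

    // Returns true as soon as one character outside the XML 1.0 ranges is found.
    static boolean containsInvalidXmlChars(String str) {
        return INVALID_XML10.matcher(str).find();
    }
}
```

Only Pattern.compile, Pattern.matcher and Matcher.find are used, all of which should be available on JDK 1.5.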
Any other advice would be very welcome ;)
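1) It works both ways: \u0009 is a Java escape sequence, which the compiler replaces with the actual character before the regex engine ever sees it, while \\u0009 is a regex escape sequence, which the regex engine interprets itself.
2) A Java String is UTF-16 encoded, so U+10000 is encoded as two 16-bit chars (a surrogate pair), \ud800\udc00; see "Unicode Character Representations" in the Character API documentation. A \u escape takes exactly four hex digits, so "\u10000" would be read as \u1000 followed by the character '0'.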
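Both points can be checked with a small sketch like the one below. It is not taken from either linked answer: the class name XmlEscapeDemo and its members are hypothetical. The second pattern uses regex-level escapes only for the BMP ranges and keeps the supplementary range as literal surrogate pairs, on the assumption that this is the safer choice for an old JDK.

```java
class XmlEscapeDemo {
    // Variant 1: Java-level escapes. The compiler places the real characters
    // (tab, CR, LF, ...) into the string, and the regex engine sees them literally.
    static final String JAVA_ESCAPES = "[^"
        + "\u0009\r\n"
        + "\u0020-\uD7FF"
        + "\uE000-\uFFFD"
        + "\ud800\udc00-\udbff\udfff"
        + "]";

    // Variant 2: regex-level escapes for the BMP ranges. The string holds a
    // backslash followed by letters and digits, which the regex engine decodes.
    static final String REGEX_ESCAPES = "[^"
        + "\\u0009\\r\\n"
        + "\\u0020-\\uD7FF"
        + "\\uE000-\\uFFFD"
        + "\ud800\udc00-\udbff\udfff"
        + "]";

    // Removes every character matched by the given negated character class.
    static String strip(String input, String pattern) {
        return input.replaceAll(pattern, "");
    }

    // Builds the UTF-16 form of a code point; code points above U+FFFF
    // come back as two chars (a surrogate pair).
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }
}
```

For instance, XmlEscapeDemo.fromCodePoint(0x10000) yields the two chars \ud800 and \udc00, which is why the upper range has to be spelled with surrogate pairs, and both pattern variants strip the same characters from a given input.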