
Detect non valid XML characters

Regarding the question "removing invalid XML characters from a string in java": in his or her answer, @McDowell says that one way to remove invalid XML characters is:

String xml10pattern = "[^"
                + "\u0009\r\n" // #x9 | #xA | #xD 
                + "\u0020-\uD7FF" // [#x20-#xD7FF]
                + "\uE000-\uFFFD" // [#xE000-#xFFFD] 
                + "\ud800\udc00-\udbff\udfff" // [#x10000-#x10FFFF]
                + "]";

and then:

replaceAll(xml10pattern, "");

Well, I have two questions:

  • Shouldn't all Unicode characters be escaped? I mean \\u0009\\u000A\\u000D..., instead of \u0009\r\n, as I've seen in @ogrisel's answer to "Stripping Invalid XML characters in Java".
  • I don't understand how the last range (U+10000–U+10FFFF) converts into "\ud800\udc00-\udbff\udfff". Couldn't it be "\u10000-\u10FFFF"?

I really have to detect or filter these kinds of characters, and I'm not completely sure how to do it.

By the way, this has to work on JDK 1.5 (so expressions like \x{h...h} are not allowed).

Thanks a lot.

======UPDATE======

The way I was thinking of detecting whether a String str contains such invalid characters is:

if (!str.replaceAll(xml10pattern, "").equals(str)) {
    // str contains characters that are not valid in XML 1.0.
}
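The same check can probably be done without building a second string. What follows is only a minimal sketch under stated assumptions: the class name XmlCharDetector and the method containsInvalidXmlChars are hypothetical names picked for illustration, and the character class is the xml10pattern quoted above. It uses only java.util.regex calls that exist on JDK 1.5. Matcher.find() stops at the first offending character, and compiling the pattern once avoids the recompilation that String.replaceAll does on every call.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not from any library.
final class XmlCharDetector {
    // Same negated character class as xml10pattern: anything XML 1.0 does not allow.
    private static final Pattern INVALID_XML10 = Pattern.compile("[^"
            + "\u0009\r\n"                  // tab, carriage return, line feed
            + "\u0020-\uD7FF"               // most of the Basic Multilingual Plane
            + "\uE000-\uFFFD"               // the rest of the BMP, minus surrogates
            + "\ud800\udc00-\udbff\udfff"   // U+10000 to U+10FFFF as surrogate pairs
            + "]");

    // Returns true if str contains at least one character that XML 1.0 forbids.
    static boolean containsInvalidXmlChars(String str) {
        return INVALID_XML10.matcher(str).find();
    }
}
```

With this, the test becomes if (XmlCharDetector.containsInvalidXmlChars(str)) { ... }, and the replaceAll call is only needed when the characters actually have to be stripped.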

Any other advice would be very welcome ;)

Albert asked May 29 '26 06:05


1 Answer

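1) It works both ways. \u0009 is a Java escape sequence: the compiler replaces it with the actual character before the regex engine ever sees the pattern. \\u0009 is a regex escape sequence: the string contains a backslash followed by u0009, and the regex engine resolves it. The Java form cannot be used for \u000A or \u000D, because the compiler would turn them into a real line break inside the string literal, which is why the pattern writes \r\n instead.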

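2) A Java String is UTF-16 encoded, so U+10000 is represented by two 16-bit chars (a surrogate pair), \ud800\udc00, and U+10FFFF by \udbff\udfff. "\u10000" cannot work because \u takes exactly four hex digits: it would be read as \u1000 followed by the character 0. See "Unicode Character Representations" in the Character API documentation.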
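Both points can be checked with a few lines of code. This is a small sketch rather than anything authoritative: EscapeDemo, fromCodePoint and classMatchesTab are hypothetical names made up for the example, and everything it calls (Character.toChars, String.matches) is available on JDK 1.5.

```java
// Hypothetical demo class; the names are illustrative only.
final class EscapeDemo {
    // Builds the String that holds exactly one code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // Does the one-character class built from the given escape match a TAB?
    static boolean classMatchesTab(String escape) {
        return "\t".matches("[" + escape + "]");
    }

    public static void main(String[] args) {
        // Point 1: two spellings, same regex behaviour.
        System.out.println("\u0009".length());          // 1: already a real TAB
        System.out.println("\\u0009".length());         // 6: backslash, u, 0, 0, 0, 9
        System.out.println(classMatchesTab("\u0009"));  // true
        System.out.println(classMatchesTab("\\u0009")); // true

        // Point 2: supplementary code points occupy two chars (a surrogate pair).
        System.out.println(fromCodePoint(0x10000).equals("\ud800\udc00"));  // true
        System.out.println(fromCodePoint(0x10FFFF).equals("\udbff\udfff")); // true

        // A four-digit escape followed by a literal '0', not U+10000.
        String notSupplementary = "\u10000";
        System.out.println(notSupplementary.length());                  // 2
        System.out.println((int) notSupplementary.charAt(0) == 0x1000); // true
    }
}
```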

Evgeniy Dorofeev answered May 31 '26 20:05


