Hi, I would like to remove all invalid XML characters from a string. I would like to use a regular expression with the String.replace method, like this:
line.replace(regExp, "");
What is the right regExp to use?
An invalid XML character is anything that is not one of these:
[#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Thanks.
If you're unable to identify this character visually, you can use a text editor such as TextPad to view your source file. Within the application, use the Find function, select "hex", and search for the character mentioned. Removing these characters from your source file resolves the invalid XML character issue.
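If a hex search in an editor is not convenient, the offending characters can also be located programmatically. The following is a minimal sketch, assuming the file content has already been read into a String and that the XML 1.0 character ranges apply; the class and method names (InvalidXmlCharFinder, isXml10Char, findInvalid) are hypothetical and only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class InvalidXmlCharFinder {
    // XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns one entry per invalid code point: its char index and hex value.
    static List<String> findInvalid(String text) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isXml10Char(cp)) {
                hits.add(String.format("index %d: U+%04X", i, cp));
            }
            // Advance by 2 chars for supplementary code points, 1 otherwise.
            i += Character.charCount(cp);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical sample input containing a NUL and a vertical tab.
        for (String hit : findInvalid("ok\u0000 and \u000B")) {
            System.out.println(hit);
        }
    }
}
```

The reported index is a UTF-16 char offset into the string, which may differ from the byte offset shown by a hex editor.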
Java's regex engine supports supplementary characters, so you can specify the high range as surrogate pairs, that is, two UTF-16 chars per code point.
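To illustrate the point about surrogate pairs, here is a small sketch showing that a character class written with two UTF-16 chars per endpoint is treated as a range of whole code points. The class name SupplementaryRangeDemo and the choice of U+1F600 as sample input are assumptions made for this example only.

```java
import java.util.regex.Pattern;

class SupplementaryRangeDemo {
    // U+10000 is the surrogate pair D800 DC00; U+10FFFF is DBFF DFFF.
    // The class below therefore covers every supplementary code point.
    static final Pattern SUPPLEMENTARY =
            Pattern.compile("[\ud800\udc00-\udbff\udfff]");

    public static void main(String[] args) {
        // U+1F600 occupies two UTF-16 chars but is a single code point.
        String sample = new String(Character.toChars(0x1F600));
        System.out.println(sample.length());
        // The whole two-char string matches the one-character class.
        System.out.println(SUPPLEMENTARY.matcher(sample).matches());
        // A BMP character such as 'A' is outside the range.
        System.out.println(SUPPLEMENTARY.matcher("A").find());
    }
}
```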
Here is the pattern for removing characters that are illegal in XML 1.0:
// XML 1.0
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml10pattern = "[^"
        + "\u0009\r\n"
        + "\u0020-\uD7FF"
        + "\uE000-\uFFFD"
        + "\ud800\udc00-\udbff\udfff"
        + "]";
Most people will want the XML 1.0 version.
Here is the pattern for removing characters that are illegal in XML 1.1:
// XML 1.1
// [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml11pattern = "[^"
        + "\u0001-\uD7FF"
        + "\uE000-\uFFFD"
        + "\ud800\udc00-\udbff\udfff"
        + "]+";
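You will need to use String.replaceAll(...) and not String.replace(...), because replace treats its first argument as literal text while replaceAll interprets it as a regular expression.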
String illegal = "Hello, World!\0";
String legal = illegal.replaceAll(xml10pattern, "");
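Putting the pieces together, a self-contained version of the XML 1.0 case might look like the sketch below. It precompiles the pattern once instead of calling String.replaceAll on every line, which is a design choice of this example rather than part of the answer above; the names XmlCharStripper and stripIllegalXml10 are hypothetical.

```java
import java.util.regex.Pattern;

class XmlCharStripper {
    // Matches any single code point that is NOT allowed by XML 1.0:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    private static final Pattern ILLEGAL_XML10 = Pattern.compile(
            "[^"
            + "\u0009\r\n"
            + "\u0020-\uD7FF"
            + "\uE000-\uFFFD"
            + "\ud800\udc00-\udbff\udfff"
            + "]");

    // Returns the input with every illegal XML 1.0 character removed.
    static String stripIllegalXml10(String input) {
        return ILLEGAL_XML10.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String illegal = "Hello, World!\0"; // ends with a NUL character
        String legal = stripIllegalXml10(illegal);
        System.out.println(illegal.length()); // 14
        System.out.println(legal.length());   // 13
    }
}
```

Compiling the pattern once avoids recompiling the regex for each call, which matters mainly when many lines are processed in a loop.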