
Removing invalid XML characters from a string in Java

Hi, I would like to remove all invalid XML characters from a string. I would like to use a regular expression with the String.replace method, like:

line.replace(regExp, "");

What is the right regExp to use?

An invalid XML character is anything that does not match this:

[#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Thanks.

yossi asked Nov 21 '10 11:11


1 Answer

Java's regex supports supplementary characters, so you can specify those high ranges with two UTF-16 encoded chars.
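To illustrate that point, here is a small sketch showing a character class whose range endpoints are written as surrogate pairs. The class name SupplementaryRangeDemo and the method isSupplementary are made up for this example; the only assumption is that java.util.regex.Pattern reads each surrogate pair in the pattern as a single code point, which is what the patterns further down rely on.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
final class SupplementaryRangeDemo {
    // U+10000..U+10FFFF, with each endpoint written as a UTF-16 surrogate pair.
    private static final Pattern SUPPLEMENTARY =
            Pattern.compile("[\ud800\udc00-\udbff\udfff]");

    // True if the whole string is exactly one supplementary code point.
    static boolean isSupplementary(String s) {
        return SUPPLEMENTARY.matcher(s).matches();
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the Basic Multilingual Plane.
        String emoji = new String(Character.toChars(0x1F600));
        System.out.println(emoji.length());         // 2 (two UTF-16 chars)
        System.out.println(isSupplementary(emoji)); // true
        System.out.println(isSupplementary("A"));   // false
    }
}
```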

Here is the pattern for removing characters that are illegal in XML 1.0:

// XML 1.0
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml10pattern = "[^"
                    + "\u0009\r\n"
                    + "\u0020-\uD7FF"
                    + "\uE000-\uFFFD"
                    + "\ud800\udc00-\udbff\udfff"
                    + "]";

Most people will want the XML 1.0 version.

Here is the pattern for removing characters that are illegal in XML 1.1:

// XML 1.1
// [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml11pattern = "[^"
                    + "\u0001-\uD7FF"
                    + "\uE000-\uFFFD"
                    + "\ud800\udc00-\udbff\udfff"
                    + "]+";

You will need to use String.replaceAll(...) and not String.replace(...), because replace treats its first argument as literal text rather than as a regular expression.

String illegal = "Hello, World!\0";
// Use xml10pattern (or xml11pattern for XML 1.1) as defined above.
String legal = illegal.replaceAll(xml10pattern, "");
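For repeated use it may be worth compiling the pattern once instead of calling replaceAll with a string each time. The following is a minimal sketch under that assumption; the class name XmlSanitizer and the method stripInvalidXml10 are hypothetical, and the pattern is the XML 1.0 one from above with a trailing + so that runs of illegal characters are removed in one match.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, shown only as an illustration.
final class XmlSanitizer {
    // Matches runs of code points that are NOT legal in XML 1.0:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    private static final Pattern INVALID_XML10 = Pattern.compile(
            "[^"
            + "\u0009\r\n"
            + "\u0020-\uD7FF"
            + "\uE000-\uFFFD"
            + "\ud800\udc00-\udbff\udfff"
            + "]+");

    // Returns the input with every XML 1.0-illegal character removed.
    static String stripInvalidXml10(String input) {
        return INVALID_XML10.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String illegal = "Hello, World!\0";
        System.out.println(stripInvalidXml10(illegal)); // Hello, World!
    }
}
```

Tab, carriage return, line feed and supplementary characters pass through unchanged; other control characters such as NUL are dropped.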
McDowell answered Sep 30 '22 19:09