
Stripping Invalid XML characters in Java

Tags: java, xml

I have an XML file that is the output of a database dump. I'm using the Java SAX parser to parse the XML and write it out in a different format. The XML contains some invalid characters, and the parser throws errors like 'Invalid Unicode character (0x5)'.

Is there a good way to strip these characters out, other than pre-processing the file line by line and replacing them? So far I've run into three different invalid characters (0x5, 0x6 and 0x7). It's a ~4 GB database dump and we're going to process it many times, so waiting an extra 30 minutes for a pre-processor every time we get a new dump is going to be a pain. This also isn't the first time I've run into this issue.

Mason asked Sep 18 '08 15:09





1 Answer

I used the org.apache.xml.utils.XMLChar class from Xalan:

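    import org.apache.xml.utils.XMLChar;

    public static String stripInvalidXmlCharacters(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        // Walk by code point rather than by char, so that a surrogate pair
        // is tested as one character instead of two invalid halves.
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (XMLChar.isValid(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }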
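Since the question is about avoiding a separate pre-processing pass over a ~4 GB dump, the same validity check can also be applied while the parser is reading. The following is only a sketch under a few assumptions: XmlCharFilterReader is a hypothetical helper name (it is not part of Xalan or the JDK), the validity test is written out by hand from the XML 1.0 Char production instead of calling XMLChar, and the caller wraps a buffered reader opened with the dump's real encoding, because a character-stream InputSource ignores the encoding declared in the XML prolog.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: drops characters that are not legal in XML 1.0 as they are read. */
class XmlCharFilterReader extends FilterReader {

    private int pendingLow = -1;  // low surrogate of an accepted pair, returned on the next read()
    private int pushedBack = -1;  // char read as lookahead that still has to be examined

    XmlCharFilterReader(Reader in) {
        super(in);
    }

    /** BMP part of the XML 1.0 Char production; surrogates are handled separately as pairs. */
    static boolean isXmlChar(int c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD);
    }

    private int next() throws IOException {
        if (pushedBack != -1) {
            int c = pushedBack;
            pushedBack = -1;
            return c;
        }
        return in.read();
    }

    @Override
    public int read() throws IOException {
        if (pendingLow != -1) {
            int c = pendingLow;
            pendingLow = -1;
            return c;
        }
        while (true) {
            int c = next();
            if (c == -1) {
                return -1;
            }
            if (Character.isHighSurrogate((char) c)) {
                int d = next();
                if (d != -1 && Character.isLowSurrogate((char) d)) {
                    pendingLow = d;   // well-formed pair, i.e. U+10000..U+10FFFF: keep both halves
                    return c;
                }
                pushedBack = d;       // lone high surrogate is dropped; re-examine d
                continue;
            }
            if (isXmlChar(c)) {
                return c;             // lone low surrogates and control characters fall through
            }
        }
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1) {
                break;
            }
            buf[off + n++] = (char) c;
        }
        return n == 0 ? -1 : n;
    }

    public static void main(String[] args) throws Exception {
        // Illustrative input containing the three characters from the question.
        String dirty = "<row>a\u0005b\u0006c\u0007</row>";
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new XmlCharFilterReader(new StringReader(dirty))),
                new DefaultHandler() {
                    @Override
                    public void characters(char[] ch, int start, int length) {
                        System.out.print(new String(ch, start, length));
                    }
                });
    }
}
```

For a real dump, the StringReader would be replaced by something like a BufferedReader over an InputStreamReader with an explicit charset. The filter pulls one character at a time from the underlying reader, so buffering underneath it matters for throughput. Invalid characters are silently discarded, which suits the 0x5 to 0x7 case in the question but may not be appropriate if those bytes carry meaning in the source database.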
Bozho answered Sep 19 '22 06:09
