I have an XML file that's the output from a database. I'm using the Java SAX parser to parse the XML and output it in a different format. The XML contains some invalid characters and the parser is throwing errors like 'Invalid Unicode character (0x5)'
Is there a good way to strip all these characters out besides pre-processing the file line-by-line and replacing them? So far I've run into 3 different invalid characters (0x5, 0x6 and 0x7). It's a ~4gb database dump and we're going to be processing it a bunch of times, so having to wait an extra 30 minutes each time we get a new dump to run a pre-processor on it is going to be a pain, and this isn't the first time I've run into this issue.
Escape(yourstring) ? This will replace invalid XML characters in a string with their valid equivalent.
If you're unable to identify this character visually, then you can use a text editor such as TextPad to view your source file. Within the application, use the Find function and select "hex" and search for the character mentioned. Removing these characters from your source file resolve the invalid XML character issue.
I used Xalan org.apache.xml.utils.XMLChar
class:
public static String stripInvalidXmlCharacters(String input) { StringBuilder sb = new StringBuilder(); for (int i = 0; i < input.length(); i++) { char c = input.charAt(i); if (XMLChar.isValid(c)) { sb.append(c); } } return sb.toString(); }
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With