I'm using a third-party library that returns "XML" that is not valid: it contains invalid characters as well as undeclared entities. I need to parse this XML with a Java XML parser, but the parser is choking on it.
Is there a generic way to sanitize this XML so that it becomes valid?
I think your options are something like:
The first two are more heavyweight, given that they're designed to parse ill-formed HTML. If you know that the problems are due to encoding and entities, and the document is otherwise well formed, I'd suggest rolling your own.
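A roll-your-own cleaner could look roughly like the sketch below. It is only an illustration under some assumptions: the class and method names (XmlSanitizer, sanitize) are made up, the input is assumed to be already decoded into a String, and the only entities treated as "declared" are the five predefined XML ones plus numeric character references. It drops code points outside the XML 1.0 character range and escapes any other '&' so the parser sees it as literal text.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class XmlSanitizer {

    // An '&' that does NOT start a predefined entity or a numeric character reference.
    private static final Pattern STRAY_AMP = Pattern.compile(
            "&(?!(?:amp|lt|gt|quot|apos);|#[0-9]+;|#x[0-9A-Fa-f]+;)");

    // Character range allowed by the XML 1.0 "Char" production.
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String sanitize(String raw) {
        StringBuilder kept = new StringBuilder(raw.length());
        // Step 1: drop code points that XML 1.0 forbids (control chars, lone surrogates).
        raw.codePoints()
           .filter(XmlSanitizer::isXml10Char)
           .forEach(kept::appendCodePoint);
        // Step 2: escape ampersands that would otherwise start an undeclared entity.
        return STRAY_AMP.matcher(kept).replaceAll("&amp;");
    }
}
```

Note the trade-offs: an undeclared reference such as &nbsp; ends up as the literal text "&nbsp;" rather than a non-breaking space, ampersands inside CDATA sections get escaped too, and numeric references to forbidden characters (for example &#1;) are left alone and will still be rejected. If you know which entities the library emits, mapping them to numeric references instead may be a better fit.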
Sounds like you need to figure out whether there's a way to clean the data automatically yourself before handing it off to a parser. In what way are the characters invalid: are they not valid in the declared character set, or are they unescaped XML meta-characters such as '<'?
For undeclared entities, I once solved this by configuring a SAX parser with an error handler that basically ignored these errors. That might help you too. See the org.xml.sax.ErrorHandler API.
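As a rough sketch of that idea (not the exact code I used back then): the handler below records every problem instead of throwing. The class names LenientSaxDemo and RecordingErrorHandler and the method parseLeniently are invented for this example; the SAX types are the standard ones from org.xml.sax and javax.xml.parsers.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of any library.
class LenientSaxDemo {

    /** Records every reported problem instead of rethrowing it. */
    static class RecordingErrorHandler implements ErrorHandler {
        final List<String> problems = new ArrayList<>();

        @Override public void warning(SAXParseException e) {
            problems.add("warning: " + e.getMessage());
        }
        @Override public void error(SAXParseException e) {
            problems.add("error: " + e.getMessage());
        }
        @Override public void fatalError(SAXParseException e) {
            problems.add("fatal: " + e.getMessage());
        }
    }

    static List<String> parseLeniently(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        RecordingErrorHandler handler = new RecordingErrorHandler();
        reader.setContentHandler(new DefaultHandler()); // replace with your own handler
        reader.setErrorHandler(handler);
        try {
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // Many parsers still abort after a fatal error even if the handler returns.
            handler.problems.add("aborted: " + e.getMessage());
        }
        return handler.problems;
    }
}
```

Whether parsing actually continues past an undeclared entity depends on the parser and on whether the document has a DTD. Without a DTD the reference is a well-formedness (fatal) error, and most parsers stop there regardless of what the handler does. Xerces has a feature for continuing after fatal errors, but I would treat the result as best effort and check it against your real input.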