Sanitizing bad XML in Java

Tags:

java

xml

I'm using a third-party library that returns "XML" that is not valid: it contains invalid characters as well as undeclared entities. I need to parse this output with a Java XML parser, but the parser chokes on it.

Is there a generic way to sanitize this XML so that it becomes valid?

asked Oct 28 '08 16:10 by bajafresh4life


2 Answers

I think your options are something like:

  • TagSoup
  • JTidy
  • Roll your own.

The first two are more heavyweight, given that they're designed to parse ill-formed HTML. If you know that the problems are limited to encoding and entities, and the document is otherwise well-formed, I'd suggest you roll your own:

  • Standardize on a single encoding, such as UTF-8.
  • Use a standard encoder to escape the text between the > and < characters, so that stray characters in the text become proper entities.
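
A minimal sketch of the roll-your-own route, assuming the input is already decoded into a Java String and that the only problems are characters outside the XML 1.0 character range and ampersands that do not start a predefined or numeric reference. The class and method names (XmlSanitizer, sanitize, and so on) are hypothetical, not from any library. Note the design choice: an undeclared entity such as &nbsp; is turned into the literal text "&nbsp;" rather than resolved, and the regex is applied to the whole document, so it would also rewrite ampersands inside CDATA sections.

```java
import java.util.regex.Pattern;

final class XmlSanitizer {
    // An '&' that does not start one of the five predefined entities
    // or a decimal/hex character reference.
    private static final Pattern STRAY_AMPERSAND = Pattern.compile(
        "&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)");

    private XmlSanitizer() {}

    // True if the code point is allowed by the XML 1.0 "Char" production.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Drops every code point that XML 1.0 forbids (control chars, lone surrogates).
    static String stripInvalidChars(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
             .filter(XmlSanitizer::isValidXmlChar)
             .forEach(out::appendCodePoint);
        return out.toString();
    }

    // Escapes ampersands that would otherwise be read as undeclared entities.
    static String escapeStrayAmpersands(String input) {
        return STRAY_AMPERSAND.matcher(input).replaceAll("&amp;");
    }

    static String sanitize(String input) {
        return escapeStrayAmpersands(stripInvalidChars(input));
    }
}
```

Calling XmlSanitizer.sanitize(rawXml) before handing the string to the parser is the intended use. Numeric references that point at forbidden characters (for example &#0;) are left alone by this sketch and would still need separate handling.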
answered Nov 10 '22 04:11 by jamesh


Sounds like you need to figure out whether you can automatically clean the data yourself before handing it off to a parser. How exactly are the characters invalid: are they not valid in the declared character set, or are they unescaped XML metacharacters such as '<'?

For undeclared entities, I once solved this by configuring a SAX parser with an error handler that basically ignored these errors. That might help you too. See the org.xml.sax.ErrorHandler API.
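
A rough sketch of that kind of handler, using only the standard JAXP and SAX classes. The names LenientSaxDemo, LenientHandler and parseLeniently are made up for illustration. It records warnings and recoverable errors instead of failing on them. One assumption to be aware of: a problem the parser classifies as fatal (for instance, an undeclared entity in a document with no DTD) still aborts the parse here, because the SAX contract treats the document as unusable after fatalError; whether an undeclared entity is reported as recoverable or fatal depends on the parser and on whether the document has a DTD.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class LenientSaxDemo {
    // DefaultHandler implements ErrorHandler; we override only the error callbacks.
    static final class LenientHandler extends DefaultHandler {
        final List<String> problems = new ArrayList<>();

        @Override
        public void warning(SAXParseException e) {
            problems.add("warning: " + e.getMessage());
        }

        @Override
        public void error(SAXParseException e) {
            // Recoverable error: note it and let the parser carry on.
            problems.add("error: " + e.getMessage());
        }

        @Override
        public void fatalError(SAXParseException e) throws SAXException {
            // Well-formedness violations cannot be safely ignored.
            throw e;
        }
    }

    // Parses the XML and returns the recoverable problems that were ignored.
    static List<String> parseLeniently(String xml, boolean validating)
            throws ParserConfigurationException, SAXException, IOException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(validating);
        SAXParser parser = factory.newSAXParser();
        LenientHandler handler = new LenientHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.problems;
    }
}
```

In real use you would add your own startElement and characters overrides to LenientHandler, and probably log the collected problems rather than discard them.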

answered Nov 10 '22 02:11 by Dov Wasserman