
How to parse badly formed XML in Java?

I have XML that I need to parse but have no control over the creation of. Unfortunately it's not very strict XML and contains things like:

<mytag>This won't parse & contains an ampersand.</mytag>

The javax.xml.stream classes don't like this at all, and rightly error with:

javax.xml.stream.XMLStreamException: ParseError at [row,col]:[149,50]
Message: The entity name must immediately follow the '&' in the entity reference.

How can I work around this? I can't change the XML, so I guess I need an error-tolerant parser.

My preference would be for a fix that doesn't require too much disruption to the existing parser code.

izb asked May 28 '09 11:05



2 Answers

Use an error-tolerant parser library such as JTidy (the Java port of HTML Tidy) or TagSoup.

TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short.

alamar answered Sep 30 '22 23:09


If it's not well-formed XML (like the above) then no conforming XML parser will handle it (as you've identified). If you know the scope of the errors (such as the bare ampersand above), then the simplest solution may be to run a correcting pass over the input first (for example, replacing each bare & with &amp;) and then feed the result to your existing parser.
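A correcting pass of that kind could be sketched as follows. This is only an illustration under assumptions: the class and method names (AmpersandFixer, escapeBareAmpersands, readText) are made up for the example, the whole document is assumed to fit in a String, and the only defect handled is an ampersand that does not begin an entity or character reference.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: repairs bare ampersands, then hands the text to the usual StAX parser.
class AmpersandFixer {
    // Matches '&' that does NOT start a named (&amp;), decimal (&#38;) or hex (&#x26;) reference.
    private static final Pattern BARE_AMPERSAND =
        Pattern.compile("&(?!(?:[A-Za-z_][A-Za-z0-9._-]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Replaces every bare '&' with the escaped form "&amp;".
    static String escapeBareAmpersands(String xml) {
        return BARE_AMPERSAND.matcher(xml).replaceAll("&amp;");
    }

    // Parses the repaired text with javax.xml.stream and concatenates all character data.
    static String readText(String xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
            .createXMLStreamReader(new StringReader(escapeBareAmpersands(xml)));
        StringBuilder text = new StringBuilder();
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.CHARACTERS) {
                text.append(reader.getText());
            }
        }
        reader.close();
        return text.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(readText("<mytag>This won't parse & contains an ampersand.</mytag>"));
    }
}
```

The existing parser code stays as it is; only the input passed to createXMLStreamReader changes. Two limits are worth keeping in mind: a bare & inside a CDATA section is legal and would be altered by this pass, and text that merely looks like a reference to an undeclared entity (for example &nbsp;) is left alone and will still be rejected by the parser.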

Otherwise you'll have to code one yourself with built-in support for such anomalies. And I can't believe that's anything other than a tedious and error-prone task.

Brian Agnew answered Sep 30 '22 22:09