How to tell Java SAX Parser to ignore invalid character references?

When trying to parse incorrect XML containing a character reference such as &#x1;, Java's SAX parser dies a horrible death with a fatal error such as:

    org.xml.sax.SAXParseException: Character reference "&#x1"
                                   is an invalid XML character.

Is there any way around this? Will I have to clean up the XML file before I hand it off to the SAX Parser? If so, is there an elegant way of going about this?

Epaga asked Jun 08 '10 12:06


2 Answers

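Use XML 1.1! skaffman is completely right about XML 1.0, but XML 1.1 allows character references to the control characters U+0001 through U+001F (U+0000 is still forbidden), so you can just stick <?xml version="1.1"?> at the top of your files and you'll be in good shape. If you're dealing with streams, write a wrapper that rewrites or adds that XML declaration.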
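
A minimal sketch of such a wrapper, assuming the input is already available as a Reader (so any encoding attribute in the old declaration no longer matters) and that an existing declaration, if present, ends within the first 256 characters. The class name Xml11Declaration, the method withXml11Declaration and the LOOKAHEAD limit are made up for illustration; byte order marks and other edge cases are not handled.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: forces an XML 1.1 declaration onto a character stream. */
class Xml11Declaration {
    private static final String DECL = "<?xml version=\"1.1\"?>";
    // Assumption: an existing declaration ends within this many characters.
    private static final int LOOKAHEAD = 256;

    static Reader withXml11Declaration(Reader in) throws IOException {
        // The pushback buffer must hold the new declaration plus everything read ahead.
        PushbackReader pb = new PushbackReader(in, LOOKAHEAD + DECL.length());
        char[] buf = new char[LOOKAHEAD];
        int n = 0;
        while (n < LOOKAHEAD) {
            int r = pb.read(buf, n, LOOKAHEAD - n);
            if (r < 0) break; // end of input
            n += r;
        }
        String head = new String(buf, 0, n);
        String rest = head;
        // "<?xml" followed by whitespace is a declaration; "<?xml-stylesheet" is not.
        boolean hasDecl = head.startsWith("<?xml") && head.length() > 5
                && Character.isWhitespace(head.charAt(5));
        if (hasDecl) {
            int end = head.indexOf("?>");
            if (end >= 0) rest = head.substring(end + 2); // drop the old declaration
        }
        // Pushed-back characters are returned first by subsequent reads.
        pb.unread((DECL + rest).toCharArray());
        return pb;
    }

    public static void main(String[] args) throws Exception {
        Reader xml = withXml11Declaration(
                new StringReader("<?xml version=\"1.0\"?><a>&#x1;</a>"));
        StringBuilder text = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(new InputSource(xml),
                new DefaultHandler() {
                    @Override
                    public void characters(char[] ch, int start, int length) {
                        text.append(ch, start, length);
                    }
                });
        System.out.println("parsed " + text.length() + " character(s) of text");
    }
}
```

Working on a Reader rather than an InputStream keeps the sketch short, at the cost of having to decode the bytes yourself before handing them to the wrapper.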

wowest answered Sep 24 '22 12:09


You're going to have to clean up your XML, I'm afraid. Such characters are invalid according to the XML spec, and no amount of persuasion is going to convince the parser otherwise.

Valid XML characters for XML 1.0:

  • U+0009
  • U+000A
  • U+000D
  • U+0020 to U+D7FF
  • U+E000 to U+FFFD
  • U+10000 to U+10FFFF

In order to clean up, you'll have to pass the data through a lower-level processor, which treats it as a Unicode character stream and removes the characters (and any numeric character references to them) that are invalid.
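
A rough sketch of such a clean-up step, assuming the whole document fits in memory as a String. The class name XmlCleaner and its methods are hypothetical. It drops literal characters outside the XML 1.0 ranges listed above, then drops numeric character references that point at such characters. The regular expression does not know about CDATA sections or comments, where something like &#x1; is plain text, so it will strip those occurrences too.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: removes content that is not allowed in XML 1.0. */
class XmlCleaner {
    // Matches hexadecimal (&#x1;) and decimal (&#1;) character references.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:x([0-9A-Fa-f]+)|([0-9]+));");

    /** True if the code point is in the XML 1.0 Char production. */
    static boolean isValidXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String stripInvalidXml10(String xml) {
        // Pass 1: drop literal invalid characters (including unpaired surrogates).
        StringBuilder sb = new StringBuilder(xml.length());
        xml.codePoints().filter(XmlCleaner::isValidXml10Char).forEach(sb::appendCodePoint);

        // Pass 2: drop character references whose target is invalid (requires Java 9+).
        return NUMERIC_REF.matcher(sb).replaceAll(ref -> {
            int cp;
            try {
                cp = ref.group(1) != null
                        ? Integer.parseInt(ref.group(1), 16)
                        : Integer.parseInt(ref.group(2), 10);
            } catch (NumberFormatException tooLarge) {
                cp = -1; // a number that does not fit in an int cannot be a valid code point
            }
            return isValidXml10Char(cp) ? Matcher.quoteReplacement(ref.group()) : "";
        });
    }

    public static void main(String[] args) {
        System.out.println(stripInvalidXml10("<a>x&#x1;y&#65;z</a>"));
    }
}
```

For input too large to hold in memory, the same two checks would need to move into a streaming filter (for example a FilterReader subclass), which is more work because a reference can straddle a buffer boundary.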

skaffman answered Sep 25 '22 12:09