xml parse error on illegal character

Tags: xml, encoding

SO, I am asking as a last resort, as I am completely out of ideas.

I have a Windows ASP.NET ASMX web services app that returns a serialized Person object with fields such as name, address and email.

Some attributes in the XML are encoded very strangely, for instance &#x1A; (I don't know where the encoding takes place; I assume it happens in the serialization process).

Googling those characters suggests that it is "Windows-1252" encoding.

The problem occurs when the XML is parsed: I get a parse error of "invalid unicode character" at the position of the 1252-encoded character.

How can I successfully parse it? What solutions do you suggest?

bushman asked Jun 28 '10 23:06


1 Answer

The parser is correct; whatever produced the serialisation is wrong. As with most of the C0 control characters, it is invalid (actually, worse than that: not well-formed) to put a U+001A SUBSTITUTE into an XML 1.0 file(*), even if encoded as a character reference such as &#x1A;.
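To see the rejection for yourself, here is a minimal sketch using Python's standard-library parser (the Person element and its name attribute are made up for illustration, not taken from the asker's service). It only shows that a conforming parser refuses the character reference, not how the ASMX serialiser produced it.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the standard-library parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # Raised for any well-formedness error, including a character
        # reference to a code point that XML 1.0 does not allow.
        return False


# Hypothetical payloads; the second one carries a U+001A character reference.
good = '<Person name="Bob"/>'
bad = '<Person name="Bob&#x1A;"/>'

print(is_well_formed(good))
print(is_well_formed(bad))
```

The second document fails even though the raw byte 0x1A never appears in it: the reference itself is what the XML 1.0 grammar forbids.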

No conforming XML parser will read this, nor should it. Whilst you could put some horrific hack in to try to filter out &#x1A; sequences before passing them to the parser, such crude hacks wouldn't work for the general case. The serialiser should be fixed to stop producing them.
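If you really cannot touch the producer, the kind of pre-filter described above might look roughly like this. It is a sketch of the hack, not a recommendation: the function names are my own, and because it works on raw text it will also strip matching sequences inside CDATA sections and comments, where they are ordinary text. That is exactly why it fails in the general case.

```python
import re
import xml.etree.ElementTree as ET

# Matches numeric character references such as &#26; or &#x1A;
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def is_xml10_char(cp):
    """True if code point cp matches the XML 1.0 'Char' production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def strip_invalid_char_refs(xml_text):
    """Drop character references that point at code points XML 1.0 forbids."""
    def _replace(match):
        hex_digits, dec_digits = match.groups()
        cp = int(hex_digits, 16) if hex_digits is not None else int(dec_digits)
        return match.group(0) if is_xml10_char(cp) else ""
    return _CHAR_REF.sub(_replace, xml_text)


# Hypothetical payload: one illegal reference, one perfectly legal one.
raw = '<Person name="Bob&#x1A;" city="K&#xF6;ln"/>'
person = ET.fromstring(strip_invalid_char_refs(raw))
print(person.get("name"), person.get("city"))
```

Legal references such as &#xF6; are left alone, so only the offending ones disappear. The data loss is silent, though, which is another reason to fix the serialiser instead.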

Actually I have no idea how the character (often used to mark end-of-file in ancient horrible operating systems) would get into the dataset used by an ASP.NET app, but it wouldn't seem to play any valid role in a name, address or e-mail. Perhaps really you need to be looking at cleaning your data.
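For the data-cleaning side, something along these lines could help you find out which fields carry the stray control characters before they ever reach the serialiser. The record layout (a dict with name, address and email) is an assumption based on the question, and the helper names are hypothetical.

```python
import re

# C0 controls that XML 1.0 forbids: everything below U+0020
# except tab (U+0009), line feed (U+000A) and carriage return (U+000D).
_FORBIDDEN = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")


def find_control_chars(record):
    """Map each affected field to a list of (index, code point) hits."""
    report = {}
    for field, value in record.items():
        hits = [
            (m.start(), f"U+{ord(m.group()):04X}")
            for m in _FORBIDDEN.finditer(value)
        ]
        if hits:
            report[field] = hits
    return report


def clean_record(record):
    """Return a copy of record with the forbidden controls removed."""
    return {field: _FORBIDDEN.sub("", value) for field, value in record.items()}


# Hypothetical record with a trailing U+001A in the address.
person = {"name": "Ann", "address": "1 Main St\x1a", "email": "ann@example.com"}
print(find_control_chars(person))
print(clean_record(person)["address"])
```

Running the report over the whole dataset first tells you whether this is one bad import or a systematic problem, which is more useful than blindly stripping.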

(*: It would be legal if encoded as a character reference in an XML 1.1 document. If you absolutely must round-trip control characters through XML, you will have to use XML 1.1. Though that may lead to compatibility issues with older XML parsers, and you still can't use the U+0000 NULL character, so you're never going to be completely binary-safe.)
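As a rough illustration of the XML 1.1 route, the sketch below escapes text so that the XML 1.1 restricted characters go out as character references and U+0000 is refused outright. The function name and the sample element are my own, this is not what the .NET serialiser does, and it does not check for surrogates or U+FFFE/U+FFFF. Python's own standard-library parser only handles XML 1.0, so the output is meant for a consumer that really supports 1.1.

```python
XML11_DECL = '<?xml version="1.1" encoding="UTF-8"?>'


def is_xml11_restricted(cp):
    """True for code points XML 1.1 allows only as character references."""
    return (
        0x1 <= cp <= 0x8
        or 0xB <= cp <= 0xC
        or 0xE <= cp <= 0x1F
        or 0x7F <= cp <= 0x84
        or 0x86 <= cp <= 0x9F
    )


def xml11_text(value):
    """Escape value for use as element content in an XML 1.1 document."""
    out = []
    for ch in value:
        cp = ord(ch)
        if cp == 0:
            # U+0000 is forbidden in XML 1.1 in every form.
            raise ValueError("U+0000 cannot be represented in XML 1.1")
        if ch == "&":
            out.append("&amp;")
        elif ch == "<":
            out.append("&lt;")
        elif ch == ">":
            out.append("&gt;")
        elif is_xml11_restricted(cp):
            out.append(f"&#x{cp:X};")
        else:
            out.append(ch)
    return "".join(out)


# Hypothetical element carrying a U+001A in its text.
doc = f"{XML11_DECL}<name>{xml11_text('Bob' + chr(0x1A))}</name>"
print(doc)
```

The ValueError for U+0000 is the "never completely binary-safe" caveat in code form: for arbitrary binary data you would need something like Base64 rather than character references.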

bobince answered Oct 27 '22 07:10