Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does en-dash (–) trigger illegal XML character error (C#/SSMS)?

This is not a question on how to overcome the "XML parsing: ... illegal xml character" error, but about why it is happening? I know that there are fixes(1, 2, 3), but need to know where the problem arises from before choosing the best solution (what causes the error under the hood?).

We are calling a Java-based webservice using C#. From the strongly-typed data returned, we are creating an XML file that will be passed to SQL Server. The webservice data is encoding using UTF-8, so in C# we create the file, and specify UTF-8 where appropriate:

var encodingType = Encoding.UTF8;
// logic removed...
var xdoc = new XDocument();
xdoc.Declaration = new XDeclaration("1.0", encodingType.WebName, "yes");
// logic removed...
System.IO.File.WriteAllText(xmlFullPath, xdoc.Declaration.ToString() + xdoc.Document.ToString(), encodingType);

This creates an XML file on disk that has contains the following (abbreviated) data:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
  <r RecordName="Option - Foo" />
  <r RecordName="Option – Bar" />
</records>

Notice that in the second record, - is different to . I believe the second instance is en-dash.

If I open that XML file in Firefox/IE/VS2015. it opens without error. The W3C XML validator also works fine. But, SSMS 2012 does not like it:

declare @xml XML = '<?xml version="1.0" encoding="utf-8" standalone="yes"?><records>
  <r RecordName="Option - Foo" />
  <r RecordName="Option – Bar" />
</records>';

XML parsing: line 3, character 25, illegal xml character

So why does en-dash cause the error? From my research, it would appear that

...only a few entities that need escaping: <,>,\,' and & in both HTML and XML. Source

...of which en-dash is not one. An encoded version (replacing with &#8211;) works fine.

UPDATE

Based on the input, people state that en-dash isn't recognised as UTF-8, but yet it is listed here http://www.fileformat.info/info/unicode/char/2013/index.htm So, as a perfectly legal character, why won't SSMS read it when passed as XML (using UTF-8 OR UTF-16)?

like image 714
EvilDr Avatar asked Apr 04 '16 12:04

EvilDr


2 Answers

Please permit me to answer my own question, for the purpose of me understanding it fully myself. I won't accept this as the answer; it is the combination of the other answers that lead me here. If this answer helps you in the future, please upvote the other posts also.

The basic underlying rule is that XML with Unicode characters should be passed to, and parsed as, Unicode by SQL Server. Therefore C# should generate XML as UTF-16; the SSMS and .Net default.

Cause of original problem

This variable declares XML with UTF-8 encoding, but the entity en-dash cannot be used without being encoded in UTF-8. This is wrong:

DECLARE @badxml xml = '<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
  <r RecordName="Option – Bar" />
</records>';

XML parsing: line 3, character 29, illegal xml character

Another approach that doesn't work is to switch UTF-8 to UTF-16 in the XML. The string here is not unicode, so the implicit conversion fails:

DECLARE @xml xml = '<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<records>
  <r RecordName="Option – Bar" />
</records>';

XML parsing: line 1, character 56, unable to switch the encoding

Solutions

Alternatives that work are:

1) Leave as UTF-8 but encode with hexadecimal on the entity (reference):

DECLARE @xml xml = '<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
  <r RecordName="Option &#x2013; Bar" />
</records>';

2) As above but with decimal encoding on the entity (reference):

DECLARE @xml xml = '<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
  <r RecordName="Option &#8211; Bar" />
</records>';

3) Include the original entity, but remove UTF-8 encoding in declaration (SSMS then applies UTF-16; its default):

DECLARE @xml xml = '<?xml version="1.0" standalone="yes"?>
<records>
  <r RecordName="Option – Bar" />
</records>';

4) Retain the UTF-16 declaration, but cast the XML to Unicode (note the preceding N before casting as XML):

DECLARE @xml xml = N'<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<records>
  <r RecordName="Option – Bar" />
</records>';
like image 194
EvilDr Avatar answered Sep 20 '22 18:09

EvilDr


Can you modify the XML encoding declaration? If so;

declare @xml XML = N'<?xml version="1.0" encoding="utf-16" standalone="yes"?><records>
  <r RecordName="Option - Foo" />
  <r RecordName="Option – Bar" />
</records>';

select @xml

(No column name)
<records><r RecordName="Option - Foo" /><r RecordName="Option – Bar" /></records>

Speculative Edit

Both of these fail with illegal xml character:

set @xml = '<?xml version="1.0" encoding="utf-8"?><x> – </x>'
set @xml = '<?xml version="1.0" encoding="utf-16"?><x> – </x>'

because they pass a non-unicode varchar to the XML parser; the string contains Unicode so must be treated as such, i.e. as an nvarchar (utf-16) (otherwise the 3 bytes comprising the are misinterpreted as multiple characters and one or more is not in the acceptable range for XML)

This does pass a nvarchar string to the parser, but fails with unable to switch the encoding:

set @xml = N'<?xml version="1.0" encoding="utf-8"?><x> – </x>'

This is because an nvarchar (utf-16) string is passed to the XML parser but the XML document states its utf-8 and the is not equivalent in the two encodings

This works as everything is utf-16

set @xml = N'<?xml version="1.0" encoding="utf-16"?><x> – </x>'
like image 43
Alex K. Avatar answered Sep 21 '22 18:09

Alex K.