
How should the '\t' character be handled within XML attribute values?

Tags:

c#

.net

xml

I seem to have found an inconsistency between the various XML implementations within .NET 3.5, and I'm struggling to work out which is nominally correct.

The issue is actually fairly easy to reproduce:

  1. Create a simple xml document with a text element containing '\t' characters and give it an attribute that contains '\t' characters:

    var xmlDoc = new XmlDocument { PreserveWhitespace = false, };
    xmlDoc.LoadXml("<test><text attrib=\"Tab'\t'space' '\">Tab'\t'space' '</text></test>");
    xmlDoc.Save(@"d:\TabTest.xml");
    

    NB: This means that XmlDocument itself is quite happy with '\t' characters in an attribute value.

  2. Load the document using XmlReader.Create:

    var rawFile = XmlReader.Create(@"D:\TabTest.xml");
    var rawDoc = new XmlDocument();
    rawDoc.Load(rawFile);
    
  3. Load the document using new XmlTextReader:

    var rawFile2 = new XmlTextReader(@"D:\TabTest.xml");
    var rawDoc2 = new XmlDocument();
    rawDoc2.Load(rawFile2);
    
  4. Compare the documents in the debugger:

    (rawDoc).InnerXml   "<test><text attrib=\"Tab' 'space' '\">Tab'\t'space' '</text></test>"   string
    (rawDoc2).InnerXml  "<test><text attrib=\"Tab'\t'space' '\">Tab'\t'space' '</text></test>"  string
    

The document read using new XmlTextReader (rawDoc2) was what I expected: the '\t' was present in both the text value and the attribute value. However, in the document read by XmlReader.Create (rawDoc) the '\t' character in the attribute value has been converted into a ' ' character.

What the....!! :-)

After a bit of a Google search I found that I could encode a '\t' as '&#x9;' - if I used this instead of '\t' in the example XML both readers work as expected.
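
The comparison can also be reproduced without touching the disk. This is a minimal sketch rather than part of the original test: the class and method names (ReaderComparison, ViaCreate, ViaTextReader) are made up for illustration, the XML is cut down to a single element, and it assumes both readers behave on in-memory strings as they did on the files above.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper class, for illustration only.
public static class ReaderComparison
{
    // Loads the XML through XmlReader.Create and returns the "attrib" value.
    public static string ViaCreate(string xml)
    {
        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            return LoadAttrib(reader);
        }
    }

    // Loads the XML through the older XmlTextReader and returns the "attrib" value.
    public static string ViaTextReader(string xml)
    {
        using (var reader = new XmlTextReader(new StringReader(xml)))
        {
            return LoadAttrib(reader);
        }
    }

    // Shared step: build an XmlDocument from the reader and read the root attribute.
    private static string LoadAttrib(XmlReader reader)
    {
        var doc = new XmlDocument();
        doc.Load(reader);
        return doc.DocumentElement.GetAttribute("attrib");
    }

    public static void Main()
    {
        string literal = "<text attrib=\"Tab\tspace\"/>";     // raw tab character in the attribute
        string escaped = "<text attrib=\"Tab&#x9;space\"/>";  // tab written as a character reference

        // Based on the behaviour described above, each comparison should hold.
        Console.WriteLine(ViaCreate(literal) == "Tab space");       // tab replaced by a space
        Console.WriteLine(ViaTextReader(literal) == "Tab\tspace");  // tab kept
        Console.WriteLine(ViaCreate(escaped) == "Tab\tspace");      // tab kept
        Console.WriteLine(ViaTextReader(escaped) == "Tab\tspace");  // tab kept
    }
}
```

Only the combination of a raw tab and XmlReader.Create loses the character; the escaped form survives both readers.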

Altova XmlSpy and various other XML readers seem to be perfectly happy with '\t' characters in attribute values. My question is: what is the correct way to handle this?

Should I be writing XML files with '\t' characters encoded in attribute values, as XmlReader.Create expects, or are the other XML tools right that '\t' characters are valid and XmlReader.Create is broken?

Which way should I go to fix/work around this issue?

SteveH asked Sep 04 '12 09:09



2 Answers

Probably something to do with Attribute Value Normalization. For CDATA attributes an XML parser is required to replace newlines and tabs in attribute values by spaces, unless they are written in escaped form as character references.
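
One way to see that this is normalization at work, rather than a bug in one reader, is to switch it on in the reader that does not apply it by default. The sketch below is an illustration under assumptions: the NormalizationToggle class and its ReadAttrib method are hypothetical names, and it relies on the XmlTextReader.Normalization property, which defaults to false.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper class, for illustration only.
public static class NormalizationToggle
{
    // Reads the root element's "attrib" value with XmlTextReader,
    // with attribute value normalization switched on or off.
    public static string ReadAttrib(string xml, bool normalize)
    {
        using (var reader = new XmlTextReader(new StringReader(xml)))
        {
            reader.Normalization = normalize;  // must be set before the element is read
            reader.MoveToContent();            // position the reader on the root element
            return reader.GetAttribute("attrib");
        }
    }

    public static void Main()
    {
        string xml = "<text attrib=\"Tab\tspace\"/>";  // raw tab character in the attribute

        // Default behaviour: the raw tab is passed through unchanged.
        Console.WriteLine(ReadAttrib(xml, false) == "Tab\tspace");

        // With normalization enabled the tab should come back as a space,
        // matching what XmlReader.Create does.
        Console.WriteLine(ReadAttrib(xml, true) == "Tab space");
    }
}
```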

Michael Kay answered Oct 29 '22 14:10



@all: Thanks for all your answers and comments.

It would seem that Justin and Michael Kay are correct: according to the W3C XML specification, white space such as tabs in attribute values should be encoded as character references, and the issue is that a significant number of the MS implementations do not honour this requirement.

In my case, XML specification aside, all I really want is for the attribute values to be correctly persisted - i.e. the values saved should be exactly the values read.

The answer to that is to force the use of an XmlWriter created with the XmlWriter.Create method when saving the XML files in the first place.

While both DataSet and XmlDocument provide save/write mechanisms, neither of them correctly encodes white space in attributes when used in its default form. If I force them to use a manually created XmlWriter, however, the correct encoding is applied and written to the file.

So the original file save code becomes:

    var xmlDoc = new XmlDocument { PreserveWhitespace = false, };
    xmlDoc.LoadXml("<test><text attrib=\"Tab'\t'space' '\">Tab'\t'space' '</text></test>");

    using (var xmlWriter = XmlWriter.Create(@"d:\TabTest.Encoded.xml"))
    {
        xmlDoc.Save(xmlWriter);
    }

This writer encodes the white space symmetrically, so the XmlReader.Create reader can read it back without altering the attribute values.

The other thing to note here is that this solution hides the encoding from my code entirely, as the reader and writer perform the encoding and decoding transparently on read and write.
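
For completeness, here is a minimal in-memory sketch of that round trip. It is an illustration only: the TabRoundTrip class and its Serialize and ReadBack methods are hypothetical names, the document is built with SetAttribute instead of being loaded from the original test string, and OmitXmlDeclaration is set purely to keep the output short.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper class, for illustration only.
public static class TabRoundTrip
{
    // Builds <test><text attrib="..."/></test> and saves it through XmlWriter.Create.
    public static string Serialize(string attributeValue)
    {
        var doc = new XmlDocument();
        doc.LoadXml("<test><text/></test>");
        var text = (XmlElement)doc.DocumentElement.FirstChild;
        text.SetAttribute("attrib", attributeValue);

        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        var buffer = new StringWriter();
        using (var writer = XmlWriter.Create(buffer, settings))
        {
            doc.Save(writer);  // the XmlWriter handles escaping of the attribute value
        }
        return buffer.ToString();
    }

    // Loads the XML through XmlReader.Create and returns the "attrib" value.
    public static string ReadBack(string xml)
    {
        var doc = new XmlDocument();
        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            doc.Load(reader);
        }
        return ((XmlElement)doc.DocumentElement.FirstChild).GetAttribute("attrib");
    }

    public static void Main()
    {
        string original = "Tab\tspace";
        string xml = Serialize(original);

        Console.WriteLine(xml);                        // the tab should appear as &#x9;
        Console.WriteLine(ReadBack(xml) == original);  // value survives the round trip
    }
}
```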

SteveH answered Oct 29 '22 16:10
