The System.Xml parsing features had a few surprises in store for me, and I wonder how the following should be interpreted, or whether this is "up to the implementation":
Version 1:
<root><elem>
<![CDATA[MyValue]]>
</elem></root>
Version 2:
<root><elem>
-<![CDATA[MyValue]]>-
</elem></root>
What should be the value of elem? Or is it okay that this depends on the implementation that parses it, and should I just deal with that?
I expected (at first) that in both cases all whitespace between the start/end node and the first non-whitespace character would be ignored. This is not the case, but failing that, I would've at least expected it to never be ignored, but this is also not the case. See full repro below for my expectations.
To elaborate...
Two cases had me stumped when I tested them:
XDocument.Parse will suddenly start to include the \n\t whitespace in example 2, whereas it ignored it in example 1.
XDocument.Load with new XmlReaderSettings {IgnoreWhitespace = true} will behave similarly.
What gives? Is this just the implementation being (to my taste) quirky, and/or is this specified behavior?
Here's a full repro of my expectations (fresh C# class library project with latest NUnit package from NuGet):
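using System.IO;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using NUnit.Framework;

[TestFixture]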
public class XmlTests
{
public static XDocument ParseDocument(string input)
{
return XDocument.Parse(input);
}
public static XDocument LoadDocument(Stream stream)
{
var xmlReader = XmlReader.Create(stream, new XmlReaderSettings() { IgnoreWhitespace = false }); // Default
return XDocument.Load(xmlReader);
}
public static XDocument LoadDocument_IgnoreWhitespace(Stream stream)
{
var xmlReader = XmlReader.Create(stream, new XmlReaderSettings() { IgnoreWhitespace = true });
return XDocument.Load(xmlReader);
}
const string example1 = "<root><elem>\n\t<![CDATA[MyValue]]>\n</elem></root>";
const string example2 = "<root><elem>\n\t-<![CDATA[MyValue]]>-\n</elem></root>";
[Test]
public void A_Parsing_Example1_WorksAsExpected()
{
var doc = ParseDocument(example1);
var element = doc.Descendants("elem").Single();
Assert.That(element.Value, Is.EqualTo("MyValue"));
}
[Test]
public void B_Loading_Example1_WorksAsExpected()
{
var doc = LoadDocument(new MemoryStream(Encoding.UTF8.GetBytes(example1)));
var element = doc.Descendants("elem").Single();
Assert.That(element.Value, Is.EqualTo("\n\tMyValue\n"));
}
[Test]
public void C_LoadingWithIgnoreWhitespace_Example1_WorksAsExpected()
{
var doc = LoadDocument_IgnoreWhitespace(new MemoryStream(Encoding.UTF8.GetBytes(example1)));
var element = doc.Descendants("elem").Single();
Assert.That(element.Value, Is.EqualTo("MyValue"));
}
[Test]
public void D_Parsing_Example2_WorksAsExpected()
{
var doc = ParseDocument(example2);
var element = doc.Descendants("elem").Single();
Assert.That(element.Value, Is.EqualTo("-MyValue-"));
}
[Test]
public void E_Loading_Example2_WorksAsExpected()
{
var doc = LoadDocument(new MemoryStream(Encoding.UTF8.GetBytes(example2)));
var element = doc.Descendants("elem").Single();
Assert.That(element.Value, Is.EqualTo("\n\t-MyValue-\n"));
}
[Test]
public void F_LoadingWithIgnoreWhitespace_Example2_WorksAsExpected()
{
var doc = LoadDocument_IgnoreWhitespace(new MemoryStream(Encoding.UTF8.GetBytes(example2)));
var element = doc.Descendants("elem").Single();
Assert.That(element.Value, Is.EqualTo("MyValue"));
}
}
CDATA sections are tricky. A parser does not change their content on read. They cannot contain invalid characters or the sequence ]]>. However, some implementations will alter them on write in order to generate valid XML output.
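As a rough sketch of that write-side behavior, assuming the LINQ to XML serializer in .NET (the CDataRoundTrip class and RoundTrip method are made-up names for illustration): a CDATA node whose text contains ]]> cannot be emitted as a single section, so the serialized markup may differ from what was put in, while the value should survive a round trip.

```csharp
using System;
using System.Xml.Linq;

public static class CDataRoundTrip
{
    // Hypothetical helper: serializes an element holding one CDATA node,
    // then parses the markup back and returns both the markup and the value.
    public static (string Xml, string Value) RoundTrip(string text)
    {
        var elem = new XElement("elem", new XCData(text));
        string xml = elem.ToString(SaveOptions.DisableFormatting);
        string value = XElement.Parse(xml).Value;
        return (xml, value);
    }

    public static void Main()
    {
        var (xml, value) = RoundTrip("a]]>b");
        Console.WriteLine(xml);              // the markup the writer chose to emit
        Console.WriteLine(value == "a]]>b"); // whether the text survived the round trip
    }
}
```

The point is only that the written markup is the writer's business; what you can rely on is the value you read back.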
The content of elem depends on the parser and on whether it ignores whitespace nodes. In your first example, elem has 3 child nodes:
\n\t (whitespace text node)
MyValue (CDATA section)
\n (whitespace text node)
So, as you noticed, if the whitespace nodes are ignored, only the CDATA remains. In your second example the result is different:
\n\t- (text node)
MyValue (CDATA section)
-\n (text node)
The first and third nodes now have non-whitespace content (the -). They are no longer whitespace nodes, so the ignore-whitespace option does not apply to them and they are kept in full, leading and trailing whitespace included.
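To see those child nodes for yourself, here is a small sketch that lists them for both examples, with and without LoadOptions.PreserveWhitespace. The ElemNodes class and its Describe and Escape helpers are names I made up for illustration; the LINQ to XML calls are the standard ones.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class ElemNodes
{
    // Hypothetical helper: lists the child nodes of <elem> as "NodeType:value",
    // with newlines and tabs escaped so they are visible in the output.
    public static string[] Describe(string xml, LoadOptions options)
    {
        return XDocument.Parse(xml, options)
            .Descendants("elem").Single()
            .Nodes()
            .Select(n => n.NodeType + ":" + Escape((n is XText t) ? t.Value : n.ToString()))
            .ToArray();
    }

    private static string Escape(string s) => s.Replace("\n", "\\n").Replace("\t", "\\t");

    public static void Main()
    {
        const string example1 = "<root><elem>\n\t<![CDATA[MyValue]]>\n</elem></root>";
        const string example2 = "<root><elem>\n\t-<![CDATA[MyValue]]>-\n</elem></root>";

        foreach (var xml in new[] { example1, example2 })
            foreach (var options in new[] { LoadOptions.None, LoadOptions.PreserveWhitespace })
                Console.WriteLine(options + ": " + string.Join(" | ", Describe(xml, options)));
    }
}
```

As far as I can tell, example 1 with LoadOptions.None should list only the CDATA node, while example 2 should list all three nodes under either option, because its text nodes are not whitespace-only. If you only ever want the CDATA content, something like elem.Nodes().OfType<XCData>() sidesteps the whitespace question entirely.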