The System.Xml parsing features had a few surprises for me in store, and I wonder how the following should be interpreted, or if this is "up to the implementation":
Version 1:
<root><elem>
<![CDATA[MyValue]]>
</elem></root>
Version 2:
<root><elem>
-<![CDATA[MyValue]]>-
</elem></root>
What should be the value of elem? Or is it okay that this depends on the implementation that parses it, and should I just deal with that?
I expected (at first) that in both cases all whitespace between the start/end node and the first non-whitespace character would be ignored. This is not the case, but failing that, I would've at least expected it to never be ignored, but this is also not the case. See full repro below for my expectations.
To elaborate...
Two cases had me stumped when I tested them:
1. XDocument.Parse will suddenly start to include the \n\t whitespace in example 2, whereas it ignored it in example 1.
2. XDocument.Load with new XmlReaderSettings {IgnoreWhitespace = true} will behave similarly.

What gives? Is this just the implementation being (to my taste) quirky, and/or is this specified behavior?
Here's a full repro of my expectations (fresh C# class library project with latest NUnit package from NuGet):
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using NUnit.Framework;

[TestFixture]
public class XmlTests
{
    public static XDocument ParseDocument(string input)
    {
        return XDocument.Parse(input);
    }

    public static XDocument LoadDocument(Stream stream)
    {
        var xmlReader = XmlReader.Create(stream, new XmlReaderSettings() { IgnoreWhitespace = false }); // Default
        return XDocument.Load(xmlReader);
    }

    public static XDocument LoadDocument_IgnoreWhitespace(Stream stream)
    {
        var xmlReader = XmlReader.Create(stream, new XmlReaderSettings() { IgnoreWhitespace = true });
        return XDocument.Load(xmlReader);
    }

    const string example1 = "<root><elem>\n\t<![CDATA[MyValue]]>\n</elem></root>";
    const string example2 = "<root><elem>\n\t-<![CDATA[MyValue]]>-\n</elem></root>";

    [Test]
    public void A_Parsing_Example1_WorksAsExpected()
    {
        var doc = ParseDocument(example1);
        var element = doc.Descendants("elem").Single();
        Assert.That(element.Value, Is.EqualTo("MyValue"));
    }

    [Test]
    public void B_Loading_Example1_WorksAsExpected()
    {
        var doc = LoadDocument(new MemoryStream(Encoding.UTF8.GetBytes(example1)));
        var element = doc.Descendants("elem").Single();
        Assert.That(element.Value, Is.EqualTo("\n\tMyValue\n"));
    }

    [Test]
    public void C_LoadingWithIgnoreWhitespace_Example1_WorksAsExpected()
    {
        var doc = LoadDocument_IgnoreWhitespace(new MemoryStream(Encoding.UTF8.GetBytes(example1)));
        var element = doc.Descendants("elem").Single();
        Assert.That(element.Value, Is.EqualTo("MyValue"));
    }

    [Test]
    public void D_Parsing_Example2_WorksAsExpected()
    {
        var doc = ParseDocument(example2);
        var element = doc.Descendants("elem").Single();
        Assert.That(element.Value, Is.EqualTo("-MyValue-"));
    }

    [Test]
    public void E_Loading_Example2_WorksAsExpected()
    {
        var doc = LoadDocument(new MemoryStream(Encoding.UTF8.GetBytes(example2)));
        var element = doc.Descendants("elem").Single();
        Assert.That(element.Value, Is.EqualTo("\n\t-MyValue-\n"));
    }

    [Test]
    public void F_LoadingWithIgnoreWhitespace_Example2_WorksAsExpected()
    {
        var doc = LoadDocument_IgnoreWhitespace(new MemoryStream(Encoding.UTF8.GetBytes(example2)));
        var element = doc.Descendants("elem").Single();
        Assert.That(element.Value, Is.EqualTo("MyValue"));
    }
}
CDATA sections are tricky. A parser does not change them when reading. They are not allowed to contain invalid characters or the sequence ]]>. However, some implementations will change them on writing in order to generate valid XML output.
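To illustrate the read/write asymmetry, here is a minimal sketch using LINQ to XML. The class and method names (CDataRoundTrip, Serialize, ReadBack) are made up for this example, and it assumes the default writer behind XElement.ToString(), which as far as I know rewrites a CDATA section containing ]]> rather than emitting it verbatim. The exact serialized form may differ between writers, so the sketch only relies on the text surviving a round trip.

```csharp
using System.Xml.Linq;

public static class CDataRoundTrip
{
    // Writes the text as a CDATA section inside a hypothetical <elem> element.
    // Assumption: the writer behind ToString() alters the section if needed
    // so that the output stays well-formed.
    public static string Serialize(string text)
    {
        return new XElement("elem", new XCData(text)).ToString();
    }

    // Parses the markup again and returns the concatenated text of <elem>.
    public static string ReadBack(string xml)
    {
        return XElement.Parse(xml).Value;
    }
}
```

Calling CDataRoundTrip.ReadBack(CDataRoundTrip.Serialize("a]]>b")) should give back a]]>b, even though the markup in between is not a single untouched CDATA section.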
The content of elem depends on the parser and on whether it ignores whitespace nodes. In your first example, elem has 3 child nodes:
1. a text node: "\n\t"
2. a CDATA section: "MyValue"
3. a text node: "\n"
So, as you noticed, if the whitespace nodes are ignored, only the CDATA section remains. In your second example the result would be different (if repaired):
1. a text node: "\n\t-"
2. a CDATA section: "MyValue"
3. a text node: "-\n"
The first and third nodes now have non-whitespace content (the -). They are no longer whitespace nodes, so they are not dropped, whatever the whitespace option is set to.
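The child nodes of elem can be inspected directly with LINQ to XML. The following is a rough sketch rather than a definitive test: the ElemNodes class and its Describe method are names invented for this example, and it assumes XDocument.Parse with LoadOptions.None drops whitespace-only text while LoadOptions.PreserveWhitespace keeps it.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class ElemNodes
{
    // Returns one "NodeType:value" entry per text-like child node of <elem>,
    // with newlines and tabs shown as \n and \t so they are visible.
    public static string[] Describe(string xml, LoadOptions options)
    {
        var elem = XDocument.Parse(xml, options).Descendants("elem").Single();
        return elem.Nodes()
            .OfType<XText>() // XCData derives from XText, so CDATA sections are included
            .Select(n => n.NodeType + ":" + n.Value.Replace("\n", "\\n").Replace("\t", "\\t"))
            .ToArray();
    }
}
```

With example1 from the question, LoadOptions.PreserveWhitespace should yield three entries (Text, CDATA, Text), while LoadOptions.None should leave only the CDATA entry. With example2, both options should yield the same three entries, because the text nodes containing - are no longer whitespace-only.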