 

How should text nodes with CDATA and whitespace be interpreted in XML?

Tags: c#, xml

The System.Xml parsing features had a few surprises in store for me, and I wonder how the following should be interpreted, or whether this is "up to the implementation":

Version 1:

<root><elem>
    <![CDATA[MyValue]]>
</elem></root>

Version 2:

<root><elem>
    -<![CDATA[MyValue]]>-
</elem></root>

What should be the value of elem? Or is it okay that this depends on the implementation that parses it, and should I just deal with that?

I expected (at first) that in both cases all whitespace between the start/end tag and the first non-whitespace character would be ignored. This is not the case. Failing that, I would at least have expected whitespace to never be ignored, but this is not the case either. See the full repro below for my expectations.


To elaborate...

Two cases had me stumped when I tested them:

  • XDocument.Parse will suddenly start to include the \n\t whitespace in example 2, whereas it ignored it in example 1.
  • XDocument.Load with new XmlReaderSettings {IgnoreWhitespace = true} will behave similarly.

What gives? Is this just the implementation being (to my taste) quirky, and/or is this specified behavior?

Here's a full repro of my expectations (fresh C# class library project with latest NUnit package from NuGet):

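using System.IO;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using NUnit.Framework;

[TestFixture]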
public class XmlTests
{
    public static XDocument ParseDocument(string input)
    {
        return XDocument.Parse(input);
    }

    public static XDocument LoadDocument(Stream stream)
    {
        var xmlReader = XmlReader.Create(stream, new XmlReaderSettings() { IgnoreWhitespace = false }); // Default
        return XDocument.Load(xmlReader);
    }

    public static XDocument LoadDocument_IgnoreWhitespace(Stream stream)
    {
        var xmlReader = XmlReader.Create(stream, new XmlReaderSettings() { IgnoreWhitespace = true });
        return XDocument.Load(xmlReader);
    }

    const string example1 = "<root><elem>\n\t<![CDATA[MyValue]]>\n</elem></root>";
    const string example2 = "<root><elem>\n\t-<![CDATA[MyValue]]>-\n</elem></root>";

    [Test]
    public void A_Parsing_Example1_WorksAsExpected()
    {
        var doc = ParseDocument(example1);
        var element = doc.Descendants("elem").Single();
        Assert.That(element.Value, Is.EqualTo("MyValue"));
    }

    [Test]
    public void B_Loading_Example1_WorksAsExpected()
    {
        var doc = LoadDocument(new MemoryStream(Encoding.UTF8.GetBytes(example1)));
        var element = doc.Descendants("elem").Single();
        Assert.That(element.Value, Is.EqualTo("\n\tMyValue\n"));
    }

    [Test]
    public void C_LoadingWithIgnoreWhitespace_Example1_WorksAsExpected()
    {
        var doc = LoadDocument_IgnoreWhitespace(new MemoryStream(Encoding.UTF8.GetBytes(example1)));
        var element = doc.Descendants("elem").Single();
        Assert.That(element.Value, Is.EqualTo("MyValue"));
    }

    [Test]
    public void D_Parsing_Example2_WorksAsExpected()
    {
        var doc = ParseDocument(example2);
        var element = doc.Descendants("elem").Single();
        Assert.That(element.Value, Is.EqualTo("-MyValue-"));
    }

    [Test]
    public void E_Loading_Example2_WorksAsExpected()
    {
        var doc = LoadDocument(new MemoryStream(Encoding.UTF8.GetBytes(example2)));
        var element = doc.Descendants("elem").Single();
        Assert.That(element.Value, Is.EqualTo("\n\t-MyValue-\n"));
    }

    [Test]
    public void F_LoadingWithIgnoreWhitespace_Example2_WorksAsExpected()
    {
        var doc = LoadDocument_IgnoreWhitespace(new MemoryStream(Encoding.UTF8.GetBytes(example2)));
        var element = doc.Descendants("elem").Single();
        Assert.That(element.Value, Is.EqualTo("-MyValue-"));
    }
}
Jeroen asked Sep 30 '22 05:09



1 Answer

CDATA sections are tricky. A parser does not change their content when reading. They are not allowed to contain characters that are invalid in XML, nor the sequence ]]>. However, some implementations will change them when writing in order to generate valid XML output.

The content of elem depends on the parser and on whether it ignores whitespace nodes. In your first example, elem has 3 child nodes:

  1. whitespace text node with content "\n\t"
  2. cdata section node with content "MyValue"
  3. whitespace text node with content "\n"
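To see these nodes for yourself, a small helper along the following lines can list what an XmlReader reports between <elem> and </elem>. This is only a sketch: the class name ElemNodeDump and the method Describe are made up for illustration, and control characters are escaped purely so they show up in the output.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class ElemNodeDump
{
    // Returns one "NodeType:Value" entry for every node the reader
    // reports between <elem> and </elem>.
    public static List<string> Describe(string xml, bool ignoreWhitespace)
    {
        var settings = new XmlReaderSettings { IgnoreWhitespace = ignoreWhitespace };
        var result = new List<string>();

        using (var reader = XmlReader.Create(new StringReader(xml), settings))
        {
            bool insideElem = false;
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "elem")
                {
                    insideElem = true;
                }
                else if (reader.NodeType == XmlNodeType.EndElement && reader.Name == "elem")
                {
                    insideElem = false;
                }
                else if (insideElem)
                {
                    // Escape newlines and tabs so they are visible when printed.
                    var value = reader.Value.Replace("\n", "\\n").Replace("\t", "\\t");
                    result.Add(reader.NodeType + ":" + value);
                }
            }
        }

        return result;
    }
}
```

Calling Describe on the example1 string with ignoreWhitespace set to false should yield three entries (Whitespace, CDATA, Whitespace). With ignoreWhitespace set to true, only the CDATA entry should remain.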

So, as you noticed, if the whitespace nodes are ignored, only the CDATA section remains. In your second example the result is different:

  1. text node with content "\n\t-"
  2. cdata section node with content "MyValue"
  3. text node with content "-\n"

The first and third nodes now have non-whitespace content (the -). They are no longer whitespace nodes, so the whitespace option does not apply to them and they are not ignored.
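The effect on the element value can be checked with a small sketch like the one below. It assumes the same XDocument.Load(XmlReader) approach as the question; the class name ElemValueDemo and the method ElemValue are hypothetical names chosen for this illustration.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class ElemValueDemo
{
    // Loads the document through an XmlReader with the given whitespace
    // setting and returns the concatenated text content of <elem>.
    public static string ElemValue(string xml, bool ignoreWhitespace)
    {
        var settings = new XmlReaderSettings { IgnoreWhitespace = ignoreWhitespace };
        using (var reader = XmlReader.Create(new StringReader(xml), settings))
        {
            var doc = XDocument.Load(reader);
            return doc.Descendants("elem").Single().Value;
        }
    }
}
```

For the first example, ElemValue should return "MyValue" with ignoreWhitespace set to true and "\n\tMyValue\n" with it set to false. For the second example it should return "\n\t-MyValue-\n" in both cases, because the text nodes around the CDATA section contain a non-whitespace character and are therefore never dropped as whitespace.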

ThW answered Oct 04 '22 02:10
