
XDocument.Save() removes my &#xA; entities

I wrote a tool to repair some XML files (i.e., insert some attributes/values that were missing) using C# and LINQ to XML. The tool loads an existing XML file into an XDocument object. Then, it walks down through the nodes to insert the missing data. After that, it calls XDocument.Save() to save the changes out to another directory.

All of that works fine except for one thing: any &#xA; entities in the text of the XML file are replaced with a newline character. The entity represents a newline, of course, but I need to preserve the entity in the XML because another consumer needs it there.

Is there any way to save the modified XDocument without losing the &#xA; entities?

Thank you.

mahdaeng asked Jan 10 '12 23:01



1 Answer

The &#xA; entities are technically called “numeric character references” in XML, and they are resolved when the original document is loaded into the XDocument. This makes your issue hard to solve, since once the XDocument has been loaded there is no way of distinguishing resolved whitespace entities from insignificant whitespace (typically used for formatting XML documents for plain-text viewers). Thus, the approach below only applies if your document does not have any insignificant whitespace.
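To see the resolution happen, here is a minimal sketch (the class and method names are made up for illustration). It parses a one-element document containing &#xA; and shows that the loaded text holds a plain \n, and that re-serializing with the default writer does not bring the entity back:

```csharp
using System;
using System.Xml.Linq;

// Hypothetical demo class; not part of the original tool.
public static class EntityResolutionDemo
{
    private const string Xml = "<root>line1&#xA;line2</root>";

    // The parser resolves &#xA; on load, so the text holds a real '\n'.
    public static string LoadedText()
    {
        XDocument doc = XDocument.Parse(Xml);
        return doc.Root.Value;
    }

    // Re-serializing with the default writer emits a literal line break,
    // not the numeric character reference.
    public static string Reserialized()
    {
        XDocument doc = XDocument.Parse(Xml);
        return doc.ToString();
    }
}
```

Calling EntityResolutionDemo.LoadedText() should give back "line1\nline2", with no trace of the fact that the newline was originally written as an entity.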

The System.Xml library allows one to preserve whitespace entities by setting the NewLineHandling property of the XmlWriterSettings class to Entitize. However, within text nodes, this only entitizes \r to &#xD;, and not \n to &#xA;.
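A small sketch of that limitation, assuming a throwaway element named t and a hypothetical helper class. It writes a string through an XmlWriter created with NewLineHandling.Entitize and returns the markup so you can inspect what was and was not entitized:

```csharp
using System;
using System.Text;
using System.Xml;

// Hypothetical demo class; the element name "t" is arbitrary.
public static class NewLineHandlingDemo
{
    // Writes <t>text</t> with NewLineHandling.Entitize and returns the markup.
    public static string WriteEntitized(string text)
    {
        var settings = new XmlWriterSettings
        {
            NewLineHandling = NewLineHandling.Entitize,
            OmitXmlDeclaration = true
        };

        var output = new StringBuilder();
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            writer.WriteStartElement("t");
            writer.WriteString(text);   // text node content
            writer.WriteEndElement();
        }
        return output.ToString();
    }
}
```

For an input such as "a\rb\nc", the \r should come out as &#xD; while the \n is written as a raw line feed, which is why this setting alone does not solve the problem in the question.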

The easiest solution is to derive from the XmlWriter class and override its WriteString method to manually replace the whitespace characters with their numeric character entities. The WriteString method also happens to be the place where .NET entitizes characters that are not permitted to appear in text nodes, such as the syntax markers &, <, and >, which are respectively entitized to &amp;, &lt;, and &gt;.

Since XmlWriter is abstract, we shall derive from XmlTextWriter in order to avoid having to implement all the abstract methods of the former class. Here is a quick-and-dirty implementation:

using System.IO;
using System.Xml;

public class EntitizingXmlWriter : XmlTextWriter
{
    public EntitizingXmlWriter(TextWriter writer) :
        base(writer)
    { }

    public override void WriteString(string text)
    {
        foreach (char c in text)
        {
            switch (c)
            {
                case '\r':
                case '\n':
                case '\t':
                    base.WriteCharEntity(c);
                    break;
                default:
                    base.WriteString(c.ToString());
                    break;
            }
        }
    }
}

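If this is intended for use in a production environment, you’d want to do away with the c.ToString() part. It is very inefficient, and it also breaks on characters outside the Basic Multilingual Plane, because each half of a surrogate pair is passed to base.WriteString on its own and is rejected as invalid. You can fix both problems by batching substrings of the original text that do not contain any of the characters you want to entitize, and feeding each one into a single base.WriteString call.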

A word of warning: the following naive implementation will not work, since the base WriteString method would replace any & characters with &amp;, thereby causing \n to be written as &amp;#xA; (and likewise for \r and \t).

    public override void WriteString(string text)
    {
        text = text.Replace("\r", "&#xD;");
        text = text.Replace("\n", "&#xA;");
        text = text.Replace("\t", "&#x9;");
        base.WriteString(text);
    }

Finally, to save your XDocument into a destination file or stream, just use the following snippet:

using (var textWriter = new StreamWriter(destination))
using (var xmlWriter = new EntitizingXmlWriter(textWriter))
    document.Save(xmlWriter);
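For a quick end-to-end check, here is a self-contained sketch that writes to a StringWriter instead of a file. EntitizingXmlWriterSketch repeats the same approach under a different name so that the block compiles on its own; RoundTripDemo, Repair, and the "repaired" attribute are made-up stand-ins for whatever your tool actually inserts:

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

// Copy of the entitizing writer idea, using batched substrings.
public class EntitizingXmlWriterSketch : XmlTextWriter
{
    public EntitizingXmlWriterSketch(TextWriter writer) : base(writer) { }

    public override void WriteString(string text)
    {
        if (text == null) return;

        int start = 0; // start of the pending run of ordinary characters
        for (int curr = 0; curr < text.Length; ++curr)
        {
            char chr = text[curr];
            if (chr == '\r' || chr == '\n' || chr == '\t')
            {
                if (start < curr)
                    base.WriteString(text.Substring(start, curr - start));
                base.WriteCharEntity(chr); // emits &#xD;, &#xA; or &#x9;
                start = curr + 1;
            }
        }
        if (start < text.Length)
            base.WriteString(text.Substring(start));
    }
}

// Hypothetical stand-in for the repair tool described in the question.
public static class RoundTripDemo
{
    public static string Repair(string xml)
    {
        XDocument document = XDocument.Parse(xml);
        document.Root.SetAttributeValue("repaired", "true"); // placeholder edit

        using (var textWriter = new StringWriter())
        {
            using (var xmlWriter = new EntitizingXmlWriterSketch(textWriter))
                document.Save(xmlWriter);
            return textWriter.ToString();
        }
    }
}
```

Passing "<root><msg>line1&#xA;line2</msg></root>" to RoundTripDemo.Repair should return markup in which the msg element still contains line1&#xA;line2, alongside the newly added attribute.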

Hope this helps!

Edit: For reference, here is an optimized version of the overridden WriteString method:

public override void WriteString(string text)
{
    // The start index of the next substring containing only non-entitized characters.
    int start = 0;

    // The index of the current character being checked.
    for (int curr = 0; curr < text.Length; ++curr)
    {
        // Check whether the current character should be entitized.
        char chr = text[curr];
        if (chr == '\r' || chr == '\n' || chr == '\t')
        {
            // Write the previous substring of non-entitized characters.
            if (start < curr)
                base.WriteString(text.Substring(start, curr - start));

            // Write current character, entitized.
            base.WriteCharEntity(chr);

            // Next substring of non-entitized characters tentatively starts
            // immediately beyond current character.
            start = curr + 1;
        }
    }

    // Write the trailing substring of non-entitized characters.
    if (start < text.Length)
        base.WriteString(text.Substring(start, text.Length - start));
}

Douglas answered Nov 15 '22 22:11
