
Linq-to-XML XElement.Remove() leaves unwanted whitespace

I have an XDocument that I create from a byte array (received over TCP/IP).

I then search for specific XML nodes (XElements) and, after retrieving each value, 'pop' the element off the XDocument by calling XElement.Remove(). After all of my parsing is complete, I want to log the XML that I did not parse (whatever remains in the XDocument). The problem is that extra whitespace remains wherever XElement.Remove() was called. I want to know the best way to remove this extra whitespace while preserving the rest of the formatting in the remaining XML.

Example/Sample Code

If I receive the following xml over the socket:

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications with XML.</description>
   </book>
</catalog>

And I use the following code to parse this xml and remove a number of the XElements:

private void socket_messageReceived(object sender, MessageReceivedEventArgs e)
{
     XDocument xDoc;
     try
     {
         using (MemoryStream xmlStream = new MemoryStream(e.XmlAsBytes))
         using (XmlTextReader reader = new XmlTextReader(xmlStream))
         {
             xDoc = XDocument.Load(reader);
         }

         XElement Author = xDoc.Root.Descendants("author").FirstOrDefault();
         XElement Title  = xDoc.Root.Descendants("title").FirstOrDefault();
         XElement Genre  = xDoc.Root.Descendants("genre").FirstOrDefault();

         // Do something with Author, Title, and Genre here...

         if (Author != null) Author.Remove();
         if (Title  != null) Title.Remove();
         if (Genre  != null) Genre.Remove();

         LogUnparsedXML(xDoc.ToString());

     }
     catch (Exception ex)
     {
         // Exception Handling here...
     }
}

Then the resulting XML string passed to the LogUnparsedXML method would be:

<?xml version="1.0"?>
<catalog>
   <book id="bk101">



      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications with XML.</description>
   </book>
</catalog>

In this contrived example it may not seem like a big deal, but in my actual application the leftover xml looks pretty sloppy. I have tried using the XDocument.ToString overload that takes a SaveOptions enum to no avail. I have also tried to call xDoc.Save to save out to a file using the SaveOptions enum. I did try experimenting with a few different linq queries that used XElement.Nodes().OfType<XText>() to try to remove the whitespace, but often I ended up taking the whitespace that I wish to preserve along with the whitespace that I am trying to get rid of.

Thanks in advance for assistance.

Joe

Joe DePung asked Jul 27 '11 21:07


3 Answers

I have a simpler solution than the accepted answer that works for my case and appears to work for yours too. There may be more complicated cases where it does not work; I'm not sure.

Here is the code:

public static void RemoveWithNextWhitespace(this XElement element)
{
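    // Despite the method name, this removes the text node immediately
    // before the element (here, the newline and indentation preceding it).
    if (element.PreviousNode is XText textNode)
    {
        textNode.Remove();
    }

    element.Remove();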
}

Here is my LINQPad query with your use case:

void Main()
{
    var xDoc = XDocument.Parse(@"<?xml version=""1.0""?>
<catalog>
   <book id=""bk101"">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications with XML.</description>
   </book>
</catalog>", LoadOptions.PreserveWhitespace);

    XElement Author = xDoc.Root.Descendants("author").FirstOrDefault();
    XElement Title = xDoc.Root.Descendants("title").FirstOrDefault();
    XElement Genre = xDoc.Root.Descendants("genre").FirstOrDefault();

    // Do something with Author, Title, and Genre here...

    if (Author != null) Author.RemoveWithNextWhitespace();
    if (Title != null) Title.RemoveWithNextWhitespace();
    if (Genre != null) Genre.RemoveWithNextWhitespace();

    xDoc.ToString().Dump();
}

static class Ext
{
    public static void RemoveWithNextWhitespace(this XElement element)
    {
        if (element.PreviousNode is XText textNode)
        {
            textNode.Remove();
        }

        element.Remove();
    }
}

The main reason I didn't use the accepted answer myself is that it did not leave my XML properly formatted in some cases. For example, in your use case, removing the "description" element would leave something that looked like this:

<catalog>
   <book id="bk101">
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
         </book>
</catalog>
phillhutt answered Sep 23 '22 19:09


It's not easy to answer in a portable way, because the solution heavily depends on how XDocument.Load() generates whitespace text nodes (and there are several implementations of LINQ to XML around that might disagree about that subtle detail).

That said, it looks like you're never removing the last child (<description>) from the <book> elements. If that's indeed the case, then we don't have to worry about the indentation of the parent element's closing tag, and we can just remove the element and all its following text nodes until we reach another element. TakeWhile() will do the job.
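
For that simpler case, where the removed element is never the last child, a minimal sketch might look like the following. The names TrailingTextSketch and RemoveWithFollowingText are made up for illustration, and the sketch assumes the document was loaded with its whitespace text nodes intact.

```csharp
using System.Linq;
using System.Xml.Linq;

static class TrailingTextSketch
{
    // Hypothetical helper (not part of LINQ to XML): removes the element
    // together with the text nodes that directly follow it, stopping at
    // the next non-text node (typically the next sibling element).
    public static void RemoveWithFollowingText(this XElement element)
    {
        // Materialize the query before removing anything, so the tree is
        // not modified while NodesAfterSelf() is still being enumerated.
        var trailingText = element.NodesAfterSelf()
                                  .TakeWhile(node => node is XText)
                                  .ToList();

        foreach (var node in trailingText)
        {
            node.Remove();
        }

        element.Remove();
    }
}
```

Note that if the element is the last child, this also removes the whitespace that indents the parent's closing tag, which is the situation the edit below deals with.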

EDIT: Well, it seems you need to remove the last child after all. Therefore, things will get more complicated. The code below implements the following algorithm:

  • If the element is not the last element of its parent:
    • Remove all following text nodes until we reach the next element.
  • Otherwise:
    • Remove all following text nodes until we find one containing a newline,
    • If that node only contains a newline:
      • Remove that node.
    • Otherwise:
      • Create a new node containing only the whitespace found after the newline,
      • Insert that node after the original node,
      • Remove the original node.
  • Remove the element itself.

The resulting code is:

public static void RemoveWithNextWhitespace(this XElement element)
{
    IEnumerable<XText> textNodes
        = element.NodesAfterSelf()
                 .TakeWhile(node => node is XText).Cast<XText>();
    if (element.ElementsAfterSelf().Any()) {
        // Easy case, remove following text nodes.
        textNodes.ToList().ForEach(node => node.Remove());
    } else {
        // Remove trailing whitespace.
        textNodes.TakeWhile(text => !text.Value.Contains("\n"))
                 .ToList().ForEach(text => text.Remove());
        // Fetch text node containing newline, if any.
        XText newLineTextNode
            = element.NodesAfterSelf().OfType<XText>().FirstOrDefault();
        if (newLineTextNode != null) {
            string value = newLineTextNode.Value;
            if (value.Length > 1) {
                // Composite text node, trim until newline (inclusive).
                newLineTextNode.AddAfterSelf(
                    new XText(value.Substring(value.IndexOf('\n') + 1)));
            }
            // Remove original node.
            newLineTextNode.Remove();
        }
    }
    element.Remove();
}

From there, you can do:

if (Author != null) Author.RemoveWithNextWhitespace();
if (Title  != null) Title.RemoveWithNextWhitespace();
if (Genre  != null) Genre.RemoveWithNextWhitespace();

Though I would suggest replacing the above with something like a loop fed from an array, or a params method call, to avoid code redundancy.
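
One possible shape for such a params method is sketched below. RemovalSketch and RemoveAll are hypothetical names, and the removal action is passed in as a delegate only so that the snippet compiles on its own; in practice you would pass the RemoveWithNextWhitespace extension defined above.

```csharp
using System;
using System.Xml.Linq;

static class RemovalSketch
{
    // Hypothetical helper: applies the given removal action to every
    // non-null element, replacing the repeated "if (x != null) ..." lines.
    public static void RemoveAll(Action<XElement> remove, params XElement[] elements)
    {
        foreach (XElement element in elements)
        {
            if (element != null)
            {
                remove(element);
            }
        }
    }
}
```

With that in place, the three lines above could be written as RemovalSketch.RemoveAll(e => e.RemoveWithNextWhitespace(), Author, Title, Genre);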

Frédéric Hamidi answered Sep 22 '22 19:09


Reading XML via an XmlReader will preserve whitespace by default, including insignificant whitespace, as you see here.

Instead, read it with insignificant whitespace ignored by setting the corresponding XmlReaderSettings property:

using (var reader = XmlReader.Create(xmlStream, new XmlReaderSettings { IgnoreWhitespace = true }))

Note that this doesn't remove significant whitespace (such as whitespace in mixed content or inside a whitespace-preserving scope), so your formatting will remain.
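
Put together with the loading code from the question, this might look like the sketch below. XmlLoadingSketch and LoadIgnoringWhitespace are hypothetical names, and xmlBytes stands in for e.XmlAsBytes from the question.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;

static class XmlLoadingSketch
{
    // Hypothetical helper: builds an XDocument from raw bytes while
    // dropping insignificant whitespace nodes at read time.
    public static XDocument LoadIgnoringWhitespace(byte[] xmlBytes)
    {
        var settings = new XmlReaderSettings { IgnoreWhitespace = true };

        using (var xmlStream = new MemoryStream(xmlBytes))
        using (var reader = XmlReader.Create(xmlStream, settings))
        {
            return XDocument.Load(reader);
        }
    }
}
```

Because the tree then contains no indentation text nodes, plain XElement.Remove() calls should leave no blank lines behind, and the indentation in the logged output is produced by ToString() rather than carried over from the input.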

Jeff Mercado answered Sep 22 '22 19:09