
HTML Agility Pack RemoveChild - not behaving as expected

Say I want to remove the span tag from this HTML, keeping its contents:

<html><span>we do like <b>bold</b> stuff</span></html>

I'd expect this chunk of code to do what I'm after:

using System.Linq;      // needed for First()
using HtmlAgilityPack;

string html = "<html><span>we do like <b>bold</b> stuff</span></html>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

HtmlNode span = doc.DocumentNode.Descendants("span").First();
span.ParentNode.RemoveChild(span, true); //second parameter is 'keepGrandChildren'

But the output looks like this:

<html> stuff<b>bold</b>we do like </html>

It appears to be reversing the order of the child nodes that were inside the span. Am I doing something wrong?

asked Oct 27 '11 at 03:10 by russau


1 Answer

This looks like a bug in HtmlAgilityPack. See the work item in their issue tracker:

http://htmlagilitypack.codeplex.com/workitem/9113

Interestingly this was raised 4 years ago...

Here's a snippet that removes all span tags (or any other tag you specify) and keeps their child nodes in the correct order.

void Main()
{
    string html = "<html><span>we do like <b>bold</b> stuff</span></html>";
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    RemoveTags(doc, "span");
    Console.WriteLine(doc.DocumentNode.OuterHtml);
}

public static void RemoveTags(HtmlDocument html, string tagName)
{
    var tags = html.DocumentNode.SelectNodes("//" + tagName);
    if (tags!=null)
    {
        foreach (var tag in tags)
        {
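            if (!tag.HasChildNodes)
            {
                // Nothing to keep, so just drop the empty tag.
                tag.ParentNode.RemoveChild(tag);
                continue;
            }

            // Walk the children last to first, inserting each one directly
            // after the tag, so they end up in their original order.
            for (var i = tag.ChildNodes.Count - 1; i >= 0; i--)
            {
                var child = tag.ChildNodes[i];
                tag.ParentNode.InsertAfter(child, tag);
            }
            // All children now sit after the tag, so the tag itself can go.
            tag.ParentNode.RemoveChild(tag);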
        }
    }
}
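
To see why the backwards loop keeps the order, here's a minimal sketch that models the sibling list with a plain List<string> instead of HtmlAgilityPack nodes. UnwrapDemo and Unwrap are made-up names for illustration only and are not part of the library.

```csharp
using System;
using System.Collections.Generic;

public static class UnwrapDemo
{
    // Hypothetical model: 'siblings' stands in for the parent's child list,
    // 'wrapperIndex' is the position of the tag being removed, and
    // 'children' are the nodes that were inside that tag.
    public static List<string> Unwrap(
        List<string> siblings, int wrapperIndex, IReadOnlyList<string> children)
    {
        var result = new List<string>(siblings);

        // Insert each child immediately after the wrapper, last child first.
        // Every insert pushes the earlier ones one slot to the right,
        // so the first child ends up closest to the wrapper.
        for (var i = children.Count - 1; i >= 0; i--)
        {
            result.Insert(wrapperIndex + 1, children[i]);
        }

        // The wrapper is still at its original index, so remove it there.
        result.RemoveAt(wrapperIndex);
        return result;
    }
}
```

Calling Unwrap with a single "<span>" sibling and the children "we do like ", "<b>bold</b>", " stuff" gives the three children back in their original order. If the loop ran forwards while still inserting right after the wrapper, the same model would produce the reversed ordering seen in the question's output.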
answered Dec 05 '22 at 04:12 by Spud