Removing useless TextNodes in HtmlAgilityPack

Question

I'm scraping a number of websites using HtmlAgilityPack. The problem is that it insists on inserting text nodes almost everywhere, and most of them are either empty or contain nothing but whitespace characters.

They cause me issues when I'm counting child nodes, since Firebug doesn't show them but HtmlAgilityPack does.

Is there a way of telling HtmlAgilityPack to stop doing this, or at least of clearing out these text nodes? (I want to keep the USEFUL ones, though.) While we're here, the same goes for comment and script tags.

Honza Kalfus · Accepted Answer

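You can use the following extension method. It returns a node's direct children with every text node filtered out, and it does not modify the document itself: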

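using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class HtmlNodeExtensions
{
    // Returns the direct children of the node, skipping every text node.
    // The underlying document is left unchanged.
    public static List<HtmlNode> GetChildNodesDiscardingTextOnes(this HtmlNode node)
    {
        return node.ChildNodes.Where(n => n.NodeType != HtmlNodeType.Text).ToList();
    }
}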

And call it like this:

List<HtmlNode> nodes = someNode.GetChildNodesDiscardingTextOnes();
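
Since the question asks to keep the useful text nodes and also to get rid of comments and script tags, here is a minimal sketch of an in-place cleanup that could sit alongside the method above. The names HtmlDocumentCleanup, RemoveNoiseNodes and IsNoise are made up for this example and are not part of HtmlAgilityPack. It assumes that "useless" means a text node that is empty or whitespace-only once entities such as &nbsp; are decoded, and it keeps any comment that looks like a doctype, on the assumption that HtmlAgilityPack reports <!DOCTYPE ...> as a comment node.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class HtmlDocumentCleanup
{
    // Hypothetical helper (not part of HtmlAgilityPack): removes whitespace-only
    // text nodes, comments and <script> elements from the document in place.
    public static void RemoveNoiseNodes(this HtmlDocument doc)
    {
        // Take a snapshot first, because removing nodes while enumerating
        // the live tree is not safe.
        var noise = doc.DocumentNode.Descendants().Where(IsNoise).ToList();

        foreach (var node in noise)
        {
            node.Remove();
        }
    }

    private static bool IsNoise(HtmlNode node)
    {
        switch (node.NodeType)
        {
            case HtmlNodeType.Comment:
                // Assumption: a doctype shows up as a comment node, so keep it.
                return !node.OuterHtml.StartsWith("<!DOCTYPE", StringComparison.OrdinalIgnoreCase);
            case HtmlNodeType.Text:
                // Decode entities so that "&nbsp;" counts as whitespace too.
                return string.IsNullOrWhiteSpace(HtmlEntity.DeEntitize(node.InnerText));
            case HtmlNodeType.Element:
                return node.Name == "script";
            default:
                return false;
        }
    }
}

static class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div>\n  <p>Hello</p>\n  <!-- note -->\n  <script>var x = 1;</script>\n</div>");

        doc.RemoveNoiseNodes();

        // Count the children that survive the cleanup.
        var div = doc.DocumentNode.SelectSingleNode("//div");
        Console.WriteLine(div.ChildNodes.Count);
    }
}
```

After this runs, ChildNodes.Count should line up much more closely with what a browser inspector shows. Be careful with pages where whitespace is significant (for example inside pre elements), because this sketch strips whitespace-only text nodes there as well.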

Tags: c#, html-agility-pack, web-scraping

Aabela
