 

Delete some tags from an HTML document with C#

Tags:

html

c#

I have an HTML document and I want to delete all the divs of a certain class (along with all of their content). What is the simplest way to do it?

Thank you for your help.

UPDATED:

I tried out the Html Agility Pack as you advised, but I couldn't get it to work. I have the following code:

    static void Main()
    {
        HtmlDocument document = new HtmlDocument();
        document.Load(FileName);
        HtmlNode node = document.DocumentNode;
        HandleNode(node);
    }

    private static void HandleNode(HtmlNode node)
    {
        while (node != null)
        {
            if (node.Name == "div")
            {
                var attribute = node.Attributes.Where(x => x.Name == "class" && x.Value == "NavContent");
                if (attribute.Any())
                    node.Remove();
            }
            foreach (var childNode in node.ChildNodes)
            {
                HandleNode(childNode);
            }
        }
    }

But it doesn't do what I want. The recursion never ends and the node name is always comment. Here's the HTML document I'm trying to parse: http://en.wiktionary.org/wiki/work Is there a good example of how to work with the Html Agility Pack? What's wrong with this piece of code?
asked Dec 13 '22 23:12 by StuffHappens


2 Answers

It depends on how complex your HTML is, but you will probably need the Agility Pack library.

Re the Update:

HandleNode() contains a while(node != null) loop but never assigns to node. I would change that to an if(...) to start with.
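
For illustration, here is a minimal sketch of that change. It is not the answerer's code. It assumes the HtmlAgilityPack NuGet package is referenced; the class name NavContentStripper and the inline sample HTML are made up for the example. Besides replacing the while with a plain null check, it iterates over a copy of ChildNodes, because removing a child while enumerating the live collection would modify it mid-iteration. The endless loop is probably also why the question only ever sees a comment node: the first child visited never changes.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is installed

public static class NavContentStripper
{
    // Recursively removes every <div class="NavContent"> below (or at) the given node.
    public static void HandleNode(HtmlNode node)
    {
        // A single check instead of while (node != null): node is never reassigned,
        // so the original loop could not terminate.
        if (node == null)
            return;

        if (node.Name == "div" && node.GetAttributeValue("class", "") == "NavContent")
        {
            node.Remove();
            return; // the whole subtree is gone, nothing left to visit
        }

        // ToList() takes a snapshot so that Remove() does not change
        // the collection that is being enumerated.
        foreach (var child in node.ChildNodes.ToList())
        {
            HandleNode(child);
        }
    }

    public static void Main()
    {
        // Hypothetical sample markup; the question loads a file with document.Load(FileName).
        var document = new HtmlDocument();
        document.LoadHtml(
            "<html><body><div class=\"NavContent\"><p>drop me</p></div>" +
            "<div class=\"keep\">keep me</div></body></html>");

        HandleNode(document.DocumentNode);

        // Print the result; document.Save(path) would write it to a file instead.
        Console.WriteLine(document.DocumentNode.OuterHtml);
    }
}
```

Note that the code in the question never writes the modified document back, so even with the recursion fixed the file on disk stays unchanged unless the document is saved.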

answered Dec 15 '22 13:12 by Henk Holterman


You're looking for the HTML Agility Pack.
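
As a rough sketch of what the simplest version might look like with that library, the example below avoids hand-written recursion and queries for the matching divs directly. It assumes the HtmlAgilityPack NuGet package; the names DivRemover and RemoveDivsByClass are hypothetical helpers invented for this example, and the class comparison is an exact match on the whole attribute value, as in the question.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is installed

public static class DivRemover
{
    // Hypothetical helper: removes every <div> whose class attribute equals className.
    public static void RemoveDivsByClass(HtmlDocument document, string className)
    {
        // Materialize the matches first so that removing nodes
        // does not interfere with the lazy Descendants() enumeration.
        var targets = document.DocumentNode
            .Descendants("div")
            .Where(div => div.GetAttributeValue("class", "") == className)
            .ToList();

        foreach (var div in targets)
        {
            div.Remove(); // drops the div together with all of its content
        }
    }

    public static void Main()
    {
        // Hypothetical sample markup standing in for the real page.
        var document = new HtmlDocument();
        document.LoadHtml(
            "<html><body><div class=\"NavContent\"><p>drop me</p></div>" +
            "<div class=\"keep\">keep me</div></body></html>");

        RemoveDivsByClass(document, "NavContent");

        Console.WriteLine(document.DocumentNode.OuterHtml);
    }
}
```

If a div carries several classes (for example class="NavContent collapsed"), the exact comparison will not match it; splitting the attribute value on whitespace would be one way to handle that case.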

answered Dec 15 '22 13:12 by SLaks