
HTML Parsing Libraries for .NET [closed]

I'm looking for libraries to parse HTML and extract links, forms, tags, etc.

  • http://www.majestic12.co.uk/projects/html_parser.php
  • http://www.netomatix.com/Products/DocumentManagement/HtmlParserNet.aspx
  • http://www.developer.com/net/csharp/article.php/2230091

LGPL or any other license friendly to commercial development is preferable.

Do you have any experience with one of these libraries? Or could you recommend another similar library?

dr. evil asked Dec 30 '22 03:12



1 Answer

The HTML Agility Pack has examples of exactly this type of thing and uses XPath, so the queries look familiar. For example (from the home page), finding all links is simply:

foreach(HtmlNode link in doc.DocumentElement.SelectNodes("//a@href")) {
    //...
}

EDIT

As of 6/19/2012, the code above, as well as the only code sample shown on the HTML Agility Pack Examples page, won't work. It just needs slight tweaking: use DocumentNode instead of DocumentElement, and write the XPath as //a[@href], as shown below.

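// requires: using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");

// SelectNodes returns null (not an empty collection) when nothing matches
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
  foreach (HtmlNode link in links)
  {
    HtmlAttribute att = link.Attributes["href"];
    att.Value = Foo(att); // fix the link; Foo stands in for your own rewrite logic
  }
}
doc.Save("file.htm");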
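
Since the question asks about extracting links and forms rather than rewriting them, here is a minimal sketch of read-only extraction with the same XPath approach. The class name LinkAndFormExtractor, the helper AttributeValues, and the inline sample markup are made up for illustration; only the HtmlAgilityPack calls (LoadHtml, SelectNodes, GetAttributeValue) come from the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class LinkAndFormExtractor // hypothetical name, for illustration only
{
    // Returns the value of attribute `attr` for every element matching `xpath`.
    public static List<string> AttributeValues(HtmlDocument doc, string xpath, string attr)
    {
        var values = new List<string>();

        // SelectNodes returns null rather than an empty collection when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null) return values;

        foreach (HtmlNode node in nodes)
        {
            // The second argument is the default used if the attribute is missing.
            values.Add(node.GetAttributeValue(attr, ""));
        }
        return values;
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        // Sample markup is an assumption; use doc.Load("file.htm") for a file on disk.
        doc.LoadHtml("<html><body><a href='/home'>Home</a>" +
                     "<form action='/search' method='get'></form></body></html>");

        foreach (string href in AttributeValues(doc, "//a[@href]", "href"))
            Console.WriteLine("link: " + href);

        foreach (string action in AttributeValues(doc, "//form[@action]", "action"))
            Console.WriteLine("form: " + action);
    }
}
```

The helper returns an empty list instead of null, so callers can loop over the result without a separate null check.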
Marc Gravell answered Jan 22 '23 01:01
