Parsing HTML page with HtmlAgilityPack

Using C#, I would like to know how to get the text box value (i.e. John) from this sample HTML:

    <TD class=texte width="50%">
    <DIV align=right>Name :<B> </B></DIV></TD>
    <TD width="50%"><INPUT class=box value=John maxLength=16 size=16 name=user_name>
    </TD>
    <TR vAlign=center>
Hassen asked Oct 03 '09 01:10


2 Answers

There are a number of ways to select elements using the Html Agility Pack.

Let's assume we have defined our HtmlDocument as follows:

    string html = @"<TD class=texte width=""50%"">
    <DIV align=right>Name :<B> </B></DIV></TD>
    <TD width=""50%"">
        <INPUT class=box value=John maxLength=16 size=16 name=user_name>
    </TD>
    <TR vAlign=center>";

    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);

1. Simple LINQ
We could use the Descendants() method, passing the name of an element we are in search of:

    var inputs = htmlDoc.DocumentNode.Descendants("input");

    foreach (var input in inputs)
    {
        Console.WriteLine(input.Attributes["value"].Value);
        // John
    }

2. More advanced LINQ
We could narrow that down by using fancier LINQ:

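    // Query syntax needs: using System.Linq;
    // GetAttributeValue returns the supplied default ("") when an input has no
    // class attribute, instead of throwing a NullReferenceException.
    var inputs = from input in htmlDoc.DocumentNode.Descendants("input")
                 where input.GetAttributeValue("class", "") == "box"
                 select input;

    foreach (var input in inputs)
    {
        Console.WriteLine(input.Attributes["value"].Value);
        // John
    }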

3. XPath
Or we could use XPath.

    string name = htmlDoc.DocumentNode
        .SelectSingleNode("//td/input")
        .Attributes["value"].Value;

    Console.WriteLine(name);
    // John
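The XPath variant throws a NullReferenceException when nothing matches or when the input has no value attribute. A more defensive version might look like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced; the class InputReader and the method GetInputValue are hypothetical names chosen for illustration, and selecting by the name attribute is one possible choice rather than the only one.

```csharp
using System;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

public static class InputReader
{
    // Hypothetical helper: returns the value attribute of the first <input>
    // whose name attribute equals inputName, or null if there is no such
    // input or it has no value attribute.
    // Assumes inputName contains no single quotes (it is embedded in the XPath).
    public static string GetInputValue(string html, string inputName)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when the expression matches nothing.
        HtmlNode input = doc.DocumentNode
            .SelectSingleNode("//input[@name='" + inputName + "']");

        // The Attributes indexer returns null for a missing attribute,
        // so the null-conditional operators cover both failure cases.
        return input?.Attributes["value"]?.Value;
    }

    public static void Main()
    {
        string html = @"<TD width=""50%""><INPUT class=box value=John maxLength=16 size=16 name=user_name> </TD>";

        Console.WriteLine(GetInputValue(html, "user_name"));        // John
        Console.WriteLine(GetInputValue(html, "password") == null); // True
    }
}
```

Matching on name rather than on position (//td/input) keeps the lookup working if the page later gains more inputs inside table cells.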
gpmcadam answered Sep 20 '22 21:09


    // Requires: using System.Xml.XPath;
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    XPathNavigator docNav = doc.CreateNavigator();

    // Select the value attribute itself rather than the input element.
    XPathNavigator node = docNav.SelectSingleNode("//td/input/@value");

    if (node != null)
    {
        Console.WriteLine("result: " + node.Value);
    }

I wrote this pretty quickly, so you'll want to do some testing with more data.

NOTE: Element names in the XPath strings apparently have to be lowercase, even though the source HTML uses uppercase tags (TD, INPUT).
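The case sensitivity can be checked with a small sketch like the one below. It assumes the HtmlAgilityPack NuGet package is referenced and relies on the library exposing element names in lowercase regardless of how they are written in the source; XPathCaseDemo and Matches are hypothetical names used only for illustration.

```csharp
using System;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

public static class XPathCaseDemo
{
    // Hypothetical helper: true if the XPath expression selects at least one node.
    public static bool Matches(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.SelectSingleNode(xpath) != null;
    }

    public static void Main()
    {
        // The markup uses uppercase tags, as in the question.
        string html = @"<TD width=""50%""><INPUT class=box value=John name=user_name> </TD>";

        // XPath name tests are case-sensitive, so only the lowercase
        // expression is expected to find the input element.
        Console.WriteLine(Matches(html, "//td/input"));
        Console.WriteLine(Matches(html, "//TD/INPUT"));
    }
}
```

If the two lines print different results, the expression should be written in lowercase no matter how the page spells its tags.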

EDIT: Apparently the beta now supports LINQ to Objects directly, so there's probably no need for the converter.

TrueWill answered Sep 18 '22 21:09