
HTML Agility Parsing

I would like to parse an HTML table and display its contents in a bound list box using LINQ to XML.

I am using the HTML Agility Pack with this code:

    HtmlWeb web = new HtmlWeb();
    HtmlAgilityPack.HtmlDocument doc = web.Load("http://www.SourceURL");
    HtmlNode rateNode = doc.DocumentNode.SelectSingleNode("//div[@id='FlightInfo_FlightInfoUpdatePanel']");
    string rate = rateNode.InnerText;
    this.richTextBox1.Text = rate;

The HTML looks like this:

    <div id="FlightInfo_FlightInfoUpdatePanel">

       <table cellspacing="0" cellpadding="0"><tbody>
         <tr class="">
         <td class="airline"><img src="/images/airline logos/NZ.gif" title="AIR NEW ZEALAND LIMITED. " alt="AIR NEW ZEALAND LIMITED. " /></td>
         <td class="flight">NZ8</td>
         <td class="codeshare">&nbsp;</td>
         <td class="origin">San Francisco</td>
         <td class="date">01 Sep</td>
         <td class="time">17:15</td>
         <td class="est">18:00</td>
         <td class="status">DEPARTED</td>
         </tr>

But it is returning this:

NZ8&nbsp;San Francisco01 Sep17:1518:00DEPARTEDAC6103NZ8San Francisco01 Sep17:1518:00DEPARTEDCO6754NZ8San Francisco01 Sep17:1518:00DEPARTEDLH7157NZ8San Francisco01 Sep17:1518:00DEPARTEDUA6754NZ8San Francisco01 Sep17:1518:00DEPARTEDUS5308NZ8San Francisco01 Sep17:1518:00DEPARTEDVS7408NZ8San Francisco01 Sep17:1518:00DEPARTEDEK407&nbsp;Melbourne/Dubai01 Sep17:5017:50DEPARTEDEK413&nbsp;Sydney/Dubai01 Sep18:0018:00DEPARTEDQF44&nbsp;Sydney01 

What I would like is to parse this into XML and then use LINQ to XML to turn that XML into the ItemsSource of a bound list box.

I think I need a variation of the line below for each class, but I would like some help.

    HtmlNodeCollection cols = rows[i].SelectNodes(".//td[@class='flight']");
Rhys asked Sep 01 '11 07:09



1 Answer

You are using InnerText, which returns only the text content and strips out the markup.

Use InnerHtml instead:

    string rate = rateNode.InnerHtml;

You can create an XML document from this string (assuming it is valid XML).
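
As a rough sketch of that XML route, the snippet below parses such a string with LINQ to XML and projects each row into a dictionary keyed by the cell's class. It assumes the markup is well formed apart from the HTML-only entity &nbsp;, which XML does not define and which is therefore swapped for its numeric form first. The class name InnerHtmlToXml, the method ParseFlights, and the shortened sample string are hypothetical stand-ins, not part of the original page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class InnerHtmlToXml
{
    // Parses a well-formed <table> fragment and returns one dictionary per <tr>,
    // mapping each cell's class attribute to its trimmed text.
    public static List<Dictionary<string, string>> ParseFlights(string innerHtml)
    {
        // &nbsp; is not a predefined XML entity; XElement.Parse would throw on it.
        XElement table = XElement.Parse(innerHtml.Replace("&nbsp;", "&#160;"));

        return table.Descendants("tr")
            .Select(tr => tr.Elements("td")
                .ToDictionary(td => (string)td.Attribute("class"),
                              td => td.Value.Trim()))
            .ToList();
    }

    static void Main()
    {
        // Hypothetical stand-in for rateNode.InnerHtml, cut down to one complete row.
        string innerHtml =
            "<table><tbody><tr class=\"\">" +
            "<td class=\"flight\">NZ8</td>" +
            "<td class=\"codeshare\">&nbsp;</td>" +
            "<td class=\"origin\">San Francisco</td>" +
            "<td class=\"status\">DEPARTED</td>" +
            "</tr></tbody></table>";

        foreach (var f in ParseFlights(innerHtml))
            Console.WriteLine(f["flight"] + " from " + f["origin"] + ": " + f["status"]);
    }
}
```

Real-world HTML is often not well-formed XML (unclosed tags, unquoted attributes), in which case XElement.Parse throws an XmlException, so this route is only reasonable when you control or trust the markup.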

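You can also query rateNode in the same way you retrieved it, by selecting its child nodes. Note that XPath positions start at 1, not 0, and that SelectSingleNode returns an HtmlNode, so read its InnerText to get the string:

    var firstRow = rateNode.SelectSingleNode("./table/tbody/tr[1]");
    string origin = firstRow.SelectSingleNode("./td[@class='origin']").InnerText;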
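
Taking the node-based approach a step further, one possible sketch is to walk every row, build an XElement per row directly from the Html Agility Pack nodes, and then query the result with LINQ to XML. This avoids requiring the HTML itself to be valid XML. The names FlightTableXml and ToXml, the element names flights and row, and the shortened sample markup are assumptions for illustration; the sketch also assumes each class value is a legal XML element name (a single token such as flight or origin).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using HtmlAgilityPack; // NuGet package, not part of the base class library

static class FlightTableXml
{
    // Builds <flights><row><flight>..</flight><origin>..</origin>..</row></flights>
    // from every <tr> found under the given container node.
    public static XElement ToXml(HtmlNode container)
    {
        // SelectNodes returns null, not an empty collection, when nothing matches.
        IEnumerable<HtmlNode> rows =
            (IEnumerable<HtmlNode>)container.SelectNodes(".//tr") ?? Enumerable.Empty<HtmlNode>();

        return new XElement("flights",
            rows.Select(tr => new XElement("row",
                tr.Elements("td")
                  // Skip cells without a class; the class value becomes the element name.
                  .Where(td => td.GetAttributeValue("class", "") != "")
                  .Select(td => new XElement(
                      td.GetAttributeValue("class", ""),
                      // Decode entities such as &nbsp; and trim surrounding whitespace.
                      HtmlEntity.DeEntitize(td.InnerText).Trim())))));
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        // Hypothetical sample, shortened from the markup in the question.
        doc.LoadHtml(
            "<div id='FlightInfo_FlightInfoUpdatePanel'><table><tbody><tr class=''>" +
            "<td class='flight'>NZ8</td><td class='codeshare'>&nbsp;</td>" +
            "<td class='origin'>San Francisco</td><td class='status'>DEPARTED</td>" +
            "</tr></tbody></table></div>");

        HtmlNode panel = doc.DocumentNode.SelectSingleNode(
            "//div[@id='FlightInfo_FlightInfoUpdatePanel']");
        XElement flights = ToXml(panel);

        // LINQ to XML projection; a list like this could serve as a binding source.
        var items = flights.Elements("row")
            .Select(r => new { Flight = (string)r.Element("flight"),
                               Origin = (string)r.Element("origin") })
            .ToList();

        foreach (var item in items)
            Console.WriteLine(item.Flight + " from " + item.Origin);
    }
}
```

In a UI project, the items list could then be assigned to the list control's ItemsSource, with the control's display template bound to Flight and Origin.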
Oded answered Oct 13 '22 00:10