I have read that HTMLAgility 1.4
is a great solution to scraping a webpage. Being a new programmer I am hoping I could get some input on this project.
I am doing this as a C#
application form. The page I am working with is fairly straight forward. The information I need is stuck between just 2 tags <table class="data">
and </table>
.
My goal is to pull the data for Part-Num
, Manu-Number
, Description
, Manu-Country
, Last Modified
, Last Modified By
, out of the page and send the data to a SQL
table.
One twist is that there is also a small PNG
picture that also need to be grabbed from the src="/partcode/number
.
I do not have any completed code that woks. I thought this bit of code would tell me if I am heading in the right direction. Even stepping into the debug I can’t see that it does anything. Could someone possibly point me in the right direction on this. The more detailed the better since it is apparent I have a lot to learn.
Thank you I would really appreciate it.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;
using System.Xml;
namespace Stats
{
class PartParser
{
static void Main(string[] args)
{
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("http://localhost");
//My understanding this reads the entire page in?
var tables = doc.DocumentNode.SelectNodes("//table");
// I assume that this sets up the search for words containing table
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
Console.WriteLine(ex.StackTrace);
Console.ReadKey();
}
}
}
The web code is:
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
<title>Part Number Database: Item Record</title>
<table class="data">
<tr><td>Part-Num</td><td width="50"></td><td>
<img src="/partcode/number/072140" alt="072140"/></td></tr>
<tr><td>Manu-Number</td><td width="50"></td><td>
<img src="/partcode/manu/00721408" alt="00721408" /></td></tr>
<tr><td>Description</td><td></td><td>Widget 3.5</td></tr>
<tr><td>Manu-Country</td><td></td><td>United States</td></tr>
<tr><td>Last Modified</td><td></td><td>26 Jan 2009, 8:08 PM</td></tr>
<tr><td>Last Modified By</td><td></td><td>Manu</td></tr>
</table>
<head/>
</html>
The beginning part is off:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("http://localhost");
LoadHtml(html)
loads an html string into the document, I think you want something like this instead:
HtmlWeb htmlWeb = new HtmlWeb();
HtmlDocument doc = htmlWeb.Load("http://stackoverflow.com");
A working code, according to the HTML source you provided. It can be factorized, and I'm not checking for null values (in rows
, cells
, and each value inside the case
). If you have the page in 127.0.0.1, that will work. Just paste it inside the Main
method of a Console Application and try to understand it.
HtmlDocument doc = new HtmlWeb().Load("http://127.0.0.1");
var rows = doc.DocumentNode.SelectNodes("//table[@class='data']/tr");
foreach (var row in rows)
{
var cells = row.SelectNodes("./td");
string title = cells[0].InnerText;
var valueRow = cells[2];
switch (title)
{
case "Part-Num":
string partNum = valueRow.SelectSingleNode("./img[@alt]").Attributes["alt"].Value;
Console.WriteLine("Part-Num:\t" + partNum);
break;
case "Manu-Number":
string manuNumber = valueRow.SelectSingleNode("./img[@alt]").Attributes["alt"].Value;
Console.WriteLine("Manu-Num:\t" + manuNumber);
break;
case "Description":
string description = valueRow.InnerText;
Console.WriteLine("Description:\t" + description);
break;
case "Manu-Country":
string manuCountry = valueRow.InnerText;
Console.WriteLine("Manu-Country:\t" + manuCountry);
break;
case "Last Modified":
string lastModified = valueRow.InnerText;
Console.WriteLine("Last Modified:\t" + lastModified);
break;
case "Last Modified By":
string lastModifiedBy = valueRow.InnerText;
Console.WriteLine("Last Modified By:\t" + lastModifiedBy);
break;
}
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With