
Perform Web Data Extraction

I have installed HtmlAgilityPack, but once I have loaded the document I cannot work out how to extract the table row whose first td element contains today's date in the format dd-mmm-yy.

Can anybody point me in the right direction with a code snippet?

At present I have:

HtmlDocument doc = new HtmlDocument();
doc.Load("http://lbma.org.uk/pages/printerFriendly.cfm?thisURL=index.cfm&title=gold_fixings&page_id=53&show=2012&type=daily");
foreach(HtmlNode tr in doc.DocumentNode.SelectNodes("tr"))
{
            
}
Gravy asked Jan 22 '26 13:01


1 Answer

Fun. That page is horribly malformed HTML, so I can see your problem. Still, I wouldn't touch XPath with a 10-foot pole. LINQ makes life so much easier: load the page with HtmlWeb, then filter the tr descendants by their text.

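// Needs: System, System.Collections.Generic, System.Globalization, System.Linq, HtmlAgilityPack
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://lbma.org.uk/pages/printerFriendly.cfm?thisURL=index.cfm&title=gold_fixings&page_id=53&show=2012&type=daily");

// Format with the invariant culture so the month abbreviation is English regardless of machine locale.
string today = DateTime.Today.ToString("dd-MMM-yy", CultureInfo.InvariantCulture);

// First row whose text starts with today's date (leading whitespace ignored).
HtmlNode todaysRow = doc.DocumentNode.Descendants("tr")
    .FirstOrDefault(n => n.InnerText.TrimStart().StartsWith(today, StringComparison.InvariantCultureIgnoreCase));
if (todaysRow != null)
{
    List<HtmlNode> cells = todaysRow.Descendants("td").ToList();
    decimal usd = decimal.Parse(cells[1].FirstChild.InnerText, CultureInfo.InvariantCulture);
    decimal gbp = decimal.Parse(cells[2].FirstChild.InnerText, CultureInfo.InvariantCulture);
    // ... etc
}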
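
If matching on the whole row's text turns out to be too loose, one variant is to compare only the first td of each row against the formatted date. This is a rough sketch, not tested against the live page: FixingRows and FindRowForDate are names I made up for illustration, and it assumes the date cell contains nothing but the date (plus optional whitespace or &nbsp;).

```csharp
using System;
using System.Globalization;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper (not part of HtmlAgilityPack): finds the row whose first cell holds a given date.
public static class FixingRows
{
    public static HtmlNode FindRowForDate(HtmlDocument doc, DateTime date)
    {
        // dd-MMM-yy with the invariant culture yields English month abbreviations.
        string wanted = date.ToString("dd-MMM-yy", CultureInfo.InvariantCulture);

        return doc.DocumentNode.Descendants("tr").FirstOrDefault(tr =>
        {
            // First td anywhere under the row; Descendants tolerates stray wrapper nodes.
            HtmlNode firstCell = tr.Descendants("td").FirstOrDefault();
            if (firstCell == null) return false;

            // Decode entities such as &nbsp; and trim before comparing.
            string text = HtmlEntity.DeEntitize(firstCell.InnerText).Trim();
            return string.Equals(text, wanted, StringComparison.OrdinalIgnoreCase);
        });
    }
}
```

Call it as FixingRows.FindRowForDate(doc, DateTime.Today) and check the result for null, since there may be no row for today (weekends, holidays, or a page that has not been updated yet).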
Jacob Proffitt answered Jan 25 '26 07:01


