
IronPython, Beautiful Soup, Win32 app


Does Beautiful Soup work with IronPython? If so, with which version of IronPython? How easy is it to distribute a Windows desktop app on .NET 2.0 using IronPython (mostly C# calling some Python code for parsing HTML)?

Vasil asked Sep 23 '08 01:09



1 Answer

I was asking myself this same question. After struggling to follow advice here and elsewhere to get IronPython and BeautifulSoup to play nicely with my existing code, I decided to look for a native .NET alternative. BeautifulSoup is a wonderful bit of code, and at first it didn't look like there was anything comparable for .NET, but then I found the HTML Agility Pack. If anything, I think I've actually gained some maintainability over BeautifulSoup. It takes clean or crufty HTML and produces an elegant XML-style DOM that can be queried via XPath. With a couple of lines of code you can even get back a raw XDocument and then craft your queries in LINQ to XML. Honestly, if web scraping is your goal, this is about the cleanest solution you are likely to find.
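For the XDocument route, something along these lines should work. This is only a sketch, not code from a real project: the sample markup, the class name, and the "tr"/"td" query are made up for illustration, and it assumes the HTML Agility Pack's `OptionOutputAsXml` setting yields XML that `XDocument.Parse` will accept, which can fail on sufficiently mangled pages. LINQ to XML also needs .NET 3.5 or later, so it is not an option on .NET 2.0.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;
using HtmlAgilityPack;

class XDocumentSketch
{
    static void Main()
    {
        // Hypothetical sample markup; the unclosed <br> is deliberate.
        string html =
            "<html><body><table>" +
            "<tr><td>Jan 1</td><td>New Year's Day<br></td></tr>" +
            "<tr><td>Jul 4</td><td>Independence Day</td></tr>" +
            "</table></body></html>";

        // Ask the Agility Pack to serialize the parsed document as XML.
        HtmlDocument doc = new HtmlDocument();
        doc.OptionOutputAsXml = true;
        doc.LoadHtml(html);

        // Round-trip through a string to obtain an XDocument.
        StringWriter writer = new StringWriter();
        doc.Save(writer);
        XDocument xdoc = XDocument.Parse(writer.ToString());

        // Query with LINQ to XML instead of XPath.
        var rows = from tr in xdoc.Descendants("tr")
                   let cells = tr.Elements("td").ToList()
                   where cells.Count >= 2
                   select new { Date = cells[0].Value, Event = cells[1].Value };

        foreach (var row in rows)
        {
            Console.WriteLine(row.Date + ": " + row.Event);
        }
    }
}
```

If the round trip through a string turns out to be too fragile for a given site, querying the `HtmlNode` tree directly with XPath avoids the XML conversion altogether.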

Edit

Here is a simple (read: not robust at all) example that parses out the US House of Representatives holiday schedule:

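using System;
using System.Collections.Generic;
using HtmlAgilityPack;

namespace GovParsingTest
{
    class Program
    {
        static void Main(string[] args)
        {
            // Download and parse the page.
            HtmlWeb hw = new HtmlWeb();
            string url = @"http://www.house.gov/house/House_Calendar.shtml";
            HtmlDocument doc = hw.Load(url);

            // Find the content div, then every table row beneath it.
            HtmlNode docNode = doc.DocumentNode;
            HtmlNode div = docNode.SelectSingleNode("//div[@id='primary']");
            HtmlNodeCollection tableRows = (div == null) ? null : div.SelectNodes(".//tr");

            // SelectNodes returns null (not an empty collection) when nothing matches.
            if (tableRows == null)
            {
                Console.WriteLine("No table rows found; the page layout may have changed.");
                return;
            }

            foreach (HtmlNode row in tableRows)
            {
                // Skip rows without at least a date cell and an event cell (e.g. header rows).
                HtmlNodeCollection cells = row.SelectNodes(".//td");
                if (cells == null || cells.Count < 2)
                {
                    continue;
                }

                HtmlNode dateNode = cells[0];
                HtmlNode eventNode = cells[1];

                // Drill down to the innermost first child to get past any wrapper tags.
                while (eventNode.HasChildNodes)
                {
                    eventNode = eventNode.FirstChild;
                }

                Console.WriteLine(dateNode.InnerText);
                Console.WriteLine(eventNode.InnerText);
                Console.WriteLine();
            }

            //Console.WriteLine(div.InnerHtml);
            Console.ReadKey();
        }
    }
}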
bouvard answered Sep 24 '22 06:09
