
Porting a very Pythonesque library over to .NET

I'm investigating the possibility of porting the Python library Beautiful Soup over to .NET, mainly because I really love the parser and there are simply no good HTML parsers for the .NET Framework (Html Agility Pack is outdated, buggy, undocumented, and doesn't work well unless the exact schema is known).

One of my primary goals is to get the basic DOM selection functionality to really parallel the beauty and simplicity of BeautifulSoup, allowing developers to easily craft expressions to find elements they're looking for.

BeautifulSoup takes advantage of Python's loose binding and keyword arguments to make this happen. For example, to find all a tags with an id of test and a title that contains the word foo, I could write:

soup.find_all('a', id='test', title=re.compile('foo'))

However, C# has no concept of an arbitrary number of named arguments. C# 4.0 (which shipped with .NET 4) does have named arguments, but they have to match parameters declared in an existing method prototype.

My Question: What is the C# design pattern that most parallels this Pythonic construct?

Some Ideas:

I'd like to approach this based on how I, as a developer, would like to code. Implementing this is outside the scope of this post. One idea I had would be to use anonymous types. Something like:

soup.FindAll("a", new { Id = "Test", Title = new Regex("foo") });

Though this syntax loosely matches the Python implementation, it still has some disadvantages.

  1. The FindAll implementation would have to use reflection to parse the anonymous type, and handle any arbitrary metadata in a reasonable manner.
  2. The FindAll prototype would need to take an Object, which makes it fairly unclear how to use the method unless you're already familiar with the documented behavior. I don't believe there's a way to declare a method that must take an anonymous type.
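
To illustrate the reflection step from point 1, here is a minimal sketch. AttributeFilter, FromAnonymous and Matches are hypothetical names invented for this example, and the sketch assumes the parser stores a node's attributes in a dictionary with lower-cased names. It is one possible shape, not a finished design:

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;
using System.Text.RegularExpressions;

// Hypothetical helper; not part of Beautiful Soup or any existing .NET library.
public static class AttributeFilter
{
    // Reads the public properties of an anonymous object such as
    // new { Id = "Test", Title = new Regex("foo") } into a dictionary
    // keyed by lower-cased property name ("id", "title").
    public static Dictionary<string, object> FromAnonymous(object filter)
    {
        var map = new Dictionary<string, object>();
        if (filter == null) return map;

        foreach (PropertyInfo prop in
                 filter.GetType().GetProperties(BindingFlags.Public | BindingFlags.Instance))
        {
            if (prop.GetIndexParameters().Length > 0) continue; // skip indexers
            map[prop.Name.ToLowerInvariant()] = prop.GetValue(filter, null);
        }
        return map;
    }

    // Assumption: the node's attribute names are already lower-cased.
    // Returns true only when every expected attribute is present and matches.
    public static bool Matches(IDictionary<string, string> attributes,
                               Dictionary<string, object> filter)
    {
        foreach (KeyValuePair<string, object> expected in filter)
        {
            string actual;
            if (!attributes.TryGetValue(expected.Key, out actual)) return false;

            // A Regex value is matched as a pattern; anything else is compared as text.
            var pattern = expected.Value as Regex;
            bool ok = pattern != null
                ? pattern.IsMatch(actual)
                : actual == Convert.ToString(expected.Value);
            if (!ok) return false;
        }
        return true;
    }
}
```

A FindAll(string tag, object filter) method could call FromAnonymous once and then run Matches against each candidate node. The second disadvantage still stands: nothing in the signature tells the caller what shape the object should have.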

Another idea, perhaps a more .NET way of handling this though it strays further from the library's Python roots, would be to use a fluent pattern. Something like:

soup.FindAll("a")
    .Attr("id", "Test")
    .Attr("title", new Regex("foo"));

This would require building an expression tree and locating the appropriate nodes in the DOM.
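
As a rough sketch of how the fluent version might hang together, the example below collects plain predicates in a list instead of building a real expression tree, which is a simpler stand-in for the same idea. Node and NodeQuery are hypothetical types made up for illustration, and the flat sequence of nodes stands in for a DOM traversal:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical node type standing in for a parsed element.
public sealed class Node
{
    public Node(string tag)
    {
        Tag = tag;
        Attributes = new Dictionary<string, string>();
    }

    public string Tag { get; private set; }
    public Dictionary<string, string> Attributes { get; private set; }
}

// Hypothetical query object: each Attr call adds a predicate and returns
// the same query, and nothing is evaluated until the query is enumerated.
public sealed class NodeQuery : IEnumerable<Node>
{
    private readonly IEnumerable<Node> source;
    private readonly List<Func<Node, bool>> predicates = new List<Func<Node, bool>>();

    public NodeQuery(IEnumerable<Node> source, string tag)
    {
        this.source = source;
        predicates.Add(n => n.Tag == tag);
    }

    // Exact-value match on an attribute.
    public NodeQuery Attr(string name, string value)
    {
        predicates.Add(n => HasAttr(n, name, actual => actual == value));
        return this;
    }

    // Regex match on an attribute.
    public NodeQuery Attr(string name, Regex pattern)
    {
        predicates.Add(n => HasAttr(n, name, actual => pattern.IsMatch(actual)));
        return this;
    }

    private static bool HasAttr(Node n, string name, Func<string, bool> test)
    {
        string actual;
        return n.Attributes.TryGetValue(name, out actual) && test(actual);
    }

    // A node is returned only if it satisfies every accumulated predicate.
    public IEnumerator<Node> GetEnumerator()
    {
        return source.Where(n => predicates.All(p => p(n))).GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
```

Under these assumptions, soup.FindAll("a") would return new NodeQuery(allNodes, "a"), and the overloads of Attr give the compile-time guidance that the anonymous-type version lacks.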

The third and last idea I have would be to use LINQ. Something like:

var nodes = (from n in soup
             where n.Tag == "a" &&
             n["id"] == "Test" &&
             Regex.Match(n["title"], "foo").Success
             select n);
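
For that query to compile, soup only needs to be enumerable and each node needs a string indexer. The sketch below shows one way this could look. Element, Soup and SoupQueries are hypothetical names, and the indexer returning null for a missing attribute is an assumption (which is why the query adds a null check before the regex test):

```csharp
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical element type; the indexer returns null for a missing attribute.
public sealed class Element
{
    private readonly Dictionary<string, string> attributes = new Dictionary<string, string>();

    public Element(string tag) { Tag = tag; }

    public string Tag { get; private set; }

    public string this[string name]
    {
        get
        {
            string value;
            return attributes.TryGetValue(name, out value) ? value : null;
        }
        set { attributes[name] = value; }
    }
}

// Hypothetical document type: enumerating it yields every element,
// which is all the LINQ query syntax requires.
public sealed class Soup : IEnumerable<Element>
{
    private readonly List<Element> elements = new List<Element>();

    public void Add(Element element) { elements.Add(element); }

    public IEnumerator<Element> GetEnumerator() { return elements.GetEnumerator(); }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}

public static class SoupQueries
{
    // Same query as above, with a guard so a missing title does not throw.
    public static List<Element> LinksWithFoo(Soup soup)
    {
        var nodes = from n in soup
                    where n.Tag == "a" &&
                          n["id"] == "Test" &&
                          n["title"] != null &&
                          Regex.IsMatch(n["title"], "foo")
                    select n;
        return nodes.ToList();
    }
}
```

This is the least work on the library side, since LINQ to Objects does the filtering, but it is also the furthest from the one-line feel of find_all.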

I'd appreciate any insight from anyone with experience porting Python code to C#, or just overall recommendations on the best way to handle this situation.

Mike Christensen asked Nov 13 '22 06:11



1 Answer

Have you tried running your code inside the IronPython engine? As far as I know, it performs really well, and you don't have to touch your Python code.

Ale Miralles answered Dec 04 '22 21:12
