 

C# Is there a LINQ to HTML, or some other good .NET HTML manipulation API?


I have a C# WPF application that needs to consume data that is exposed on a webpage as an HTML table.

After getting inspiration from this URL, I tried using LINQ to XML to parse the HTML document, but this only works if the document is extremely well formed (and doesn't have any comments or HTML entities inside it). I have managed to get a working solution using this technique, but it is far from ideal.
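To make the problem concrete, here is a minimal sketch of how LINQ to XML reacts to typical HTML. The class and method names (`LinqToXmlDemo`, `ParsesAsXml`) and the sample markup are made up for illustration; only `XDocument.Parse` and `XmlException` come from the framework.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class LinqToXmlDemo
{
    // Hypothetical helper: returns true if the markup parses as XML, false otherwise.
    public static bool ParsesAsXml(string markup)
    {
        try
        {
            XDocument.Parse(markup);
            return true;
        }
        catch (XmlException)
        {
            // Thrown for anything that is not well-formed XML.
            return false;
        }
    }

    public static void Main()
    {
        // Well-formed markup parses fine.
        Console.WriteLine(ParsesAsXml("<table><tr><td>1</td></tr></table>")); // True

        // &nbsp; is an HTML entity that XML does not declare.
        Console.WriteLine(ParsesAsXml("<table><tr><td>a&nbsp;b</td></tr></table>")); // False

        // <br> without a closing tag is valid HTML but not valid XML.
        Console.WriteLine(ParsesAsXml("<p>one<br>two</p>")); // False
    }
}
```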

I am after a solution that is intended for parsing HTML. I have hacked "solutions" before, but they are brittle. I am after a robust way of parsing/manipulating the document. I'd ideally like something that makes the task as easy as it would be from JavaScript/jQuery.

Does anyone know of a good .NET library or utility for parsing/manipulating HTML?

Doctor Jones asked Feb 12 '09 16:02


3 Answers

Even though it's not LINQ based, I suggest researching the HTML Agility Pack from CodePlex.

Note: HTML Agility Pack now supports LINQ to Objects (via a LINQ to XML-like interface).

From the HTML Agility Pack page:

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
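As a rough sketch of what reading a table could look like with this library, the example below uses the LINQ-style `Descendants`/`Elements` methods. It assumes the `HtmlAgilityPack` NuGet package is referenced; the `TableReader` class, the `ReadRows` method, and the sample markup are hypothetical and not taken from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class TableReader
{
    // Hypothetical helper: returns the text of every <td> cell, grouped by <tr> row.
    public static List<List<string>> ReadRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of entities and unclosed tags such as <br>

        return doc.DocumentNode
            .Descendants("tr")
            .Select(tr => tr.Elements("td")
                // InnerText keeps entities as written, so decode them explicitly.
                .Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim())
                .ToList())
            .Where(cells => cells.Count > 0) // skips rows that contain only <th> cells
            .ToList();
    }

    public static void Main()
    {
        // Made-up markup containing an entity and an unclosed <br>.
        const string html =
            "<table>" +
            "<tr><th>Name</th><th>Note</th></tr>" +
            "<tr><td>Fish &amp; Chips</td><td>hot<br>fresh</td></tr>" +
            "</table>";

        foreach (var row in ReadRows(html))
        {
            Console.WriteLine(string.Join(" | ", row));
        }
    }
}
```

If XPath is more comfortable than LINQ, `doc.DocumentNode.SelectNodes("//tr")` is the equivalent starting point; note that it returns null rather than an empty collection when nothing matches.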

LaptopHeaven answered Sep 24 '22 19:09


There's a LINQ to HTML library here:

http://www.superstarcoders.com/linq-to-html.aspx

keith answered Sep 23 '22 19:09


HTML is rarely well-formed enough that you could reliably use LINQ to XML. It's conceivable that you might find an HTML "cleaner" that could fix the formatting well enough to be read, but there's no telling how robust it would be.

I assume this is a "screen scraper" that reads from an HTML table over which you have no control. Don't stress over robustness in this case; screen scraping is inherently brittle. If your requirements are set in stone, design the scraper to be easily updatable if/when the HTML you are scraping changes.
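One possible way to keep the scraper easy to update is to hold everything page-specific in one small settings object, so a layout change means editing selectors and not parsing logic. This is only a sketch: it assumes the HtmlAgilityPack NuGet package for parsing, and `ScraperSettings`, `Scraper`, and the `prices` table id are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

// Hypothetical settings class: the only place that knows the page layout.
public sealed class ScraperSettings
{
    public string RowXPath { get; set; } = "//table[@id='prices']//tr"; // made-up table id
    public string CellXPath { get; set; } = "./td";
}

public static class Scraper
{
    public static List<string[]> Scrape(string html, ScraperSettings settings)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var rows = doc.DocumentNode.SelectNodes(settings.RowXPath);
        if (rows == null)
        {
            // Fail loudly so a changed page is noticed instead of yielding empty data.
            throw new InvalidOperationException(
                "No rows matched '" + settings.RowXPath + "'; the page layout may have changed.");
        }

        var result = new List<string[]>();
        foreach (var row in rows)
        {
            var cells = row.SelectNodes(settings.CellXPath);
            if (cells == null) continue; // e.g. a header row that only has <th> cells

            result.Add(cells
                .Select(c => HtmlEntity.DeEntitize(c.InnerText).Trim())
                .ToArray());
        }
        return result;
    }
}
```

The XPath strings could just as well be loaded from a config file, so that a change to the page does not require recompiling the application.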

Dave Swersky answered Sep 24 '22 19:09