Can Html Agility Pack be used to parse HTML fragments?

I need to get LINK and META elements from ASP.NET pages, user controls and master pages, grab their contents and then write back updated values to these files in a utility I'm working on.

I could try using regular expressions to grab just these elements but there are several issues with that approach:

  • I expect many of the input files to contain broken HTML (missing / out-of-sequence elements, etc.)
  • SCRIPT elements that contain comments and/or VBScript/JavaScript that looks like valid elements, etc.
  • I need to be able to special-case IE conditional comments and META and LINK elements inside IE conditional comments
  • Not to mention how HTML is not a regular language

I did some research for HTML parsers in .NET and many SO posts and blogs recommend the HTML Agility Pack. I've never used it before and I don't know if it can parse broken HTML and HTML fragments. (For example, imagine a user control that only contains a HEAD element with some content in it - no HTML or BODY.) I know I could read the documentation but it'd save me quite a bit of time if someone could advise. (Most SO posts involve parsing full HTML pages.)

xxbbcc asked Sep 21 '12 14:09


1 Answer

Absolutely, that is what it excels at.

In fact, many web pages you'll find in the wild could be described as HTML fragments, because of missing <html> tags or improperly closed tags.

The HtmlAgilityPack simulates what the browser has to do: try to make sense of what is sometimes a jumble of mismatched tags. It's an imperfect science, but HtmlAgilityPack does it very well.
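
As a rough illustration of the fragment case from the question, here is a minimal sketch that loads a bare HEAD element (no HTML or BODY), updates its META and LINK elements, and writes the markup back out. The fragment text, attribute values and class name are made up for the example, and it assumes the HtmlAgilityPack NuGet package is referenced. It does not cover IE conditional comments (which I would expect to come through as comment nodes needing separate handling) or ASP.NET server-side directives and tags, so test those against your real files.

```csharp
using System;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

class FragmentDemo
{
    static void Main()
    {
        // Hypothetical user-control content: a bare HEAD, no HTML or BODY.
        string fragment =
            "<head>\n" +
            "  <meta name=\"description\" content=\"old text\">\n" +
            "  <link rel=\"stylesheet\" href=\"/css/old.css\">\n" +
            "</head>";

        var doc = new HtmlDocument();
        doc.LoadHtml(fragment);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//meta | //link");
        if (nodes != null)
        {
            foreach (HtmlNode node in nodes)
            {
                // Element names are exposed in lower case by the parser.
                if (node.Name == "link")
                    node.SetAttributeValue("href", "/css/new.css");
                else
                    node.SetAttributeValue("content", "new text");
            }
        }

        // Problems the parser noticed are collected here rather than thrown.
        foreach (HtmlParseError error in doc.ParseErrors)
            Console.WriteLine($"line {error.Line}: {error.Reason}");

        // Serialize the (possibly modified) fragment back to markup.
        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```

For files on disk, doc.Load(path) and doc.Save(path) could replace the string round trip. It is worth diffing the saved output against the original first, since the parser may normalize parts of the markup you did not intend to touch.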

D'Arcy Rittich answered Sep 21 '22 04:09