Is there an XmlReader equivalent for HTML in .NET?

I've used HtmlAgilityPack in the past to parse HTML in .NET, but I don't like that it only offers a DOM model.

On large or deeply nested documents it is possible to hit stack overflow or out-of-memory exceptions. More generally, DOM-based parsing uses significantly more memory than a streaming approach, because the whole document is held in memory even when the process consuming the HTML only needs a few elements at a time.

Does anyone know of a decent HTML parser for .NET that lets you parse HTML the way the XmlReader class does, i.e. in a forward-only, streaming manner?

RobV asked Jun 23 '11 10:06

2 Answers

I usually use SgmlReader for this: https://github.com/MindTouch/SGMLReader

As others have said, HTML doesn't follow XML's well-formedness rules, so it is inherently harder to parse, but SgmlReader usually does a pretty good job.
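
A minimal sketch of what that looks like, assuming the SgmlReader NuGet package is referenced (it exposes `Sgml.SgmlReader`, which derives from `XmlReader`). The sample markup is made up for illustration, and the exact sequence of nodes you get back depends on how SgmlReader repairs the input:

```csharp
using System;
using System.IO;
using System.Xml;
using Sgml; // from the SgmlReader package (assumed to be installed)

class Program
{
    static void Main()
    {
        // Hypothetical, deliberately sloppy HTML: unclosed <p> and <br>.
        const string html = "<html><body><p>One<p>Two<br></body></html>";

        using (var reader = new SgmlReader
        {
            DocType = "HTML",                     // use the built-in HTML DTD
            CaseFolding = CaseFolding.ToLower,    // normalise tag names
            WhitespaceHandling = WhitespaceHandling.None,
            InputStream = new StringReader(html)
        })
        {
            // Forward-only loop, exactly as with any other XmlReader.
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                    Console.WriteLine("element: " + reader.Name);
                else if (reader.NodeType == XmlNodeType.Text)
                    Console.WriteLine("text:    " + reader.Value);
            }
        }
    }
}
```

Because it is an `XmlReader`, only the current node is exposed at any time, so you never have to build a DOM for the whole page.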

Mike Mooney answered Oct 29 '22 17:10

The problem is that HTML can be malformed, and you can't know which tag is missing its end tag (or which tags are in the wrong order) until you have parsed a larger part of the document.

If the documents you'll be parsing are well-formed, why not use XmlReader?
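
For example, a rough sketch that assumes the input really is well-formed (XHTML-style) markup; the sample string is hypothetical:

```csharp
using System;
using System.IO;
using System.Xml;

class Program
{
    static void Main()
    {
        // Hypothetical well-formed input; real-world HTML often is not.
        const string xhtml = "<html><body><p>One</p><p>Two</p></body></html>";

        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Ignore, // skip any DOCTYPE instead of throwing
            IgnoreWhitespace = true
        };

        using (var reader = XmlReader.Create(new StringReader(xhtml), settings))
        {
            // Forward-only: only the current node is available at each step.
            while (reader.Read())
            {
                var indent = new string(' ', reader.Depth * 2);
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        Console.WriteLine(indent + "<" + reader.Name + ">");
                        break;
                    case XmlNodeType.Text:
                        Console.WriteLine(indent + reader.Value);
                        break;
                }
            }
        }
    }
}
```

Note that this throws an `XmlException` as soon as it meets an unclosed tag, and that HTML named entities such as `&nbsp;` are not declared when the DTD is ignored, so it only works for input that is valid XML.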

jgauffin answered Oct 29 '22 17:10
