Is there an XmlReader equivalent for HTML in .Net?

Question

I've used HtmlAgilityPack in the past to parse HTML in .Net, but I don't like the fact that it only offers a DOM-based model.

On large documents, or on documents with deep nesting, it is possible to hit stack overflow or out-of-memory exceptions. In general, a DOM-based parsing model also uses significantly more memory than a streaming approach, which is wasteful when the process consuming the HTML only needs a few elements to be available at a time.

Does anyone know of a decent HTML parser for .Net that lets you parse HTML in a manner similar to the XmlReader class, i.e. in a forward-only, streaming manner?

Mike Mooney · Accepted Answer

I usually use SgmlReader for this: https://github.com/MindTouch/SGMLReader

Like others have said, HTML doesn't follow the same well-formedness rules as XML, so it is inherently difficult to parse, but SgmlReader usually does a pretty good job.
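For illustration, here is a minimal sketch of how SgmlReader can be driven like an XmlReader. It assumes the SgmlReader package is referenced and exposes the `Sgml.SgmlReader` class with the `DocType`, `CaseFolding` and `InputStream` properties; the sample HTML string and the class name `SgmlReaderSketch` are made up for the example.

```csharp
using System;
using System.IO;
using System.Xml;
using Sgml; // assumption: the SgmlReader package is referenced by the project

static class SgmlReaderSketch
{
    static void Main()
    {
        // Deliberately sloppy HTML: unclosed <P> and <BR>, upper-case names.
        const string html =
            "<HTML><BODY><P>First<BR>line<P><A HREF='a.html'>link</A></BODY></HTML>";

        using (var sgml = new SgmlReader())
        {
            sgml.DocType = "HTML";                  // ask SgmlReader to apply its HTML rules
            sgml.CaseFolding = CaseFolding.ToLower; // fold names to lower case
            sgml.InputStream = new StringReader(html);

            // SgmlReader derives from XmlReader, so the usual forward-only
            // pull loop applies: one node at a time, no DOM is built.
            while (sgml.Read())
            {
                if (sgml.NodeType == XmlNodeType.Element && sgml.LocalName == "a")
                {
                    Console.WriteLine(sgml.GetAttribute("href"));
                }
            }
        }
    }
}
```

Because the reader is an XmlReader, it can also be handed to any API that accepts one, although doing so with something like `XmlDocument.Load` would bring back the DOM memory cost the question is trying to avoid.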

jgauffin · Answer

The problem is that HTML can be malformed, and you can't know which tag is missing its end tag (or which tags are nested in the wrong order) until you have parsed a larger part of the document.

If the documents that you'll be parsing are well-formed, why don't you use XmlReader?
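As a rough sketch of that suggestion, the following uses the standard `System.Xml.XmlReader` to pull link targets out of well-formed markup in a single forward-only pass. The helper name `ReadLinks` and the sample markup are hypothetical, and the input is assumed to have no DOCTYPE and no HTML-only entities such as `&nbsp;`.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class XhtmlLinks
{
    // Hypothetical helper: collects href values from <a> elements
    // while streaming through the input once.
    public static List<string> ReadLinks(string xhtml)
    {
        var links = new List<string>();
        using (var reader = XmlReader.Create(new StringReader(xhtml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "a")
                {
                    var href = reader.GetAttribute("href");
                    if (href != null)
                    {
                        links.Add(href);
                    }
                }
            }
        }
        return links;
    }

    static void Main()
    {
        const string xhtml =
            "<html><body><a href=\"a.html\">A</a><p><a href=\"b.html\">B</a></p></body></html>";

        foreach (var href in ReadLinks(xhtml))
        {
            Console.WriteLine(href);
        }
    }
}
```

This only works while the input really is well-formed: markup such as `<p>one<br>two</p>` makes `Read()` throw an `XmlException`, which is the limitation described above.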

Tags:

html

.net

parsing

html-agility-pack

xmlreader

RobV
