Writing an HTML Parser

Tags: html, parsing, html-parsing

I am attempting (or at least planning) to write the simplest possible program to parse an HTML document into a tree.

After googling, I have found many answers saying "don't do it, it's been done" (or words to that effect), references to examples of HTML parsers, and a rather emphatic article on why one shouldn't use regular expressions. However, I haven't found any guides on the "right" way to write a parser. (This, by the way, is something I'm attempting more as a learning exercise than anything, so I'd quite like to do it myself rather than use a premade one.)

I believe I could make a working XML parser just by reading the document and adding the tags, text, etc. to the tree, stepping up a level whenever I hit a close tag (again, simple: no fancy threading or efficiency required at this stage). However, in HTML not all tags are closed.
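
The XML-style approach described above could be sketched roughly as follows. This is only an illustration under simplifying assumptions: `Node` and `parse_strict` are made-up names, attributes are discarded, and comments, doctypes, self-closing tags and `>` inside attribute values are not handled.

```python
class Node:
    """A tree node; children holds child Nodes and plain text strings."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


def parse_strict(source):
    """Build a tree from markup in which every open tag has a close tag."""
    root = Node("#document")
    current = root
    pos = 0
    while pos < len(source):
        lt = source.find("<", pos)
        text_end = len(source) if lt == -1 else lt
        text = source[pos:text_end].strip()
        if text:
            current.children.append(text)
        if lt == -1:
            break
        gt = source.find(">", lt)
        if gt == -1:
            raise ValueError("unterminated tag")
        inner = source[lt + 1:gt].strip()
        if inner.startswith("/"):
            # Close tag: it must match the current node, then step up a level.
            name = inner[1:].strip()
            if current.tag != name:
                raise ValueError("unexpected close tag: " + name)
            current = current.parent
        else:
            # Open tag: keep only the tag name, add a child and step down.
            parts = inner.split()
            if not parts:
                raise ValueError("empty tag")
            child = Node(parts[0], current)
            current.children.append(child)
            current = child
        pos = gt + 1
    if current is not root:
        raise ValueError("unclosed tag: " + current.tag)
    return root
```

Feeding it something like `<a><b>hi</b><c></c></a>` yields a `#document` root with one `a` child, which in turn holds `b` and `c`. Anything with an unclosed tag raises `ValueError`, which is exactly where HTML needs extra rules.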

So my question is this: what would you recommend as a way of dealing with this? The only idea I've had is to treat it in a similar way to the XML but keep a list of tags that aren't necessarily closed, each with its conditions for closure (e.g. <p> ends on </p> or on the next <p> tag).
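
One possible shape for such a list, as a rough sketch: the table below is a small illustrative subset and not a complete or authoritative set of HTML rules, and the names `VOID_TAGS`, `CLOSED_BY_NEXT` and `open_tag` are invented for this example.

```python
# Tags that never have a close tag or content (illustrative subset).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

# For each tag whose close tag may be missing: the incoming open tags
# that implicitly close it (illustrative subset).
CLOSED_BY_NEXT = {
    "p": {"p", "div", "ul", "ol", "table", "h1", "h2", "h3"},
    "li": {"li"},
    "dt": {"dt", "dd"},
    "dd": {"dt", "dd"},
    "tr": {"tr"},
    "td": {"td", "th", "tr"},
    "th": {"td", "th", "tr"},
    "option": {"option"},
}


def open_tag(stack, tag):
    """Update the list of currently open tag names for an incoming open tag."""
    # Pop every open tag that the incoming tag implicitly closes.
    while stack and tag in CLOSED_BY_NEXT.get(stack[-1], ()):
        stack.pop()
    # Void tags are never left open, so they are not pushed.
    if tag not in VOID_TAGS:
        stack.append(tag)
    return stack
```

For instance, starting from an empty list and feeding it `ul`, `li`, `br`, `li` in turn leaves `["ul", "li"]`: the `br` is never pushed and the second `li` closes the first.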

Has anyone any other (hopefully better) suggestions? Is there a better way of doing this altogether?

asked Aug 25 '11 14:08

James

2 Answers

The looseness of HTML can be accommodated by figuring out the missing open and close tags as needed. This is essentially what a validator like tidy does.

You'll keep a stack (perhaps implicitly with a tree) of the current context. For example, {<html>, <body>} means you're currently in the body of the html document. When you encounter a new node, you compare the requirements for that node to what's currently on the stack.

Suppose your stack is currently just {<html>}. You encounter a <p> tag. You look up <p> in a table that tells you a paragraph must be inside the <body>. Since you're not in the body, you implicitly push <body> onto your stack (or add a body node to your tree). Then you can put the <p> into the tree.

Now suppose you see another <p>. Your rules tell you that you cannot nest a paragraph within a paragraph, so you know you have to pop the current <p> off the stack (as though you had seen a close tag) before pushing the new paragraph onto the stack.

At the end of your document, you pop each remaining element off your stack, as though you had seen a close tag for each one.

The trick is to find a good way to represent the context requirements for each element.
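
A minimal sketch of the stack approach described above, assuming the input has already been split into `("open", name)`, `("close", name)` and `("text", string)` tokens. The two rule tables and all names here (`REQUIRED_ANCESTOR`, `CANNOT_NEST_IN_SELF`, `build_tree`) are hypothetical and cover only a handful of elements.

```python
# Hypothetical context requirements: which ancestor each tag needs.
REQUIRED_ANCESTOR = {"body": "html", "p": "body", "h1": "body", "ul": "body"}
# Tags that may not be nested inside themselves.
CANNOT_NEST_IN_SELF = {"p", "li"}


def build_tree(tokens):
    """Build a tree of dicts from tokens, inferring missing tags."""
    root = {"tag": "#document", "children": []}
    stack = [root]  # the current context, innermost element last

    def open_tags():
        return [node["tag"] for node in stack]

    def push(tag):
        node = {"tag": tag, "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)

    def pop_through(tag):
        # Pop elements until the named one has been popped.
        while stack.pop()["tag"] != tag:
            pass

    def ensure_ancestors(tag):
        # Implicitly open any required ancestor that is missing.
        needed = REQUIRED_ANCESTOR.get(tag)
        if needed and needed not in open_tags():
            ensure_ancestors(needed)
            push(needed)

    for kind, value in tokens:
        if kind == "text":
            stack[-1]["children"].append(value)
        elif kind == "open":
            if value in CANNOT_NEST_IN_SELF and value in open_tags():
                pop_through(value)  # as though a close tag had been seen
            ensure_ancestors(value)
            push(value)
        elif kind == "close" and value in open_tags():
            pop_through(value)
        # A close tag with no matching open tag is ignored.
    # End of document: whatever is still on the stack is implicitly closed.
    return root
```

With the tokens for `<html><p>one<p>two`, this produces `html` containing an inferred `body`, which contains two sibling `p` nodes. Real HTML needs far richer rules than two small tables, which is the "trick" mentioned above.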

answered Oct 20 '22 09:10

Adrian McCarthy

so, I'll try for an answer here -

Basically, what makes "plain" HTML parsing (not talking about valid XHTML here) different from XML parsing is loads of rules, like <img> tags that are never closed, or, strictly speaking, the fact that even the sloppiest HTML markup will still render somehow in a browser. You will need a validator along with the parser to build your tree. But you'll have to decide which HTML standard you want to support, so that when you come across a flaw in the markup, you'll know it's an error and not just sloppy HTML.

Know all the rules, build a validator, and then you'll be able to build a parser. That's Plan A.

Plan B would be to build a certain error tolerance into your parser, which would make the validation step unnecessary. For example, parse all the tags and put them in a list, omitting any attributes, so that you can easily operate on the list to determine whether a tag was left open or was never opened at all. You eventually get a "good" layout tree, which is an approximation for sloppy markup and exact for correct markup.
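
Plan B could look something like the sketch below. The function names `list_tags` and `find_mismatches` are invented for illustration, and the scanner is deliberately naive: it skips comments and doctypes by their first character and would be confused by a `>` inside an attribute value or a comment.

```python
def list_tags(source):
    """Return [("open" | "close", name), ...] with attributes dropped."""
    tags = []
    pos = 0
    while True:
        lt = source.find("<", pos)
        if lt == -1:
            break
        gt = source.find(">", lt)
        if gt == -1:
            break
        inner = source[lt + 1:gt].strip()
        pos = gt + 1
        if not inner or inner[0] in "!?":
            continue  # skip comments, doctypes and processing instructions
        kind = "close" if inner.startswith("/") else "open"
        parts = inner.lstrip("/").rstrip("/").split()
        if parts:
            tags.append((kind, parts[0].lower()))
    return tags


def find_mismatches(tags):
    """Return (left_open, never_opened) lists of tag names."""
    open_stack = []
    left_open = []
    never_opened = []
    for kind, name in tags:
        if kind == "open":
            open_stack.append(name)
        elif name in open_stack:
            # Everything opened after the matching tag was left open.
            while open_stack[-1] != name:
                left_open.append(open_stack.pop())
            open_stack.pop()
        else:
            never_opened.append(name)
    # Tags still open at the end of the document were never closed.
    left_open.extend(reversed(open_stack))
    return left_open, never_opened
```

The `left_open` list tells the tree builder where to insert implied close tags, and `never_opened` lists stray close tags that can be dropped. Note that this sketch also reports void tags such as <img> as left open, so a real version would need to treat those separately.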

hope that helped!

answered Oct 20 '22 08:10

Andreas Grapentin
