I am currently attempting (or planning to attempt) to write as simple a program as possible that parses an HTML document into a tree.
After googling I have found many answers saying "don't do it, it's been done" (or words to that effect), references to examples of HTML parsers, and a rather emphatic article on why one shouldn't use regular expressions. However, I haven't found any guides on the "right" way to write a parser. (This, by the way, is something I'm attempting more as a learning exercise than anything else, so I'd quite like to do it myself rather than use a premade one.)
I believe I could make a working XML parser just by reading through the document and adding the tags, text, etc. to the tree, stepping up a level whenever I hit a close tag (again, keeping it simple; no fancy threading or efficiency required at this stage). However, for HTML not all tags are closed.
So my question is this: what would you recommend as a way of dealing with this? The only idea I've had is to treat it in a similar way to the XML case, but keep a list of tags that aren't necessarily closed, each with its own conditions for closure (e.g. <p> ends on </p> or on the next <p> tag).
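Roughly the kind of thing I have in mind (a minimal sketch; the Node class, the token format, and the two tag tables are placeholders I've made up, not from any library):

```python
# Hypothetical sketch: a tree builder where some tags may close implicitly.
# `tokens` is assumed to be an iterable of ("start"|"end"|"text", value) pairs
# produced by a separate tokenizer (not shown here).

class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []

# Tags that are implicitly closed when the same tag starts again.
AUTO_CLOSE_ON_SAME = {"p", "li", "td", "tr", "option"}
# Tags that never have a closing tag at all.
VOID_TAGS = {"img", "br", "hr", "meta", "input"}

def build_tree(tokens):
    root = Node("#document")
    current = root
    for kind, value in tokens:
        if kind == "start":
            if value in AUTO_CLOSE_ON_SAME and current.tag == value:
                current = current.parent          # implicit close, e.g. <p><p>
            node = Node(value, parent=current)
            current.children.append(node)
            if value not in VOID_TAGS:
                current = node                    # descend unless it's a void tag
        elif kind == "end":
            # Walk up until we find the matching open tag (ignore stray end tags).
            n = current
            while n is not root and n.tag != value:
                n = n.parent
            if n is not root:
                current = n.parent
        else:  # text
            current.children.append(value)
    return root
```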
Has anyone any other (hopefully better) suggestions? Is there a better way of doing this altogether?
HTML parsing involves two stages: tokenization and tree construction. HTML tokens include start tags, end tags, attribute names, and attribute values. If the document is well-formed, parsing it is straightforward and fast. The parser consumes the tokenized input and builds up the document tree.
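As a very rough illustration of the tokenization stage (a deliberately simplified sketch, nowhere near the real HTML5 tokenizer state machine: it ignores comments, scripts, and '>' characters inside attribute values):

```python
def tokenize(source):
    """Yield ("start"|"end"|"text", value) pairs from an HTML string.
    Simplified: no comments, no CDATA, no '>' inside attribute values."""
    i = 0
    while i < len(source):
        if source[i] == "<":
            close = source.find(">", i)
            if close == -1:
                break                      # unterminated tag; stop
            tag = source[i + 1:close].strip()
            i = close + 1
            if not tag:
                continue                   # skip a bare "<>"
            if tag.startswith("/"):
                yield ("end", tag[1:].strip().lower())
            else:
                name = tag.split()[0].rstrip("/").lower()  # drop attributes
                yield ("start", name)
        else:
            nxt = source.find("<", i)
            if nxt == -1:
                nxt = len(source)
            text = source[i:nxt]
            if text.strip():
                yield ("text", text)
            i = nxt
```

Its output matches the (kind, value) pairs assumed by the tree-building sketch in the question above.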
lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language.
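For comparison with writing your own parser, this is roughly what lxml's HTML mode looks like in practice; it repairs missing close tags on its own (lxml.html.fromstring is the real API, the snippet itself is just a toy example):

```python
from lxml import html

# lxml's HTML parser tolerates unclosed tags and still builds a tree.
tree = html.fromstring("<html><body><p>first<p>second")
for p in tree.findall(".//p"):
    print(p.text)          # expected output: "first" then "second"
```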
In terms of raw speed, the best performers are Go and C, with very similar results. Python (via libxml2) performs fairly well, and Ruby's speed is similar to Python's. The Java parser tested was slower.
The looseness of HTML can be accommodated by figuring out the missing open and close tags as needed. This is essentially what a cleanup tool like HTML Tidy does.
You'll keep a stack (perhaps implicitly, via the tree) of the current context. For example, {<html>, <body>} means you're currently in the body of the HTML document. When you encounter a new node, you compare the requirements for that node to what's currently on the stack.

Suppose your stack is currently just {<html>}. You encounter a <p> tag. You look up <p> in a table that tells you a paragraph must be inside the <body>. Since you're not in the body, you implicitly push <body> onto your stack (or add a body node to your tree). Then you can put the <p> into the tree.

Now suppose you see another <p>. Your rules tell you that you cannot nest a paragraph within a paragraph, so you know you have to pop the current <p> off the stack (as though you had seen a close tag) before pushing the new paragraph onto the stack.
At the end of your document, you pop each remaining element off your stack, as though you had seen a close tag for each one.
The trick is to find a good way to represent the context requirements for each element.
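One tiny, illustrative way to encode such context requirements in Python (the table contents below are made up for the example, not lifted from the HTML spec):

```python
# Illustrative sketch of the stack-plus-rules idea described above.
# The two tables are placeholders; a real parser needs far more entries.

REQUIRED_PARENT = {          # element -> the parent it must sit inside
    "body": "html",
    "p":    "body",
    "li":   "ul",
}
CANNOT_NEST_IN_SELF = {"p", "li"}   # seeing one of these closes the previous one

def open_element(stack, tag):
    """Push `tag`, first fixing up the stack so the context is legal."""
    # Implicitly close an element that can't contain another of its own kind.
    if tag in CANNOT_NEST_IN_SELF and stack and stack[-1] == tag:
        stack.pop()
    # Implicitly open any required ancestor that isn't there yet.
    parent = REQUIRED_PARENT.get(tag)
    if parent is not None and parent not in stack:
        open_element(stack, parent)
    stack.append(tag)

stack = ["html"]
open_element(stack, "p")     # -> ["html", "body", "p"]
open_element(stack, "p")     # -> ["html", "body", "p"]  (old <p> popped first)
print(stack)
```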
So, I'll try for an answer here:
Basically, what makes "plain" HTML parsing (not talking about valid XHTML here) different from XML parsing is a load of rules, such as <img> tags that are never closed, or, more generally, the fact that even the sloppiest HTML markup will somehow render in a browser. You will need a validator alongside the parser to build your tree, and you'll have to decide on a standard of HTML you want to support, so that when you come across a weakness in the markup you'll know whether it's an error or just sloppy HTML.
Know all the rules, build a validator, and then you'll be able to build a parser. That's Plan A.
Plan B would be to allow for a certain error-resistance in your parser, which would make the validation step unnecessary. For example, parse all the tags and put them in a list, omitting any attributes, so that you can easily operate on the list and determine whether a tag is left open or was never opened at all. That way you eventually get a "good" layout tree, which will be an approximate solution for sloppy markup while being exact for correct markup.
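Python's standard-library html.parser is already error-tolerant in exactly this sense: it reports start and end tags as events without requiring them to balance. So one cheap way to experiment with Plan B is to collect those events and check the balance yourself; the checker below is only an illustration:

```python
from html.parser import HTMLParser

class TagBalanceChecker(HTMLParser):
    """Collect start/end tag events and report unclosed / unopened tags."""

    VOID_TAGS = {"img", "br", "hr", "meta", "input", "link"}

    def __init__(self):
        super().__init__()
        self.open_tags = []      # the stack of currently open tags
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID_TAGS:        # attributes are simply ignored
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # implicitly close anything opened after the matching start tag
            while self.open_tags and self.open_tags[-1] != tag:
                self.problems.append(f"<{self.open_tags.pop()}> left open")
            self.open_tags.pop()
        else:
            self.problems.append(f"</{tag}> was never opened")

checker = TagBalanceChecker()
checker.feed("<html><body><p>one<p>two</b></body></html>")
print(checker.problems)   # e.g. ["</b> was never opened", "<p> left open", ...]
print(checker.open_tags)  # whatever is still unclosed at the end
```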
Hope that helped!