
How do HTML parsers work?

I've seen the humorous threads and read the warnings, and I know that you don't parse HTML with regex. Don't worry... I'm not planning on trying it.

BUT... that leads me to ask: how are HTML parsers coded (including the built-in functions of programming languages, like DOM parsers and PHP's strip_tags)? What mechanism do they employ to parse the (sometimes malformed) markup?

I found the source of one coded in JavaScript, and it actually uses regex to do the job:

// Regular Expressions for parsing tags and attributes
var startTag = /^<(\w+)((?:\s+\w+(?:\s*=\s*(?:(?:"[^"]*")|(?:'[^']*')|[^>\s]+))?)*)\s*(\/?)>/,
    endTag = /^<\/(\w+)[^>]*>/,
    attr = /(\w+)(?:\s*=\s*(?:(?:"((?:\\.|[^"])*)")|(?:'((?:\\.|[^'])*)')|([^>\s]+)))?/g;  

Do they all do this? Is there a conventional, standard way to code an HTML parser?

Peter asked Feb 18 '11 06:02


1 Answer

I don’t know that this style is a “normal” way to do things. It is better than most I’ve seen, but it’s still too close to what I refer to as a “naïve” approach in this answer. For one thing, it doesn’t account for HTML comments getting in the way. There are also legal but somewhat obscure matters, such as entities, that it doesn’t deal with. But HTML comments are where most such approaches fall down.
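To make the comment problem concrete, here is a small sketch. The pattern `naiveTag` is a deliberately simplified, hypothetical one (it is not the regex from the question); it only illustrates how a tag-matching regex that knows nothing about comments will happily report markup that has been commented out.

```javascript
// Hypothetical naive pattern: any "<name ...>" counts as a start tag.
const naiveTag = /<(\w+)[^>]*>/g;

const html = '<p>real</p><!-- <b>commented out</b> -->';

// Collect the tag names the naive pattern believes are present.
const found = [...html.matchAll(naiveTag)].map(m => m[1]);

console.log(found); // ['p', 'b'] even though <b> sits inside a comment
```

A parser that recognizes the comment first, as one unit, never gets the chance to misread its contents as tags.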

A more natural way is to use a lexer to peel off tokens, much as shown in this answer’s script, and then assemble those tokens meaningfully. The lexer can easily be made aware of HTML comments.
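As a rough illustration of the lexer idea, here is a minimal sketch in JavaScript. It is not the script referred to above and it is nowhere near spec-compliant: the function name `tokenize`, the token shapes, and the two patterns are all assumptions made for this example. The point is only that the lexer checks for a comment before it looks for tags, so commented-out markup comes back as a single `comment` token.

```javascript
// Sticky (y-flag) patterns match only at lastIndex, so no substring copies are needed.
const END_TAG = /<\/([A-Za-z][\w-]*)[^>]*>/y;
const START_TAG = /<([A-Za-z][\w-]*)((?:"[^"]*"|'[^']*'|[^>"'])*)>/y;

// Try a sticky pattern at a given offset.
function matchAt(re, str, pos) {
  re.lastIndex = pos;
  return re.exec(str);
}

// Hypothetical minimal lexer: comments, start tags, end tags, and text.
function tokenize(html) {
  const tokens = [];
  let pos = 0;
  while (pos < html.length) {
    // Comments are recognized first, so their contents are never lexed as tags.
    if (html.startsWith('<!--', pos)) {
      const end = html.indexOf('-->', pos + 4);
      const stop = end === -1 ? html.length : end; // unterminated comment runs to the end
      tokens.push({ type: 'comment', value: html.slice(pos + 4, stop) });
      pos = end === -1 ? html.length : end + 3;
      continue;
    }
    let m;
    if ((m = matchAt(END_TAG, html, pos))) {
      tokens.push({ type: 'endTag', name: m[1].toLowerCase() });
      pos += m[0].length;
    } else if ((m = matchAt(START_TAG, html, pos))) {
      // Attributes are kept as raw text; splitting them up is a later step.
      tokens.push({ type: 'startTag', name: m[1].toLowerCase(), rawAttrs: m[2].trim() });
      pos += m[0].length;
    } else {
      // Text runs to the next '<'; a stray '<' is itself treated as text.
      const next = html.indexOf('<', pos + 1);
      const stop = next === -1 ? html.length : next;
      tokens.push({ type: 'text', value: html.slice(pos, stop) });
      pos = stop;
    }
  }
  return tokens;
}

const tokens = tokenize('<p class="a">hi<!-- <b>x</b> --></p>');
console.log(tokens.map(t => t.type)); // ['startTag', 'text', 'comment', 'endTag']
```

Regexes still appear here, but only to recognize one token at a known position. Assembling the tokens into a tree, and recovering from malformed input, would be the job of a separate stage.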

You could approach this with a full grammar, such as the one shown here for parsing an RFC 5322 mail address. That is the sort of approach I take in the second, “wizardly” solution in this answer. But even that is only a complete grammar for well-formed HTML, and I’m only interested in a few different sorts of tags. Those I define fully, but I don’t define valid fields for tags I’m unconcerned with.
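To show what “a grammar for well-formed markup only” means in practice, here is a small sketch. It swaps in a hand-written recursive-descent parser in JavaScript rather than the regex grammar described above, and it covers a tiny, hypothetical subset (elements without attributes, plus text). The function name `parseWellFormed` and the node shapes are assumptions for illustration.

```javascript
// Grammar (hypothetical, attribute-free subset):
//   content := (element | text)*
//   element := '<' NAME '>' content '</' NAME '>'
//   text    := one or more characters other than '<'
function parseWellFormed(input) {
  const NAME = /[A-Za-z][A-Za-z0-9]*/y; // sticky: matches only at lastIndex
  let pos = 0;

  // Consume a literal string or fail.
  function expect(str) {
    if (!input.startsWith(str, pos)) {
      throw new SyntaxError(`expected "${str}" at offset ${pos}`);
    }
    pos += str.length;
  }

  function parseName() {
    NAME.lastIndex = pos;
    const m = NAME.exec(input);
    if (!m) throw new SyntaxError(`expected a tag name at offset ${pos}`);
    pos += m[0].length;
    return m[0];
  }

  function parseText() {
    const start = pos;
    while (pos < input.length && input[pos] !== '<') pos++;
    return { type: 'text', value: input.slice(start, pos) };
  }

  function parseElement() {
    expect('<');
    const name = parseName();
    expect('>');
    const children = parseContent();
    expect('</');
    const closing = parseName();
    if (closing !== name) throw new SyntaxError(`<${name}> closed by </${closing}>`);
    expect('>');
    return { type: 'element', name, children };
  }

  // Content stops at end of input or at a closing tag owned by the caller.
  function parseContent() {
    const nodes = [];
    while (pos < input.length && !input.startsWith('</', pos)) {
      nodes.push(input[pos] === '<' ? parseElement() : parseText());
    }
    return nodes;
  }

  const nodes = parseContent();
  if (pos < input.length) throw new SyntaxError(`unexpected closing tag at offset ${pos}`);
  return nodes;
}

const tree = parseWellFormed('<ul><li>a</li><li>b</li></ul>');
console.log(tree[0].name, tree[0].children.length); // ul 2
```

Given misnested input such as `<b><i>x</b></i>`, this sketch throws a `SyntaxError` instead of guessing what was meant, which is the limitation noted above: the grammar describes well-formed input and nothing else.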

tchrist answered Sep 21 '22 15:09