Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do HTML parses work if they're not using regexp?

I see questions every day asking how to parse or extract something from some HTML string and the first answer/comment is always "Don't use RegEx to parse HTML, lest you feel the wrath!" (that last part is sometimes omitted).

This is rather confusing for me, I always thought that in general, the best way to parse any complicated string is to use a regular expression. So how does a HTML parser work? Doesn't it use regular expressions to parse.

One particular argument for using a regular expression is that there's not always a parsing alternative (such as JavaScript, where DOMDocument isn't a universally available option). jQuery, for instance, seems to manage just fine using a regex to convert a HTML string to DOM nodes.

Not sure whether or not to CW this, it's a genuine question that I want to be answered and not really intended to be a discussion thread.

like image 798
Andy E Avatar asked Mar 08 '10 10:03

Andy E


People also ask

Should I use regex to parse HTML?

HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts.

How does parsing work in HTML?

HTML parsing involves tokenization and tree construction. HTML tokens include start and end tags, as well as attribute names and values. If the document is well-formed, parsing it is straightforward and faster. The parser parses tokenized input into the document, building up the document tree.

Why you should not use regex?

Regex isn't suited to parse HTML because HTML isn't a regular language. Regex probably won't be the tool to reach for when parsing source code. There are better tools to create tokenized outputs. I would avoid parsing a URL's path and query parameters with regex.

Do parsers use regex?

Context free parsers often use regular expressions to first break the input into chunks (spaces, identifiers, punctuation, quoted strings) and then use a grammar to turn that stream of chunks into a tree form.


1 Answers

So how does a HTML parser work? Doesn't it use regular expressions to parse?

Well, no.

If you reach back in your brain to a theory of computation course, if you took one, or a compilers course, or something similar, you may recall that there are different kinds of languages and computational models. I'm not qualified to go into all the details, but I can review a few of the major points with you.

The simplest type of language & computation (for these purposes) is a regular language. These can be generated with regular expressions, and recognized with finite automata. Basically, that means that "parsing" strings in these languages use state, but not auxiliary memory. HTML is certainly not a regular language. If you think about it, the list of tags can be nested arbitrarily deeply. For example, tables can contain tables, and each table can contain lots of nested tags. With regular expressions, you may be able to pick out a pair of tags, but certainly not anything arbitrarily nested.

A classic simple language that is not regular is correctly matched parentheses. Try as you might, you will never be able to build a regular expression (or finite automaton) that will always work. You need memory to keep track of the nesting depth.

A state machine with a stack for memory is the next strength of computational model. This is called a push-down automaton, and it recognizes languages generated by context-free grammars. Here, we can recognize correctly matched parentheses--indeed, a stack is the perfect memory model for it.

Well, is this good enough for HTML? Sadly, no. Maybe for super-duper carefully validated XML, actually, in which all the tags always line up perfectly. In real-world HTML, you can easily find snippets like <b><i>wow!</b></i>. This obviously doesn't nest, so in order to parse it correctly, a stack is just not powerful enough.

The next level of computation is languages generated by general grammars, and recognized by Turing machines. This is generally accepted to be effectively the strongest computational model there is--a state machine, with auxiliary memory, whose memory can be modified anywhere. This is what programming languages can do. This is the level of complexity where HTML lives.

To summarize everything here in one sentence: to parse general HTML, you need a real programming language, not a regular expression.

HTML is parsed the same way other languages are parsed: lexing and parsing. The lexing step breaks down the stream of individual characters into meaningful tokens. The parsing step assembles the tokens, using states and memory, into a logically coherent document that can be acted on.

like image 99
JXG Avatar answered Sep 27 '22 02:09

JXG