Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

The fundamental reason why regex and HTML don't mix? The theory behind it?

To start with, I cannot do anything but refer to what I believe is the most famous SO post ever:

RegEx match open tags except XHTML self-contained tags

Now, is it even a question for StackOverflow? I don't know, but I'll try...

I'll speak from a personal point of view. While I've never had to do that, I know that the day I have to parse HTML, I will certainly not go with regexes; I'll try and find an HTML parsing library. Fine.

But I don't know why.

At one point, I decided to do CSS validation in Java. I knew "by the guts" that regexes wouldn't cut it, so I used Parboiled.

And I don't know why.

The "why" troubles me. I am no newbie with regexes at all. I just can't put a clear line between what regex engines can, and cannot do.

My question is the following: what is this clear line? What fundamental characteristic of an input must exist so that it is mathematically demonstrated that any regex engine cannot reliably determine success and failure?

Can you give a simple, theoretical input which would spell failure as to a regex engine's ability to give a reliable "match/no match" answer? If yes, what is the defining characteristic of such an input?

EDIT For the sake of this discussion, I'll add a task suggested by a post on SO (which I can't find the link to at the moment, sorry) which is simpler than HTML, but for which I won't use regexes: shell command line parsing.

As far as the shell is concerned, those are equivalent:

alias ll="ls -l"
alias ll=ls\ -l
alias l"l"=ls' -'l
"alia"s l"l= "ls\ -l

Shell quoting mechanisms are so numerous that I'll just create a Parboiled grammar in this case... But this is "out of my guts". Because I find it easier probably... But that doesn't prove that this is not feasible with regexes.

like image 697
fge Avatar asked Jun 11 '13 22:06

fge


People also ask

Why can't HTML be parsed with regex?

Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts.

Can you use regex in HTML?

While arbitrary HTML with only a regex is impossible, it's sometimes appropriate to use them for parsing a limited, known set of HTML. If you have a small set of HTML pages that you want to scrape data from and then stuff into a database, regexes might work fine.

Why is HTML not a regular language?

HTML is used for structural purposes on a web page, not functional ones. Programming languages have functional purposes. HTML, as a markup language doesn't really “do” anything in the sense that a programming language does. HTML contains no programming logic.

Why is regex so complicated?

Regular expressions are dense. This makes them hard to read, but not in proportion to the information they carry. Certainly 100 characters of regular expression syntax is harder to read than 100 consecutive characters of ordinary prose or 100 characters of C code.


2 Answers

Regular expressions can determine regular languages. But HTML is not a regular language. It is a context-free language. Context-free languages are a superset of regular languages.

Basically any language that can have recursive elements in it is not regular. Regular languages have to be "flat", so there can be no nesting. In HTML, for example, one <div> can be nested inside another, and there is no limit to the depth they can be nested. It is that type of general nesting that regular expressions can't deal with.

like image 198
recursive Avatar answered Sep 28 '22 17:09

recursive


Regular expressions are mostly to match a given pattern against an input string and see if that succeeds. That's their primary goal. RE libraries offer additional features like getting subparts of an input string based on the match, but that's feasible only for few parts. If you are going to need a full representation of your input you need a parse tree. Every parser can easily generate this for you, since this is one of their tasks. With RE you have too do this manually.

Another point is the complexity of your expression if you would use regular expressions. Difficult to test for errors and you mostly get all or nothing, either it matches successfully (and you get your desired info) or you get nothing and have to find what's wrong with it. Using a parser generator you can interactively build your grammar to get more and more info, not to mention that you probably find an HTML grammar for every relevant parser out there already.

Finally, don't forget feedback for invalid input. With RE you get nothing. With a parser you get error messages that point you to the actual problem. Some parsers (like those generated by ANTLR) even can cope with simple syntax errors and still generate a usable parse tree for you.

like image 41
Mike Lischke Avatar answered Sep 28 '22 16:09

Mike Lischke