Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the optimal regex for parsing HTML (even though you shouldn't)? Is there a perfect one? [closed]

Okay, we all know attempting to parse HTML with Regex brings upon the wrath of Cthulhu. Quite well. And there are some great responses as to why you shouldn't. I accept these, and have posted these links on questions more than once.

But let's put this question within the following scope: we have no option other than Regex to parse HTML. Why? It doesn't matter. But assume for the moment our developers want to lose their minds to Tony the Pony and take the best shot at doing the impossible. If this blows your mind, assume the question to be theoretical then. Whatever floats your boat. Just consider the idea of parsing HTML with regex, even though you shouldn't.

Here we see a claim that it is not possible to do, at least with perfection. But then there's a very wise comment beneath it from @NikiC:

This answer draws the right conclusion ("It's a bad idea to parse HTML with Regex") from wrong arguments ("Because HTML isn't a regular language"). The thing that most people nowadays mean when they say "regex" (PCRE) is well capable not only of parsing context-free grammars (that's trivial actually), but also of context-sensitive grammars (see https://stackoverflow.com/a/7434814/1222420)

Truth is, you can do some incredibly powerful things with modern regex, even if rather verbose. But many make this problem sound like the Halting Problem: you can try, but there will always be another case for which your solution breaks.

So here's the question, and its a bit of a 2-parter.

  • Is it possible to generate a perfect regular expression for parsing HTML?
    • If so, is the proof constructive? Do we only know we can, or has it been done?
  • If it is not possible, what is the most accurate one out there?
like image 354
Nick Avatar asked Aug 21 '12 20:08

Nick


1 Answers

First of all let's get this straight:

Regex' incompatibility with HTML parsing is NOT a claim. Repeat after me: "Not a claim".

It's a scientifically proven and well known fact. Further more the world was not created in 7 days and big-foot ain't real either. End of discussion.


But let's put this question within the following scope: we have no option other than Regex to parse HTML. Why? It doesn't matter

Funny that you write it doesn't matter. Given, that the "why" is actually what makes it either partially possible or completely impossible what you're planning to do. If there was one thing here that mattered, it'd be the "why".

If the "why" is "validation", then the answer is per definition: not possible. Validation requires no less than 100% language coverage. And regular expressions, being a subset of context-free grammars, therefor cannot cover 100%. By definition.

If the "why" however is "extraction", then you can get quite good results using regex. Never 100% reliable, but good enough for most cases.

Truth is, you can do some incredibly powerful things with modern regex, even if rather verbose.

The sheer length, redundancy and complexity of this pattern shows that while it may not be impossible to describe valid email addresses in regex it at least is disproportionately difficult and does actually rather resemble a brute force dictionary list, than a clean grammar. And while we"re at it: date string validation is even worse. Leap years just to begin with.

To put my differentiation between "validation" and "extraction" into perspective:

To validate a simple email address one needs a monolythic 6400+ chararacters long regular expression.

To "extract" the domain name from an email address however the simple @([^\s]+) or (?<=@)[^\s]+ would cover pretty much (if not exactly) 100%. Assuming the string is isolated and known to be a valid email address.

Is it possible to generate a perfect regular expression for parsing HTML?

You basically answered this one yourself by writing "perfect": No.

If so, is the proof constructive? Do we only know we can, or has it been done?

It's not about "is it just that nobody has managed to do it yet?" but more about "it's been mathematically proven to be impossible!". QED

If it is not possible, what is the most accurate one out there?

Given that it's by definition not possible the only correct answer to this would be "none".

The best approximation of a regex for parsing all (or as much as possible) of HTML would be an infinitely long regex pattern along the lines of x|y|z|… with x, y, z … being all (brute forced) possible productions of the grammar of HTML chained together in an infinitely long logical OR. It would be a proper regex (even to the truest terms of regex), cover all of HTML (it lists and hence matches all possible strings after all), be only theoretically possible (or at least feasible, just like the turing machine) and practically utterly useless.


Regex can describe type-3 Chomsky languages (regular languages), while HTML (and most other programming/markup languages) is a type-2 Chomsky language (context-free). The regular languages are a subset of context-free languages. A grammar of type-n can always cover a subset of a language of type-(n-x), but never all of it. Therefor regex can only describe a subset of HTML. Big-foot is a claim. This is a fact.


Being strictly left or right extended regex have NO understanding of balance or nesting (neither "S→aSa" nor the mixed linear "S→aA,A→Sb,S→ε"). Therefor you can NOT parse HTML.

A quick example for "S→aSa" (balanced nesting):

<div>
    <div>
        ...
    <div>
<div>

Yep, the very core of what's HTML/XML is incompatible with regex. Pretty damn bad position to begin with, isn't it? HTML parsing via regex is literally rotten in its core. Broken by design. Guaranteed to fail.

And one for "S→aA,A→Sb,S→ε" (counting):

It is impossible to validate the correct (matching) number of <td> per row:

<table>
    <tr>
        <td>1</td>
        <td>2</td>
        <td>3</td>
        <td>4</td>
    </tr>
    <tr>
        <td>1</td>
        <td>2</td>
        <td>3</td>
        <td>4</td>
    </tr>
</table>

Another thing to keep in mind: as soon as you reduce the scope of what you can recognize of language "X" you're NO LONGER recognizing language "X", but a NEW and self-contained subset language "Y".

In the field of languages it is either all or nothing. There is no in between.


Now to those saying PCRE can do it!: Yes, it's called a context-free grammar then.

…by which it not only would no longer be a regular expression and thus fail the test:

we have no option other than Regex

but also still be the wrong tool to begin with. There are dedicated parsers for such tasks. Use 'em.

The email matching regex (as linked by OP) is a nightmare to read, let alone maintain:

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(
?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]
|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)
?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]) ... (6400+ chars)

While here is an excerpt of the very same specification in form of a proper context-free grammar:

 address     =  mailbox                      ; one addressee
             /  group                        ; named list
 group       =  phrase ":" [#mailbox] ";"
 mailbox     =  addr-spec                    ; simple address
             /  phrase route-addr            ; name & addr-spec
 route-addr  =  "<" [route] addr-spec ">"
 route       =  1#("@" domain) ":"           ; path-relative
 addr-spec   =  local-part "@" domain        ; global address
 local-part  =  word *("." word)             ; uninterpreted
                                             ; case-preserved
 domain      =  sub-domain *("." sub-domain)
 sub-domain  =  domain-ref / domain-literal
 domain-ref  =  atom                         ; symbolic reference

Some people, when confronted with HTML, think "I know, I'll use regular expressions."
Now they have two metric f*cktons of problems.


So tell me. Why on earth would anyone really been far even as decided to use even go want to do look more like?

like image 86
Regexident Avatar answered Oct 27 '22 15:10

Regexident