The fundamental reason why regex and HTML don't mix? The theory behind it?

Tags:

To start with, I cannot do anything but refer to what I believe is the most famous SO post ever:

RegEx match open tags except XHTML self-contained tags

Now, is it even a question for StackOverflow? I don't know, but I'll try...

I'll speak from a personal point of view. While I've never had to do that, I know that the day I have to parse HTML, I will certainly not go with regexes; I'll try and find an HTML parsing library. Fine.

But I don't know why.

At one point, I decided to do CSS validation in Java. I knew "by the guts" that regexes wouldn't cut it, so I used Parboiled.

And I don't know why.

The "why" troubles me. I am no newbie with regexes at all. I just can't put a clear line between what regex engines can, and cannot do.

My question is the following: what is this clear line? What fundamental characteristic of an input must exist so that it is mathematically demonstrated that any regex engine cannot reliably determine success and failure?

Can you give a simple, theoretical input which would spell failure as to a regex engine's ability to give a reliable "match/no match" answer? If yes, what is the defining characteristic of such an input?

EDIT For the sake of this discussion, I'll add a task suggested by a post on SO (which I can't find the link to at the moment, sorry) which is simpler than HTML, but for which I won't use regexes: shell command line parsing.

As far as the shell is concerned, those are equivalent:

alias ll="ls -l"
alias ll=ls\ -l
alias l"l"=ls' -'l
"alia"s l"l= "ls\ -l

Shell quoting mechanisms are so numerous that I'll just create a Parboiled grammar in this case... But this is "out of my guts". Because I find it easier probably... But that doesn't prove that this is not feasible with regexes.

697

asked Jun 11 '13 22:06

fge

2 Answers

Regular expressions can determine regular languages. But HTML is not a regular language. It is a context-free language. Context-free languages are a superset of regular languages.

Basically any language that can have recursive elements in it is not regular. Regular languages have to be "flat", so there can be no nesting. In HTML, for example, one <div> can be nested inside another, and there is no limit to the depth they can be nested. It is that type of general nesting that regular expressions can't deal with.

198

answered Sep 28 '22 17:09

recursive

Regular expressions are mostly to match a given pattern against an input string and see if that succeeds. That's their primary goal. RE libraries offer additional features like getting subparts of an input string based on the match, but that's feasible only for few parts. If you are going to need a full representation of your input you need a parse tree. Every parser can easily generate this for you, since this is one of their tasks. With RE you have too do this manually.

Another point is the complexity of your expression if you would use regular expressions. Difficult to test for errors and you mostly get all or nothing, either it matches successfully (and you get your desired info) or you get nothing and have to find what's wrong with it. Using a parser generator you can interactively build your grammar to get more and more info, not to mention that you probably find an HTML grammar for every relevant parser out there already.

Finally, don't forget feedback for invalid input. With RE you get nothing. With a parser you get error messages that point you to the actual problem. Some parsers (like those generated by ANTLR) even can cope with simple syntax errors and still generate a usable parse tree for you.

answered Sep 28 '22 16:09

Mike Lischke

Related questions
                            
                                How to create a favicon in javascript?
                            
                                How do you set the HTML font size in graphviz?
                            
                                Sublime 2, how to auto close HTML tags and place cursor inside the tag
                            
                                less mixin or selector to change position based on sibling index?
                            
                                <doctype html> is messing up my CSS
                            
                                Globally prevent text selection in Firefox
                            
                                Table Manners - Do I need to be consistent?
                            
                                How to configure TinyMCE to allow block-level elements inside anchor (a) tag?
                            
                                cannot set video.currentTime in html5 video using jquery/javascript
                            
                                CSS - absolute within fixed with overflow for scrolling and 100% height
                            
                                Multiple Views in the same page MVC ASP.NET
                            
                                Trapezoid in HTML with CSS3 Gradient applied to it
                            
                                How to overlay a play button over a video thumbnail image?
                            
                                Fill input file form with JavaScript [duplicate]
                            
                                HTML 5 + Progress element check?
                            
                                Is there a way to copy the path that appears in Chrome's dev tools after Inspecting an Element?
                            
                                lxml.html parsing with XPath and variables
                            
                                use tags <script> with if and else
                            
                                Inline PHP / HTML Ternary If
                            
                                Responsive Bootstrap Sticky Footer Height

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

The fundamental reason why regex and HTML don't mix? The theory behind it?

Tags:

html

regex

pattern-matching

parsing

fge

People also ask

2 Answers

recursive

Mike Lischke

Recent Activity

Donate For Us