Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regular expression that matches regular expressions

Tags:

regex

Has anyone ever tried to describe regular expression that would match regular expressions?

This topic is almost impossible to find on the web due to the repeating keywords.

It is probably unusable in real-world applications, since the languages that support regular expressions usually have a method of parsing them, which we can use for validation, and a method of delimiting the regular expressions in code, which can be used for searching purposes.

But still I am wondering how would a regular expression that matches all regular expressions look like. It should be possible to write one.

like image 859
Tom Pažourek Avatar asked Apr 18 '14 20:04

Tom Pažourek


2 Answers

I don't have a formal proof for this, but I strongly suspect that the language of regular expressions is not itself regular, and therefore not subject to regular expressions¹. This would make a proper regex to represent it impossible.

Why? Well, it can be shown that a language that requires balanced parentheses such as Lisp (or, more famously, HTML) is not regular using the pumping lemma:

The proof that the language of balanced (i.e., properly nested) parentheses is not regular follows the same idea. Given p, there is a string of balanced parentheses that begins with more than p left parentheses, so that y will consist entirely of left parentheses. By repeating y, we can produce a string that does not contain the same number of left and right parentheses, and so they cannot be balanced.

Regular expressions permit nested capture groups, which seem to fall into this category:

Take the example from the previous lesson, if we wanted to capture the image file number along with the filename, I can write the expression ^(IMG(\d+))\.png$.

In any case, this may be a better question for the Computer Science Stack Exchange site.

Edit:

¹tomp points out that PCRE-based regular expression engines (and likely others) are actually able to match all context-free grammars and at least some context-sensitive grammars! That represents a massive difference in expressive power. Assuming the article is correct, pretty cool!

(Of course, whether these extended implementations are still "regular expressions" is up for debate. Since we're on a programming site I'll take the position that they are. On a CS site I'd probably take the opposite position!)

So it may be technically possible to represent regular expressions as a regular expression.

Even so, the task of writing a regex representing all regexes is enormously complex. Consider for comparison the task of validating an email address. Many resources boil this down to something akin to [^@]+@[^@]+, or "as long as there is only one at symbol and at least one character before and one character after it, we're good".

But have a look at this apparently complete regex to validate RFC 822. Is it correct? Who knows. I'm certainly not going to check it.

Having seen this, I wouldn't want to try to write a regex to validate regular expressions.

like image 110
Chris Avatar answered Sep 23 '22 05:09

Chris


I just coded this in a couple of minutes, so don't expect too much...still, it can match a regex in a string.

^([igsmx]{1,})?\/(?=.*?(\\w|\\d|\[.*?\]|\(.*?\))).*?\/([igsmx]{1,})?$

It can be extended, a looooooot...

like image 23
Pedro Lobito Avatar answered Sep 22 '22 05:09

Pedro Lobito