Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Are regular expressions (regex) really regular?

I understand how regular expressions got their name, and have read the related question (Why are regular expressions called "regular" expressions?), but am still wondering whether regular expressions are always regular.

For example, how can back-references be regular? Would that not require some memory and thus be impossible to match/generate by a finite state automaton?

like image 390
Igor Ševo Avatar asked Mar 29 '16 11:03

Igor Ševo


2 Answers

The link in the answer to the question you reference states (in wikipedia), as opposed to many regular expressions engines provided by modern programming languages, which are augmented with features that allow recognition of languages that cannot be expressed by a classic regular expression.

So I would say that the evolution of regex moved it away from it's original idea of expressing regular languages.

From the Wikipedia article on regular expressions:

Many features found in virtually all modern regular expression libraries provide an expressive power that far exceeds the regular languages. For example, many implementations allow grouping subexpressions with parentheses and recalling the value they match in the same expression (backreferences). This means that, among other things, a pattern can match strings of repeated words like "papa" or "WikiWiki", called squares in formal language theory. The pattern for these strings is (.+)\1.

like image 137
SamWhan Avatar answered Sep 21 '22 20:09

SamWhan


Modern extensions including back-references make the regex systems not a candidate of regular languages, however IMO they can be lifted to context-free languages but not to Turing machines.

Regular grammars share a common property called pumping lemma. You can check the example here which proves 0n1n is not a regular grammar (that is quite similar to back-references). Here is how it can be shown that back-references don't satisfy pumping lemma property.

  • Pumping lemma in current context: to show that a regex system is regular grammar, there needs to be a finite length p such that all strings that match the regex and have length equals to or greater than p can be split up in three substrings xyz such that y is not an empty string and all the strings represented by xy*z (y pumped for [0, infinite) times) match with the regex.

  • If we can show no such p can satisfy the conditions for the regex then it is not in regular grammar.

  • For back-references we will need to have two of these pumping strings that too of equal length, one for the sub-pattern in captured group and one in the backreference. This is exactly what the push-down automata or context free languages are. There is also a pumping lemma for context free grammars that is based on splitting into uvwxy where v and x can be pumped equally n times. We can show that the regex with back-references system satisfies this lemma.

like image 30
Utkarsh Bhardwaj Avatar answered Sep 19 '22 20:09

Utkarsh Bhardwaj