I understand how regular expressions got their name, and have read the related question (Why are regular expressions called "regular" expressions?), but am still wondering whether regular expressions are always regular.
For example, how can back-references be regular? Would that not require some memory and thus be impossible to match/generate by a finite state automaton?
The link in the answer to the question you reference states (in wikipedia), as opposed to many regular expressions engines provided by modern programming languages, which are augmented with features that allow recognition of languages that cannot be expressed by a classic regular expression.
So I would say that the evolution of regex moved it away from it's original idea of expressing regular languages.
From the Wikipedia article on regular expressions:
Many features found in virtually all modern regular expression libraries provide an expressive power that far exceeds the regular languages. For example, many implementations allow grouping subexpressions with parentheses and recalling the value they match in the same expression (backreferences). This means that, among other things, a pattern can match strings of repeated words like "papa" or "WikiWiki", called squares in formal language theory. The pattern for these strings is
(.+)\1
.
Modern extensions including back-references make the regex systems not a candidate of regular languages, however IMO they can be lifted to context-free languages but not to Turing machines.
Regular grammars share a common property called pumping lemma. You can check the example here which proves 0n1n is not a regular grammar (that is quite similar to back-references). Here is how it can be shown that back-references don't satisfy pumping lemma property.
Pumping lemma in current context: to show that a regex system is regular grammar, there needs to be a finite length p such that all strings that match the regex and have length equals to or greater than p can be split up in three substrings xyz such that y is not an empty string and all the strings represented by xy*z (y pumped for [0, infinite) times) match with the regex.
If we can show no such p can satisfy the conditions for the regex then it is not in regular grammar.
For back-references we will need to have two of these pumping strings that too of equal length, one for the sub-pattern in captured group and one in the backreference. This is exactly what the push-down automata or context free languages are. There is also a pumping lemma for context free grammars that is based on splitting into uvwxy where v and x can be pumped equally n times. We can show that the regex with back-references system satisfies this lemma.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With