Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is recursive regex not regex?

I was reading through some of the responses in this question and saw that a few people said that recursive regular expressions were not strictly speaking regular expressions.

Why is this?

like image 880
Ben Shelock Avatar asked Feb 12 '10 22:02

Ben Shelock


People also ask

Can regular expressions be recursive?

Recursion is mostly used to match balanced constructs. The regex \([^()]*+(?:(? R)[^()]*+)*+\) matches a pair of parentheses with all parentheses between them correctly nested, regardless of how many nested pairs there are or how deeply they are nested. This regex satisfies both requirements for valid recursion.

What is difference between regex and regular expression?

A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation.

Why you should not use regex?

Regex isn't suited to parse HTML because HTML isn't a regular language. Regex probably won't be the tool to reach for when parsing source code. There are better tools to create tokenized outputs. I would avoid parsing a URL's path and query parameters with regex.

Does Python support regex recursion?

And the answer is YES! Now, it's not possible to do it with the built-in re package in Python because it doesn't support recursive pattern! To solve this problem with a regular expression in Python then, you need to install the regex package which is more compatible with PCRE.


3 Answers

“Strictly” regular expressions describe regular languages. But many features, such as the usage of backreferences in the expression itself or recursion for example, can be used to write regular expressions that accept non-regular languages.

For example, the language described by

(a+)b+\1

isn't regular, as you can't force that a appears the same number of times before and after the bs. At least not in a regular language. With context-free or even context-sensitive languages, that's a completely different matter.

However, regular expressions that only use elementary things such as the various quantifiers, character classes, etc. usually still describe regular languages.

like image 78
Joey Avatar answered Oct 18 '22 10:10

Joey


All regular languages can be recognized by a finite automaton. A finite automaton has a finite number of states, and consequently, finite memory (hence the name). A recursive "regular" expression requires a potentially infinite stack space to do the recursion, thus it is not possible to recognize it with a finite automaton, therefore it is not regular.

like image 15
mmx Avatar answered Oct 18 '22 10:10

mmx


The strict definition of regular language from theoretical computer science may seem abstract with little practical benefit, but if you’re ever faced with the need to implement a state machine to recognize certain inputs, you can save yourself a lot of useless effort and hairpulling if you can prove up front that it can’t be done.

An informal way to express it is recognition of a regular language cannot require an arbitrary amount of memory. The pumping lemma for regular languages is useful for proving that a particular language (i.e., a set of strings) cannot be recognized by a deterministic finite automaton.

From An Introduction to Formal Languages and Automata by Peter Linz (pg. 115, 3rd ed.):

Theorem 4.8

Let L be an infinite regular language. Then there exists some positive integer m such that any w ∈ L with |w| ≥ m can be decomposed as

w = xyz,

with

|xy| ≤ m,

and

|y| ≥ 1,

such that

wi = xyiz — Eq. (4.2)

is also in L for all i = 0, 1, 2, …

To recognize an infinite language, a finite automaton must “pump” or repeat some portion of its states, and that’s the function of yi (notation for some string y repeated i times).

Very nearly all proofs involving the pumping lemma involve proof by contradiction. On page 117, the author proves that the language L = { anbn : n ≥ 0 }—i.e., strings of the form aaa…bbb… where the all-a and all-b substrings are equal in length—is not regular:

Assume that L is regular, so that the pumping lemma must hold. We do not know the value of m, but whatever it is, we can always choose n = m. Therefore, the substring y must consist entirely of a's. Suppose |y| = k. Then the string obtained by using i = 0 in Equation (4.2) is

w0 = am-kbm

and is clearly not in L. This contradicts the pumping lemma and thereby indicates that the assumption that L is regular must be false.

Other examples of languages that are not regular:

  • L = { wwR : w ∈ Σ* } — i.e., palindromes
  • L = { w ∈ Σ* : na(w) < nb(w) } — i.e., number of as fewer than number of bs
  • L = { an! : n ≥ 0 }
  • L = { anbl : nl }
  • L = { anbl : n + l is a prime number }

It turns out that what we loosely call regular expressions are considerably more powerful: matching regular expressions with backreferences is NP-hard!

like image 11
Greg Bacon Avatar answered Oct 18 '22 09:10

Greg Bacon