Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Do extended regexes support back-references?

Tags:

c

regex

posix

Wikipedia says that extended regexes "dropped support for backreferences", so the "basic" regex mode has to be used to enable those. However, it seems that a number of implementations do support backreferences for extended regexes. For example, with gcc 4.6 on Ubuntu Precise, they are supported. FreeBSD implementations seem to support them only in basic mode.

Boost says (and seems to agree with Wikipedia) that backreferences are not supported for extended regexes, but Boost::Regex adds them as an extension.

Is this just a poorly defined part of the standard which is interpreted differently by every implementation?

like image 513
Eli Bendersky Avatar asked Nov 10 '12 14:11

Eli Bendersky


People also ask

What is back reference?

back-references are regular expression commands which refer to a previous part of the matched regular expression. Back-references are specified with backslash and a single digit (e.g. ' \1 '). The part of the regular expression they refer to is called a subexpression, and is designated with parentheses.

What are extended regular expressions?

An extended regular expression specifies a set of strings to be matched. The expression contains both text characters and operator characters. Text characters match the corresponding characters in the strings being compared. Operator characters specify repetitions, choices, and other features.

What is Posix in regex?

POSIX bracket expressions are a special kind of character classes. POSIX bracket expressions match one character out of a set of characters, just like regular character classes. They use the same syntax with square brackets.

What is Posix ere?

POSIX Extended Regular ExpressionsThe Extended Regular Expressions or ERE flavor standardizes a flavor similar to the one used by the UNIX egrep command. “Extended” is relative to the original UNIX grep, which only had bracket expressions, dot, caret, dollar and star. An ERE support these just like a BRE.


1 Answers

As others have already pointed out, it's quite clear that POSIX EREs do not support back-references.

The rationale given in the OpenGroup Base Specifications Issue 7 for not adding back-references to EREs is given as:

It was suggested that, in addition to interval expressions, back-references ( '\n' ) should also be added to EREs. This was rejected by the standard developers as likely to decrease consensus.

Quoted from: Rationale: Base Definitions: Extended Regular Expressions

The primary reason for this limitation is to allow POSIX EREs to be converted to a deterministic finite automata (DFA), and indeed the original implementation of EREs in Unix was done as a DFA. The use of a DFA allows guarantees to be made about the performance of the implementation. Pattern matching with (an unbounded number of) back references is an NP-hard problem, and perhaps even an NP-complete problem. Consensus in the POSIX standards committee could never have been reached if back-references were proposed for EREs because that would force all the companies using the original Unix implementation to change their code to a non-deterministic implementation and to drop their performance guarantees, and some of those companies had members on the committee.

It has also been noted that back-references in REs are not intuitive for either users or implementors, and indeed they've caused extreme confusion more often than now. See for example the examples given in RE-Interpretation: The Dark Corners

NOTE: back-references in REs are not the same as references to sub-patterns in substitution text in tools such as sed.

like image 129
Greg A. Woods Avatar answered Oct 09 '22 22:10

Greg A. Woods