It appears that POSIX splits regular expression implementations into two kinds: Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE). The reference for Python's `re` module does not seem to specify which one it uses.


Does the Python regular expression module use BRE or ERE?

2 Answers

Neither. It's basically the PCRE dialect, but a distinct implementation.

The very first sentence in the `re` documentation says:

This module provides regular expression matching operations similar to those found in Perl.

While this does not immediately reveal to a newcomer how Perl's regular expressions relate to, e.g., POSIX regular expressions, it is well known that Perl 4 and later Perl 5 provided a substantially expanded feature set over the regex features of earlier tools, including what POSIX mandates for `grep -E`, aka ERE.

The `perlre` manual page describes the regular expression features in more detail, though you'll find much the same details in a different form in the Python documentation. The Perl manual page contains this bit of history:

The patterns used in Perl pattern matching evolved from those supplied in the Version 8 regex routines. (The routines are derived (distantly) from Henry Spencer's freely redistributable reimplementation of the V8 routines.)

(Here, V8 refers to Version 8 Unix. Spencer's library basically (re)implemented POSIX regular expressions.)

Perl 4 had a large number of convenience constructs like `\d`, `\s`, `\w` as well as symbolic shorthands like `\t`, `\f`, `\n`. Perl 5 added a significant set of extensions (which is still growing slowly) including, but not limited to,

Non-greedy quantifiers
Non-backtracking quantifiers
Unicode symbol and property support
Non-grouping parentheses
Lookaheads and lookbehinds
... Basically anything that starts with (?

As a result, the "regular" expressions are by no means strictly "regular" any longer.

This was reimplemented in a portable library by Philip Hazel, originally for the Exim mail server; his PCRE library has found its way into myriad different applications, and its dialect into a number of programming languages (Ruby, PHP, Python, etc.). Incidentally, in spite of the name, the library is not strictly "Perl compatible" (any longer); there are differences in features as well as in behavior. (For example, Perl internally changes `*` to something like `{0,32767}` while PCRE does something else.)

An earlier version of Python actually had a different regex implementation, and there are plans to change it again (though it will remain basically PCRE). This is the situation as of Python 2.7 / 3.5.

answered Oct 03 '22 21:10

tripleee

Except for some similarity in the syntax, the `re` module doesn't follow the POSIX standard for regular expressions.

Different matching semantics

A POSIX regular expression matcher (which can be implemented with a DFA/NFA or even a backtracking engine) always finds the leftmost-longest match, while the `re` module is a backtracking engine that finds the leftmost "earliest" match ("earliest" according to the search order defined by the regular expression).

The difference in matching semantics can be observed when matching `(Prefix|PrefixSuffix)` against `PrefixSuffix`.

In a POSIX-compliant implementation of POSIX regex (not one that only borrows the syntax), the regex will match `PrefixSuffix`.
In contrast, the `re` engine (and many other backtracking regex engines) will match `Prefix` only, since `Prefix` is specified first in the alternation.
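
The `re` side of this comparison is easy to check. The following is a minimal sketch; the variable names are only illustrative. It also shows that listing the longer alternative first changes the result, which is a workaround for this particular pattern rather than a way to get POSIX semantics in general.

```python
import re

text = "PrefixSuffix"

# Backtracking engine: the first alternative that leads to a match wins.
first_listed = re.search(r"Prefix|PrefixSuffix", text).group(0)

# Listing the longer alternative first changes what is matched.
longer_first = re.search(r"PrefixSuffix|Prefix", text).group(0)

print(first_listed)  # Prefix
print(longer_first)  # PrefixSuffix
```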

The difference can also be seen when matching `(xxx|xxxxx)*` against `xxxxxxxxxx` (a string of 10 `x`'s):

On Cygwin:

```
$ [[ "xxxxxxxxxx" =~ (xxx|xxxxx)* ]] && echo "${BASH_REMATCH[0]}"
xxxxxxxxxx
```

All 10 x's are matched.

In Python:
```
>>> import re
>>> re.search(r'(?:xxx|xxxxx)*', 'xxxxxxxxxx').group(0)
'xxxxxxxxx'
```
Only 9 `x`'s are matched: the engine picks the first alternative, `xxx`, in all 3 repetitions, and nothing forces it to backtrack and try the second alternative.
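
To see that the engine does backtrack when something forces it to, one can require the whole string to match. This is a small sketch using `re.fullmatch` (available since Python 3.4); the variable names are illustrative only.

```python
import re

text = "x" * 10

# Unanchored: three repetitions of "xxx" succeed, so the engine stops at 9.
unanchored = re.search(r"(?:xxx|xxxxx)*", text).group(0)

# Anchored to the whole string: 9 x's is not enough, so the engine
# backtracks until it finds "xxxxx" + "xxxxx".
anchored = re.fullmatch(r"(?:xxx|xxxxx)*", text).group(0)

print(len(unanchored))  # 9
print(len(anchored))    # 10
```

The match is still the first one found in the engine's search order; the anchor merely rules out the shorter candidates.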

POSIX-exclusive regular expression features

Apart from the difference in matching semantics, POSIX regular expressions also define syntax for collating symbols, equivalence class expressions, and collation-based character ranges. These features greatly increase the expressive power of the regex.

Taking the equivalence class expression as an example, from the POSIX documentation:

An equivalence class expression shall represent the set of collating elements belonging to an equivalence class, as described in Collation Order. [...]. The class shall be expressed by enclosing any one of the collating elements in the equivalence class within bracket-equal ( "[=" and "=]" ) delimiters. For example, if 'a', 'à', and 'â' belong to the same equivalence class, then "[[=a=]b]", "[[=à=]b]", and "[[=â=]b]" are each equivalent to "[aàâb]". [...]

Since these features depend heavily on the locale settings, the same regex may behave differently in different locales. The behavior also depends on the system's locale data for the collation order.
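
The `re` module has no bracket-equal syntax. As a rough approximation only, one can fold accented letters to their base letters before matching. The sketch below does this with Unicode NFD normalization; the helper name `strip_marks` is hypothetical, and the result is based on Unicode decomposition, not on the locale's collation data, so it is not equivalent to what POSIX specifies.

```python
import re
import unicodedata

def strip_marks(text):
    """Hypothetical helper: decompose characters and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

candidates = ["a", "à", "â", "b", "c"]

# Plain bracket expression: accented letters are not in the set.
plain = [c for c in candidates if re.fullmatch(r"[ab]", c)]

# Approximation of [[=a=]b]: match against the base letters instead.
folded = [c for c in candidates if re.fullmatch(r"[ab]", strip_marks(c))]

print(plain)   # ['a', 'b']
print(folded)  # ['a', 'à', 'â', 'b']
```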

`re` regular expression features

`re` borrows its syntax from Perl, but not all Perl regex features are implemented in `re`. Below are some regex features available in `re` that are unavailable in POSIX regular expressions:

Greedy/lazy quantifiers, which specify the order in which a quantifier is expanded. (While people usually call the `*` in POSIX greedy, in POSIX it only specifies the lower and upper bounds of the repetition. The so-called "greedy" behavior is due to the leftmost-longest match rule.)
Look-around assertions (look-ahead and look-behind)
Conditional patterns: `(?(id/name)yes-pattern|no-pattern)`
Shorthand constructs: `\b`, `\s`, `\d`, `\w` (some POSIX regular expression engines may implement these, since the standard leaves the behavior undefined for these cases)
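
Each of the features listed above can be tried out in a few lines. The patterns and sample strings below are made up for illustration and are not taken from the `re` documentation.

```python
import re

# Greedy vs. lazy quantifiers: same bounds, different expansion order.
greedy = re.search(r"<.+>", "<a><b>").group(0)
lazy = re.search(r"<.+?>", "<a><b>").group(0)

# Look-ahead and look-behind assert context without consuming it.
usd_amounts = re.findall(r"\d+(?= USD)", "10 USD, 20 EUR, 30 USD")
after_dollar = re.findall(r"(?<=\$)\d+", "cost $15, qty 3")

# Conditional pattern: require ")" only if group 1 matched "(".
paren_number = re.compile(r"(\()?\d+(?(1)\))")
balanced = [s for s in ["42", "(42)", "(42", "42)"] if paren_number.fullmatch(s)]

# Shorthand constructs: \b (word boundary) and \w (word character).
words = re.findall(r"\b\w+\b", "BRE or ERE?")

print(greedy, lazy)               # <a><b> <a>
print(usd_amounts, after_dollar)  # ['10', '30'] ['15']
print(balanced)                   # ['42', '(42)']
print(words)                      # ['BRE', 'or', 'ERE']
```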

answered Oct 03 '22 21:10

nhahtdh


Tags:

python

regex

posix

tarabyte