Ruby's regular expressions have a feature called atomic grouping, <code>(?>regexp)</code>, described here. Is there any equivalent in Python's <code>re</code> module?


Do Python regular expressions have an equivalent to Ruby's atomic grouping?

1 Answer

Python's re module did not directly support this feature before Python 3.11, which added (?>...) and possessive quantifiers. On older versions you can emulate it in three steps: use a zero-width lookahead assertion, (?=RE), which matches from the current point with the semantics you want; put a named group, (?P<name>RE), inside the lookahead; and then use a named backreference, (?P=name), to match exactly whatever the lookahead matched. This works because the engine never backtracks into a lookahead once it has succeeded. Combined, this gives you the same semantics, at the cost of an additional capturing group and a lot of syntax.

For example, the link you provided gives the Ruby example of

/"(?>.*)"/.match('"Quote"') #=> nil

We can emulate that in Python like so:

re.search(r'"(?=(?P<tmp>.*))(?P=tmp)"', '"Quote"') # => None

We can show that I'm doing something useful and not just spewing line noise: if we change the pattern so that the inner group can't eat the final ", it matches:

re.search(r'"(?=(?P<tmp>[A-Za-z]*))(?P=tmp)"', '"Quote"').groupdict() # => {'tmp': 'Quote'}

You can also use anonymous groups and numeric backreferences, but this gets awfully full of line-noise:

re.search(r'"(?=(.*))\1"', '"Quote"') # => None

(Full disclosure: I learned this trick from Perl's perlre documentation, which mentions it under the entry for (?>...).)
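As a quick sanity check, the patterns above can be wrapped in a small script. This is only a sketch: the atomic() helper and its name parameter are made up for illustration and are not part of the re module.

```python
import re


def atomic(inner, name="tmp"):
    """Build a pattern fragment that behaves like the atomic group (?>inner).

    Hypothetical helper: it captures `inner` inside a lookahead, then
    consumes exactly the captured text with a named backreference.
    """
    return "(?=(?P<%s>%s))(?P=%s)" % (name, inner, name)


# Ordinary greedy match: .* backtracks and gives the closing quote back.
plain = re.search(r'".*"', '"Quote"')

# Emulated atomic group: .* keeps the closing quote, so the match fails.
greedy = re.search(r'"' + atomic(r'.*') + r'"', '"Quote"')

# The inner group cannot consume the closing quote, so the match succeeds.
letters = re.search(r'"' + atomic(r'[A-Za-z]*') + r'"', '"Quote"')

print(plain.group(0))       # "Quote"
print(greedy)               # None
print(letters.groupdict())  # {'tmp': 'Quote'}
```

Each use of the helper within one pattern needs its own group name, because re rejects a pattern that defines the same named group twice.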

In addition to having the right semantics, this also has the appropriate performance properties. If we port an example out of perlre:

[nelhage@anarchique:~/tmp]$ cat re.py
import re
import timeit

# Plain version: the nested quantifiers can backtrack exponentially.
re_1 = re.compile(r'''\(
                           (
                             [^()]+           # x+
                           |
                             \( [^()]* \)
                           )+
                       \)
                   ''', re.X)
# Same pattern with the inner x+ made atomic via lookahead + backreference.
re_2 = re.compile(r'''\(
                           (
                             (?=(?P<tmp>[^()]+ ))(?P=tmp) # Emulate (?> x+)
                           |
                             \( [^()]* \)
                           )+
                       \)''', re.X)

print(timeit.timeit("re_1.search('((()' + 'a' * 25)",
                    setup  = "from __main__ import re_1",
                    number = 10))

print(timeit.timeit("re_2.search('((()' + 'a' * 25)",
                    setup  = "from __main__ import re_2",
                    number = 10))

We see a dramatic improvement:

[nelhage@anarchique:~/tmp]$ python re.py
96.0800571442
7.41481781006e-05

The gap only gets more dramatic as we extend the length of the search string.
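For anyone who wants to reproduce the comparison on Python 3 without waiting minutes, here is a rough sketch. The names naive, emulated, native and subject are mine, and the input is deliberately shortened to 18 characters (an arbitrary choice) so the slow case finishes quickly. The last block assumes Python 3.11 or newer, where re accepts (?>...) directly, and is skipped on older versions.

```python
import re
import sys
import timeit

# Plain version: nested quantifiers allow catastrophic backtracking.
naive = re.compile(r'''\(
        (
          [^()]+                         # x+
        |
          \( [^()]* \)
        )+
    \)''', re.X)

# Emulated atomic group: lookahead capture plus named backreference.
emulated = re.compile(r'''\(
        (
          (?=(?P<tmp>[^()]+))(?P=tmp)    # emulate (?> x+)
        |
          \( [^()]* \)
        )+
    \)''', re.X)

# Shorter than the 25 characters used above so the slow case stays fast.
subject = '((()' + 'a' * 18

t_naive = timeit.timeit(lambda: naive.search(subject), number=1)
t_emulated = timeit.timeit(lambda: emulated.search(subject), number=1)
print("naive:    %.6f s" % t_naive)
print("emulated: %.6f s" % t_emulated)

# Python 3.11+ supports (?>...) natively; older versions raise re.error.
if sys.version_info >= (3, 11):
    native = re.compile(r'\((?:(?>[^()]+)|\([^()]*\))+\)')
    t_native = timeit.timeit(lambda: native.search(subject), number=1)
    print("native:   %.6f s" % t_native)
```

Neither pattern finds a match in this input, so what is being timed is how long each one takes to give up. Raising the 18 makes the naive time grow roughly exponentially, while the emulated version should stay nearly flat.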

124

answered Sep 22 '22 13:09

nelhage

Tags:

python

regex

ruby

Alex Gaynor
