Ruby's regular expressions have a feature called atomic grouping, (?>regexp), described here. Is there any equivalent in Python's re module?
A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern. RegEx can be used to check if a string contains the specified search pattern.
A group is a part of a regex pattern enclosed in parentheses. We create a group by placing the pattern between ( and ). For example, the regular expression (cat) creates a single group containing the letters 'c', 'a', and 't'.
An atomic group is a group that, when the regex engine exits from it, automatically throws away all backtracking positions remembered by any tokens inside the group. Atomic groups are non-capturing.
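To make the contrast concrete, here is a minimal sketch of what an ordinary (non-atomic) group does in Python. The greedy .* first swallows the closing quote, then backtracks by one character so the overall pattern can still match. An atomic group would discard that backtracking position and the match would fail.

```python
import re

# An ordinary capturing group: .* first grabs 'Quote"', then gives back
# one character so that the closing quote in the pattern can match.
m = re.search(r'"(.*)"', '"Quote"')

print(m.group(0))  # "Quote"
print(m.group(1))  # Quote
```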
Python's re module did not directly support this feature before Python 3.11, which added (?>...) and possessive quantifiers. On older versions you can emulate it by using a zero-width lookahead assertion ((?=RE)), which matches from the current point with the same semantics you want, putting a named group ((?P<name>RE)) inside the lookahead, and then using a named backreference ((?P=name)) to match exactly whatever the zero-width assertion matched. Combined, this gives you the same semantics, at the cost of creating an additional matching group and a lot of syntax.
For example, the link you provided gives the Ruby example of
/"(?>.*)"/.match('"Quote"') #=> nil
We can emulate that in Python as such:
re.search(r'"(?=(?P<tmp>.*))(?P=tmp)"', '"Quote"') # => None
We can show that I'm doing something useful and not just spewing line noise, because if we change it so that the inner group doesn't eat the final ", it still matches:
re.search(r'"(?=(?P<tmp>[A-Za-z]*))(?P=tmp)"', '"Quote"').groupdict() # => {'tmp': 'Quote'}
You can also use anonymous groups and numeric backreferences, but this gets awfully full of line-noise:
re.search(r'"(?=(.*))\1"', '"Quote"') # => None
(Full disclosure: I learned this trick from Perl's perlre documentation, which mentions it under the documentation for (?>...).)
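If the syntax gets too noisy, the trick can be wrapped in a small helper. The sketch below is only an illustration: atomic() and HAS_NATIVE_ATOMIC are hypothetical names of my own, not part of re. It assumes that Python 3.11 and later accept (?>...) natively and falls back to the lookahead-plus-backreference emulation otherwise.

```python
import re
import sys

# Assumption: the re module accepts (?>...) natively from Python 3.11 on.
HAS_NATIVE_ATOMIC = sys.version_info >= (3, 11)


def atomic(subpattern, name="tmp", native=HAS_NATIVE_ATOMIC):
    """Hypothetical helper: return a fragment matching `subpattern` atomically."""
    if native:
        return "(?>" + subpattern + ")"
    # The lookahead captures the match without consuming it; the named
    # backreference then consumes exactly the captured text.
    return "(?=(?P<{0}>{1}))(?P={0})".format(name, subpattern)


# Force the emulation here so the behaviour is the same on every version.
greedy = re.compile('"' + atomic(".*", native=False) + '"')
letters = re.compile('"' + atomic("[A-Za-z]*", native=False) + '"')

print(greedy.search('"Quote"'))            # None
print(letters.search('"Quote"').group(0))  # "Quote"
```

Each emulated group needs its own name (or number) if you use more than one in a pattern, which is why name is a parameter.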
In addition to having the right semantics, this also has the appropriate performance properties. If we port an example out of perlre:
[nelhage@anarchique:~/tmp]$ cat re.py
import re
import timeit

re_1 = re.compile(r'''\(
                           (
                             [^()]+           # x+
                           |
                             \( [^()]* \)
                           )+
                       \)
                   ''', re.X)
re_2 = re.compile(r'''\(
                           (
                             (?=(?P<tmp>[^()]+ ))(?P=tmp) # Emulate (?> x+)
                           |
                             \( [^()]* \)
                           )+
                       \)''', re.X)

# Time 10 failed searches with each pattern against the same input.
print(timeit.timeit("re_1.search('((()' + 'a' * 25)",
                    setup="from __main__ import re_1",
                    number=10))

print(timeit.timeit("re_2.search('((()' + 'a' * 25)",
                    setup="from __main__ import re_2",
                    number=10))
We see a dramatic improvement:
[nelhage@anarchique:~/tmp]$ python re.py
96.0800571442
7.41481781006e-05
The difference only gets more dramatic as we extend the length of the search string.
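To see that growth for yourself, here is a rough sketch that times a single failed search at a few input lengths. PLAIN, EMULATED and time_search are names I made up for this illustration, and the lengths are kept small so the plain pattern finishes quickly. The exact numbers will depend on your machine and Python version, so treat them as indicative only.

```python
import re
import timeit

# Same two patterns as in re.py, under illustrative names.
PLAIN = re.compile(r'''\(
    (
        [^()]+                        # x+
      |
        \( [^()]* \)
    )+
\)''', re.X)

EMULATED = re.compile(r'''\(
    (
        (?=(?P<tmp>[^()]+))(?P=tmp)   # emulate (?> x+)
      |
        \( [^()]* \)
    )+
\)''', re.X)


def time_search(pattern, n_trailing):
    """Time one search against '((()' followed by n_trailing 'a' characters."""
    subject = '((()' + 'a' * n_trailing
    return timeit.timeit(lambda: pattern.search(subject), number=1)


if __name__ == "__main__":
    # Each row: input length, plain time, emulated-atomic time (seconds).
    for n in (10, 14, 18):
        print(n, time_search(PLAIN, n), time_search(EMULATED, n))
```

Both patterns agree on what matches. Only the amount of backtracking on a failing input differs.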