Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What's the maximum number of repetitions allowed in a Python regex?

Tags:

python

regex

In Python 2.7 and 3, the following works:

>>> re.search(r"a{1,9999}", 'aaa')
<_sre.SRE_Match object at 0x1f5d100>

but this gives an error:

>>> re.search(r"a{1,99999}", 'aaa')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/usr/lib/python2.7/re.py", line 142, in search
   return _compile(pattern, flags).search(string)
   File "/usr/lib/python2.7/re.py", line 240, in _compile
   p = sre_compile.compile(pattern, flags)
   File "/usr/lib/python2.7/sre_compile.py", line 523, in compile
   groupindex, indexgroup
RuntimeError: invalid SRE code

It seems like there is an upper limit on the number of repetitions allowed. Is this part of the regular expression specification, or a Python-specific limitation? If Python-specific, is the actual number documented somewhere, and does it vary between implementations?

like image 827
mojones Avatar asked Nov 04 '13 17:11

mojones


People also ask

What is regex in Python?

A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern. RegEx can be used to check if a string contains the specified search pattern.

What is r in regex Python?

The 'r' at the start of the pattern string designates a python "raw" string which passes through backslashes without change which is very handy for regular expressions (Java needs this feature badly!). I recommend that you always write pattern strings with the 'r' just as a habit.

What are different types of regular expression in Python?

Python offers two different primitive operations based on regular expressions: match checks for a match only at the beginning of the string, while search checks for a match anywhere in the string (this is what Perl does by default).

What does re match return Python?

Both return the first match of a substring found in the string, but re. match() searches only from the beginning of the string and return match object if found. But if a match of substring is found somewhere in the middle of the string, it returns none.


1 Answers

A quick manual binary search revealed the answer, specifically 65535:

>>> re.search(r"a{1,65535}", 'aaa')
<_sre.SRE_Match object at 0x2a9a68>
>>> 
>>> re.search(r"a{1,65536}", 'aaa')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/re.py", line 142, in search
    return _compile(pattern, flags).search(string)
  File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/re.py", line 240, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/sre_compile.py", line 523, in compile
    groupindex, indexgroup
OverflowError: regular expression code size limit exceeded

This is discussed here:

The limit is an implementation detail. The pattern is compiled into codes which are then interpreted, and it just happens that the codes are (usually) 16 bits, giving a range of 0..65535, but it uses 65535 to represent no limit and doesn't warn if you actually write 65535.

and

The quantifiers use 65535 to represent no upper limit, so ".{0,65535}" is equivalent to ".*".


Thanks to the authors of the comments below for pointing a few more things out:

  • CPython implements this limitation in _sre.c. (@LukasGraf)
  • There is a constant MAXREPEAT in sre_constants.py that holds this max repetition value:

    >>> import sre_constants
    >>> 
    >>> sre_constants.MAXREPEAT
    65535
    

    (@MarkkuK. and @hcwhsa)

like image 162
arshajii Avatar answered Sep 16 '22 13:09

arshajii