The following Python code is incredibly slow:
import re
re.match( '([a]+)+c', 'a' * 30 + 'b' )
and it gets worse if you replace 30 with a larger constant.
I suspect that the parsing ambiguity due to the consecutive +
is the culprit, but I'm not very expert in regexp parsing and matching. Is this a bug of the Python regexp engine, or any reasonable implementation will do the same?
I'm not a Perl expert, but the following returns quite fast
perl -e '$s="aaaaaaaaaaaaaaaaaaaaaaaaaaaaaab"; print "ok\n" if $s =~ m/([a]+)+c/;'
and increasing the number of 'a' does not alter substantially the execution speed.
Performance of the Best Regex When I reran the test, the best regex took about the same amount of time to match the non-matching input, but the matching input took only on average 800 milliseconds to run, as opposed to 4,700 milliseconds for the better regex and a whopping 17,000 milliseconds for the bad regex.
Yet matching a string with a regex can be surprisingly slow. So slow it can even stop any JS app or take 100% of a server CPU time causing denial of service (DOS). At TextMaster we used regular expressions in an important part of our translation platform.
Expose Literal Characters Regex engines match fastest when anchors and literal characters are right there in the main pattern, rather than buried in sub-expressions. Hence the advice to "expose" literal characters whenever you can take them out of an alternation or quantified expression. Let's look at two examples.
While a well written regular expression can be very efficient, a badly written regular expression might take a long time to run and slow the system down significantly.
I assume that Perl is clever enough to collapse the two +
s into one, while Python is not. Now let's imagine what the engine does, if this is not optimized away. And remember that capturing is generally expensive. Note also, that both +
s are greedy, so the engine will try to use as many repetitions as possible in one backtracking step. Each bullet point represents one backtracking step:
[a]
as possible, and consumes all thirty a
s. Then it can not go any further, so it leaves the first repetition and captures 30 a
s. Now the next repetition is on and it tries to consume some more with another ([a]+)
but that doesn't work of course. And then the c
fails to match b
.a
consumed by the inner repetition. After this we leave the inner repetition again, so the engine will capture 29 a
s. Then the other +
kicks in, the inner repetition is tried out again (consuming the 30th a
). Then we leave the inner repetition once again, which also leaves the capturing group, so the first capture is thrown away and the engine captures the last a
. c
fails to match b
.a
inside. We capture 28 a
s. The second (outer repetition) of the capturing group consumes the last 2 a
s which are captured. c
fails to match b
.a
s. The one that is left will be captured. Then we enter the capturing group for the third time and capture the last a
. c
fails to match b
.a
s in the first repetition.Here is a simple visualization. Each line represents one backtracking step, and each set of parentheses shows one consumption of the inner repetition. The curly brackets represent those that are newly captured for that step of backtracking, while normal parentheses are not revisited in this particular backtracking step. And I leave out the b
/c
because it will never be matched:
{aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
{aaaaaaaaaaaaaaaaaaaaaaaaaaaaa}{a}
{aaaaaaaaaaaaaaaaaaaaaaaaaaaa}{aa}
(aaaaaaaaaaaaaaaaaaaaaaaaaaaa){a}{a}
{aaaaaaaaaaaaaaaaaaaaaaaaaaa}{aaa}
(aaaaaaaaaaaaaaaaaaaaaaaaaaa){aa}{a}
(aaaaaaaaaaaaaaaaaaaaaaaaaaa){a}{aa}
(aaaaaaaaaaaaaaaaaaaaaaaaaaa)(a){a}{a}
{aaaaaaaaaaaaaaaaaaaaaaaaaa}{aaaa}
(aaaaaaaaaaaaaaaaaaaaaaaaaa){aaa}{a}
(aaaaaaaaaaaaaaaaaaaaaaaaaa){aa}{aa}
(aaaaaaaaaaaaaaaaaaaaaaaaaa)(aa){a}{a}
(aaaaaaaaaaaaaaaaaaaaaaaaaa){a}{aaa}
(aaaaaaaaaaaaaaaaaaaaaaaaaa)(a){aa}{a}
(aaaaaaaaaaaaaaaaaaaaaaaaaa)(a){a}{aa}
(aaaaaaaaaaaaaaaaaaaaaaaaaa)(a)(a){a}{a}
And. so. on.
Note that in the end the engine will also try all combinations for subsets of a
(backtracking just through the first 29 a
s then through the first 28 a
s) just to discover, that c
does also not match a
.
The explanation of regex engine internals is based on information scattered around regular-expressions.info.
To solve this. Simply remove one of the +
s. Either r'a+c'
or if you do want to capture the amount of a
s use r'(a+)s'
.
Finally, to answer your question. I would not consider this a bug in Python's regex engine, but only (if anything) a lack in optimization logic. This problem is not generally solvable, so it is not too unreasonably for an engine to assume, that you have to take care of catastrophic backtracking yourself. If Perl is clever enough to recognize sufficiently simple cases of it, so much the better.
Rewrite your regular expression to eliminate the "catastrophic backtracking", by removing the nested quantifiers (see this question):
re.match( '([a]+)+c', 'a' * 30 + 'b' )
# becomes
re.match( 'a+c', 'a' * 30 + 'b' )
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With