Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

100% CPU usage with a regexp depending on input length

I'm trying to come up with a regexp in Python that has to match any character but avoiding three or more consecutive commas or semicolons. In other words, only up to two consecutive commas or semicolons are allowed.

So this is what I currently have:

^(,|;){,2}([^,;]+(,|;){,2})*$

And it seems to work as expected:

>>> r.match('')
<_sre.SRE_Match object at 0x7f23af8407e8>
>>> r.match('foo,')
<_sre.SRE_Match object at 0x7f23af840750>
>>> r.match('foo, a')
<_sre.SRE_Match object at 0x7f23af8407e8>
>>> r.match('foo, ,')
<_sre.SRE_Match object at 0x7f23af840750>
>>> r.match('foo, ,,a')
<_sre.SRE_Match object at 0x7f23af8407e8>
>>> r.match('foo, ,,,')
>>> r.match('foo, ,,,;')
>>> r.match('foo, ,, ;;')
<_sre.SRE_Match object at 0x7f23af840750>

But as I start to increase the length of the input text, the regexp seems to need way more time to give a response.

>>> r.match('foo, bar, baz,, foo')
<_sre.SRE_Match object at 0x7f23af8407e8>
>>> r.match('foo, bar, baz,, fooooo, baaaaar')
<_sre.SRE_Match object at 0x7f23af840750>
>>> r.match('foo, bar, baz,, fooooo, baaaaar,')
<_sre.SRE_Match object at 0x7f23af8407e8>
>>> r.match('foo, bar, baz,, fooooo, baaaaar,,')
<_sre.SRE_Match object at 0x7f23af840750>
>>> r.match('foo, bar, baz,, fooooo, baaaaar,,,')
>>> r.match('foo, bar, baz,, fooooo, baaaaar,,,,')
>>> r.match('foo, bar, baz,, fooooo, baaaaar, baaaaaaz,,,,')

And finally it gets completely stuck at this stage and the CPU usage goes up to 100%.

I'm not sure if the regexp could be optimized or there's something else involved, any help appreciated.

like image 672
julen Avatar asked Jun 03 '11 17:06

julen


2 Answers

You're running into catastrophic backtracking.

The reason for this is that you have made the separators optional, and therefore the [^,;]+ part (which is itself in a repeating group) of your regex will try loads of permutations (of baaaaaaaz) before finally having to admit failure when confronted with more than two commas.

RegexBuddy aborts the match attempt after 1.000.000 steps of the regex engine with your last test string. Python will keep trying.

Imagine the string baaz,,,:

Trying your regex, the regex engine has to check all these:

  1. baaz,,<failure>
  2. baa + z,,<failure>
  3. ba + az,,<failure>
  4. ba + a + z,,<failure>
  5. b + aaz,,<failure>
  6. b + aa + z,,<failure>
  7. b + a + az,,<failure>
  8. b + a + a +z,,<failure>

before declaring overall failure. See how this grows exponentially with each additional character?

Behavior like this can be avoided with possessive quantifiers or atomic groups, both of which are sadly not supported by Python's current regex engine. But you can do an inverse check easily:

if ",,," in mystring or ";;;" in mystring:
    fail()

without needing a regex at all. If ,;, and the likes could also occur and should be excluded, then use Andrew's solution.

like image 120
Tim Pietzcker Avatar answered Sep 18 '22 07:09

Tim Pietzcker


I think the following should do what you want:

^(?!.*[,;]{3})

This will fail if the string contains three or more , or ; in a row. If you actually want it to match a character add a . at the end.

This utilizes negative lookahead, which will cause the entire match to fail if the regex .*[,;]{3} would match.

like image 20
Andrew Clark Avatar answered Sep 18 '22 07:09

Andrew Clark