Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python regex slow when whitespace in string

Tags:

python

regex

I would like to match strings, using Pythons regex module.

In my case I want to verify that strings start, end and consist of upper case letters combined by "_". As an example, the following string is valid: "MY_HERO2". The following strings are not valid: "_MY_HREO2", "MY HERO2", "MY_HERO2_"

To validate a string I use this code:

import re
my_string = "MY_HERO"   

p = re.compile("^([A-Z,0-9]+_??)+[A-Z,0-9]$")
if p.match(my_string):
    print "validated"

So what is my problem? Validating long string containing whitespaces is very, very slow. How can I avoid this? Is my pattern wrong? What is the reason for this behavior?

Here are some numbers:

MY_HERO2 --> 53 ms
MY_SUPER_GREAT_UNBELIEVABLE_HERO --> 69 microseconds
MY_SUPER_GREAT_UNBELIEVABLE HERO --> 223576 microseconds
MY_SUPER_GREAT_UNBELIEVABLE_STRONG_HERO --> 15 microseconds
MY_SUPER_GREAT_UNBELIEVABLE_STRONG HERO --> 979429 microseconds

Thanks for your anwsers and responses in advance. :-) Paul

like image 523
Peter Avatar asked Aug 28 '13 10:08

Peter


Video Answer


1 Answers

So what is my problem?

The problem is catastrophic backtracking. The regex engine is trying a whole lot of variations which is taking a lot of time.

Let's try this with a pretty simple example: A_B D.

The engine first matches A with [A-Z,0-9]+ then it now tries _?? but since it is optional (lazy) it skips it for now, and this has finished ([A-Z,0-9]+_??)+.

Now the engine tries to match [A-Z,0-9] but there is a _ in the string so it fails and now it needs to backtrack, so it re-enters ([A-Z,0-9]+_??)+ where it failed the last time and tries _?? and succeeds.

Now the engine exits ([A-Z,0-9]+_??)+ again and tries to match [A-Z,0-9] and it succeeds, then it tries to match end-of-string $ but fails, now it backtracks and enters ([A-Z,0-9]+_??)+ again.

I hope you see where this is going as I am tiered of writing it and we haven't reached the space character yet -actually any character not accepted in your regex such as # or %, etc will cause this, not just the whitespace- and this is a small tiny example, in the case of your long strings, it will have to do this hundreds and hundreds of times until it is able to match the entire string or fails, hence the great amount of time.

Validating long string containing whitespaces is very, very slow.

Again this due to backtracking and hell of variations.

How can I avoid this?

You can use this regex instead:

^([A-Z0-9]_?)*[A-Z0-9]$

This ensures the string starts with an uppercase or number followed by an optional _, repeat this one or more times and just make sure there is an uppercase or a number at the end.

Is my pattern wrong? What is the reason for this behavior?

Your expression is not wrong, but it is highly inefficient.

([A-Z,0-9]+_??)+[A-Z,0-9]$
          ^  ^ ^
        You  |see those two, they are a lot of trouble together 
             |
            These two ?? with the the other two + on the inside and outside, hell :)
like image 197
Ibrahim Najjar Avatar answered Oct 13 '22 09:10

Ibrahim Najjar