Very slow regular expression search

Tags: performance, python, string, regex

I'm not sure I completely understand what is going on with the following regular expression search:

>>> import re
>>> template = re.compile(r"(\w+)+\.")
>>> target = "a" * 30
>>> template.search(target)

The search() call takes minutes to complete, and CPU usage goes to 100%. The behavior is reproducible on both Python 2.7.5 and 3.3.3.

Interestingly, if the string is shorter than about 20-25 characters, search() returns almost instantly.

What is happening?

843

asked Mar 26 '14 02:03

alecxe

1 Answer

To understand this problem, you need to understand how the NFA behind a regular expression works (here that means a backtracking, NFA-style engine such as the one in Python's re module).

Defining an NFA properly is more than I can do here; the Wikipedia article on NFAs gives a better explanation. For now, just think of the NFA as a robot that searches for the pattern you give it.

A crudely implemented NFA is somewhat dumb: it only looks at the next token or two of the pattern. So in your synthetic example, the NFA at first looks only at \w+ (ignoring the grouping parentheses).

Because + is a greedy quantifier, that is, it matches as many characters as possible, the NFA keeps consuming characters in target. After 30 as, the NFA reaches the end of the string. Only then does it realize that it needs to match the remaining tokens in template. The next token is the outer +. The NFA has already satisfied it, so it proceeds to \., and this time it fails.

What the NFA does next is step back one character and try to match the pattern by shortening the submatch of \w+. So the NFA splits target into two groups: 29 as for one \w+, and one trailing a. The NFA first tries to consume the trailing a as a second repetition of the group (the outer +), but it still fails when it reaches \.. The NFA repeats this process until it finds a full match; if there is none, it tries every possible partition.
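To make the backtracking concrete, here is a toy simulation of the process described above. It is only a sketch: count_dot_attempts is a made-up helper, it handles just this one pattern anchored at the start of the string, it approximates \w with isalnum() plus underscore, and it is not how Python's re module is actually implemented. It counts how many times the simulated engine reaches \. and has to test it.

```python
def count_dot_attempts(target):
    r"""Toy backtracking model of (\w+)+\. anchored at position 0.

    Returns how many times the simulated engine tries to match \.
    This is an illustration only, not the real re implementation.
    """
    attempts = 0

    def group(pos):
        nonlocal attempts
        # Find how far a greedy \w+ could run starting at pos.
        end = pos
        while end < len(target) and (target[end].isalnum() or target[end] == "_"):
            end += 1
        # Greedy: try the longest run first, then give back one character at a time.
        for stop in range(end, pos, -1):
            # Outer +: try to start another repetition of the group.
            if stop < len(target) and group(stop):
                return True
            # Otherwise try to match the literal period.
            attempts += 1
            if stop < len(target) and target[stop] == ".":
                return True
        return False

    group(0)
    return attempts


for n in (5, 10, 15):
    print(n, count_dot_attempts("a" * n))
```

In this model each extra character roughly doubles the count, while a string that actually ends with a period, such as "aaa.", succeeds on the first attempt.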

So (\w+)+\. instructs the NFA to group target like this: target is partitioned into one or more groups, with at least one character per group, and the groups must be followed by a period '.'. As long as the period is not matched, the NFA tries every possible partition. So how many partitions are there? 2^(n-1) for the full string of n characters, which grows exponentially with n. (Just think of choosing, for each of the n-1 gaps between the as, whether to insert a separator.) Like below:

aaaaaaa a
aaaaaa aa
aaaaaa a a
.....
.......
aa a a ... a
a a a a a .... a
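If you want to check the count yourself, the sketch below enumerates every such partition for short strings. The partitions function is a hypothetical helper written for this illustration. With n characters there are n-1 gaps between them, and each gap either gets a separator or not, so the count comes out as 2^(n-1), the same exponential growth described above.

```python
from itertools import combinations


def partitions(s):
    """Yield every way to split s into one or more non-empty consecutive groups."""
    n = len(s)
    gaps = range(1, n)  # positions where a separator may be inserted
    for k in range(n):  # k = number of separators used (0 .. n-1)
        for cuts in combinations(gaps, k):
            bounds = (0, *cuts, n)
            yield [s[i:j] for i, j in zip(bounds, bounds[1:])]


# Show all partitions of a short string.
for groups in partitions("aaaa"):
    print(" ".join(groups))

# Compare the number of partitions with 2 ** (n - 1).
for n in range(1, 11):
    count = sum(1 for _ in partitions("a" * n))
    print(n, count, 2 ** (n - 1))
```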

If the NFA matches \., it won't hurt much. But when the match fails, this expression is doomed to take exponential time.
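One possible way out, assuming you don't need the contents of the capture group: the outer + adds nothing to what the pattern can match, so \w+\. finds the same overall match without the nested repetition. The sketch below times the nested pattern on a few short inputs (kept small on purpose) and then runs the flattened pattern on the original 30-character target. The names nested, flat and timed_search are just for illustration, and the actual timings depend on your machine and Python version.

```python
import re
import time

nested = re.compile(r"(\w+)+\.")  # the pattern from the question
flat = re.compile(r"\w+\.")       # same overall match, no nested quantifier


def timed_search(pattern, text):
    """Run pattern.search(text) and return (match, elapsed seconds)."""
    start = time.perf_counter()
    result = pattern.search(text)
    return result, time.perf_counter() - start


# The nested pattern slows down quickly as the failing input grows.
for n in (12, 14, 16, 18):
    _, seconds = timed_search(nested, "a" * n)
    print("nested", n, round(seconds, 4))

# The flattened pattern fails fast even on the 30-character target.
result, seconds = timed_search(flat, "a" * 30)
print("flat", 30, result, round(seconds, 6))
```

When a period is present, both patterns return the same overall match; only group(1) is missing from the flattened version.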

I'm not advertising, but Mastering Regular Expressions is a good book for understanding the mechanisms behind regular expressions.

138

answered Sep 26 '22 14:09

Herrington Darkholme
