Is there a reason for python regex not to compile r'(\s*)+'?

Tags:

python

regex

I don't understand why '(\s*)+' raises the error 'nothing to repeat', while '(\s?)+' compiles just fine.

I've discovered that this problem has been known for quite some time (see, for example, the question "regex error - nothing to repeat"), but I still see it in Python 3.3.1.

So I am wondering if there is a rational explanation for this behavior.

In reality I want to match a line of repeated words or numbers, for example:

'foo foo foo foo'

I've come up with this:

'(\w+)\s+(\1\s*)+'

It failed because of the second group, (\1\s*)+. In most cases I would not have more than one space between words, so (\1\s?)+ would work. For practical purposes (\1\s{0,1000})+ should also work.
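To make the workaround concrete, here is a minimal sketch. It uses the (\1\s?)+ form from the question, plus an alternative spelling, (?:\s+\1)+, which is an assumption of this example rather than something from the post: it puts the mandatory whitespace before each repeated word, so no repeated group can match the empty string. The helper name is_repeated_word_line is hypothetical.

```python
import re

# Pattern from the question, using \s? so the repeated group stays bounded.
BOUNDED = re.compile(r'(\w+)\s+(\1\s?)+')

# Assumed alternative: whitespace is mandatory *before* each repeat,
# so the repeated group can never match the empty string.
SEPARATED = re.compile(r'(\w+)(?:\s+\1)+')


def is_repeated_word_line(line, pattern=SEPARATED):
    """Return True if the whole line is one word repeated two or more times."""
    return pattern.fullmatch(line.strip()) is not None


if __name__ == '__main__':
    for text in ('foo foo foo foo', 'foo  foo', 'foo bar foo', 'foo'):
        print(repr(text), is_repeated_word_line(text))
```

Note that the bounded form only tolerates a single whitespace character after each repeated word, whereas the separated form accepts any amount of whitespace between words.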

Update: I should add that I've seen the problem in Python only. In Perl it works:

('foo foo foo foo' =~ /(\w+)\s+(\1\s*)+/)

I'm not sure it's equivalent, but the Vim version also works:

\(\<\w\+\>\)\_s\+\(\1\_s*\)\+

Update 2: I found another regex implementation for Python, which is said to be intended to replace the current re module someday. I checked, and the error doesn't occur there for the problematic cases above. The module has to be installed separately, either by direct download or via PyPI.

asked Jul 10 '13 20:07

Phoenix

1 Answer

The problem Python has with this is primarily the empty-match issue brought up in the linked post. If you are always going to have at least one whitespace character, I suggest using this instead:

(\s+)+

That said, (\s*)+ doesn't really make sense conceptually either: + requires at least one occurrence of something, while * is satisfied by nothing at all. Repeating (\s?) isn't entirely sensible for the same reason, but you can rationalise it as an optional match: if no whitespace is found the engine simply moves on, whereas * treats the empty string itself as a matched pattern.

However, if you really want to find out what Python's issue with a pattern is, I suggest playing around with ranges. For instance, I came to my conclusion by using these two examples:

re.compile(r"(\s{1,})+")

which compiles fine, and

re.compile(r"(\s{0,})+")

which fails in the same manner.
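The same experiment can be scripted so that several variants are checked in one go. The sketch below is only an illustration: the helper name probe and the candidate list are made up for this example, and the outcome depends on the interpreter version (the question reports Python 3.3.1; other releases may accept patterns that this one rejects), so no expected output is shown.

```python
import re
import sys

# Variants discussed in the question and answer.
CANDIDATES = [
    r'(\s+)+',
    r'(\s?)+',
    r'(\s{1,})+',
    r'(\s{0,1000})+',
    r'(\s{0,})+',
    r'(\s*)+',
]


def probe(pattern):
    """Return None if the pattern compiles, otherwise the error message."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None


if __name__ == '__main__':
    print('Python', sys.version.split()[0])
    for pattern in CANDIDATES:
        outcome = probe(pattern)
        print(pattern.ljust(16), 'ok' if outcome is None else 'error: ' + outcome)
```

Running it under each interpreter you care about shows directly which forms that version refuses to compile.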

At the very least, this suggests it is not a random "bug" in Python. It looks like a deliberate check that applies to every regex pattern that conceptually falls into the same pit. My guess (checked in a few different environments) is that (\s{0,})+ will reliably fail because it explicitly repeats an element that can match nothing and has no upper bound.

However, it seems that a number of other environments (Perl and Vim, for instance) treat a repeated * group as simply optional, and Python does not follow this choice. Accepting such patterns is convenient in many cases, but occasionally leads to weird behaviour. I think rejecting them is a defensible choice: (\s*)+ matches exactly the same strings as \s*, so the outer + adds nothing except uncertainty about how the whitespace is divided between iterations.

In this case it probably wouldn't matter much, but it means there would inevitably be an ambiguity in that regex that couldn't be resolved.
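One way to see the ambiguity mentioned above: a repeated group that can match nothing accepts the same strings as the plain starred form. Here is a small sketch, using (\s?)+ as a stand-in because it compiles even where (\s*)+ is rejected. The helper name accepts and the sample list are made up for illustration.

```python
import re

NESTED = re.compile(r'(\s?)+')  # repeated group that may match nothing
FLAT = re.compile(r'\s*')       # the same set of strings, without nesting

SAMPLES = ['', ' ', '   ', '\t\n ', 'x', ' x', '  x ']


def accepts(pattern, text):
    """Return True if the pattern matches the entire string."""
    return pattern.fullmatch(text) is not None


if __name__ == '__main__':
    for text in SAMPLES:
        print(repr(text), accepts(NESTED, text), accepts(FLAT, text))
```

Both patterns agree on every sample, so when the captured group is not needed the flat form is the simpler choice.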

So you had a problem, and you chose to solve it with regex. Now you have two problems. C'est la vie.

answered Nov 15 '22 22:11

Slater Victoroff
