I don't understand why '(\s*)+' gives an error 'nothing to repeat'. At the same time '(\s?)+' works just fine.
I've discovered that this problem has been known about for quite some time (see, for example, "regex error - nothing to repeat"), but I still see it in Python 3.3.1.
So I am wondering if there is a rational explanation for this behavior.
In reality I want to match a line of repeated words or numbers, for example:
'foo foo foo foo'
I've come up with this:
'(\w+)\s+(\1\s*)+'
It failed because of the second group: (\1\s*)+
In most cases I would probably not have more than one space between words, so (\1\s?)+ would work. For practical purposes, (\1\s{0,1000})+ should also work.
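One way to sidestep the nested quantifier altogether is to make the whitespace mandatory inside the repeated group, so the group can never match an empty string. This is only a minimal sketch, assuming a line is one word repeated at least twice and that `fullmatch` is available (Python 3.4 or later); the names `REPEATED_WORD` and `is_repeated_line` are made up for illustration and are not from the original post:

```python
import re

# Illustrative pattern: a word, followed by one or more repetitions of
# "mandatory whitespace + the same word". The repeated group always
# consumes at least one character, so it is never an empty match.
REPEATED_WORD = re.compile(r'(\w+)(?:\s+\1)+')


def is_repeated_line(line):
    """Return True if the whole line is one word repeated two or more times."""
    return REPEATED_WORD.fullmatch(line) is not None


print(is_repeated_line('foo foo foo foo'))
print(is_repeated_line('foo bar foo'))
```

Because the whitespace is required between repetitions, a line such as 'foo foofoo' is rejected, which the (\1\s?)+ variant would not guarantee.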
Update: I should add that I've seen the problem only in Python. In Perl it works:
`('foo foo foo foo' =~ /(\w+)\s+(\1\s*)+/) `
Not sure it's equivalent but vim also works:
`\(\<\w\+\>\)\_s\+\(\1\_s*\)\+`
Update 2: I found another regex implementation for Python, which is said to replace the current re module someday. I checked, and the error doesn't occur for the problematic cases above. This module has to be installed separately. It can be downloaded here or via PyPI.
By compiling once and re-using the same regex multiple times, we reduce the possibility of typos. When you are using lots of different regexes, you should keep your compiled expressions for those which are used multiple times, so they're not flushed out of the regex cache when the cache is full.
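As a small illustration of compiling once and reusing the result, here is a sketch; the name `WORD` and the sample lines are made up for the example:

```python
import re

# Compile once at module level and reuse the same pattern object,
# instead of retyping the pattern string at every call site.
WORD = re.compile(r'\w+')

lines = ['foo foo', 'bar  baz qux']

# The same compiled object is applied to every line.
counts = [len(WORD.findall(line)) for line in lines]
print(counts)
```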
The 'r' at the start of the pattern string designates a Python "raw" string, which passes backslashes through unchanged. This is very handy for regular expressions (Java needs this feature badly!). I recommend that you always write pattern strings with the 'r', just as a habit.
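To see why the raw prefix matters, compare the same pattern text with and without it. In a normal string literal `\b` is the backspace character, while in a raw string it stays a backslash followed by `b`, which the regex engine reads as a word boundary. The variable names here are only illustrative:

```python
import re

plain = '\bfoo\b'   # each '\b' becomes a single backspace character
raw = r'\bfoo\b'    # backslash + 'b' survives: regex word-boundary assertion

# The two strings do not even have the same length.
print(len(plain), len(raw))

# Only the raw version behaves as a word-boundary pattern.
print(re.search(raw, 'a foo b') is not None)
print(re.search(plain, 'a foo b') is not None)
```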
In regex, the uppercase metacharacter denotes the inverse of its lowercase counterpart: for example, \w for a word character and \W for a non-word character; \d for a digit and \D for a non-digit.
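A quick way to see the lowercase and uppercase classes splitting the same input between them (the sample text is arbitrary):

```python
import re

text = 'abc 123_x!'

# \d matches digits, \D matches everything that is not a digit.
digits = re.findall(r'\d+', text)
non_digits = re.findall(r'\D+', text)

# \w matches letters, digits and underscore; \W matches the rest.
word_runs = re.findall(r'\w+', text)
non_word_runs = re.findall(r'\W+', text)

print(digits, non_digits)
print(word_runs, non_word_runs)
```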
The problem that Python has with this is primarily the null-match issue brought up in the linked post. If you're going to have at least one character, I suggest using this instead:
(\s+)+
That said, asking for (\s*)+ doesn't really make sense either, given that + requires something to exist and * does not. (\s?)+ doesn't quite make sense for the same reason, but you can resolve it mentally by saying that ? is an optional match (if it doesn't find one, it moves on), whereas * interprets nothing as a matched pattern.
However, if you really want to check what Python's issue with something is I suggest playing around with ranges. For instance, I came to my conclusion by using these two examples:
re.compile(r"(\s{1,})+")
which is fine, and
re.compile(r"(\s{0,})+")
which fails in the same manner.
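The same experiment can be scripted so that it is easy to repeat on whichever interpreter is at hand. This is a sketch only: the helper name `compiles` is made up, and the results for the last two patterns are deliberately not stated here, since the failure described in this thread was observed on Python 3.3.1 and other versions of the re module may behave differently:

```python
import re
import sys


def compiles(pattern):
    """Return True if re.compile accepts the pattern, False on re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Report the interpreter version alongside the results, because the
# outcome for the nullable inner groups may depend on it.
print(sys.version_info[:3])
for pattern in (r'(\s+)+', r'(\s{1,})+', r'(\s?)+', r'(\s{0,})+', r'(\s*)+'):
    print(pattern, compiles(pattern))
```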
At the very least this means it is not a "bug" in Python. It is a conscious design decision that acts on every regex pattern that conceptually falls into this same pit. My guess (checked in a few different environments) is that (\s{0,})+ will reliably fail because it explicitly repeats a potentially null element.
However, it seems that a number of environments use * to indicate that a match is optional, and Python does not follow this choice. It makes sense for many cases, but occasionally leads to weird behaviour. I think Guido made the right choice here, as having an inconsistent space presence means you've violated the pumping lemma and your pattern is no longer context free.
In this case it probably wouldn't matter much, but it means there would inevitably be an ambiguity in that regex that couldn't be resolved.
So you had a problem, then you chose to use regex to solve that problem. Now you have 2 problems, C'est la vie.