Standard Regex vs python regex discrepancy

Question

I am reading a book and they provide an example of how to match a given string with regular expressions. Here is their example:

b*(abb*)*(a|∊) - Strings of a's and b's with no consecutive a's.

Now I've tried converting it to python like so:

>> p = re.compile(r'b*(abb*)*(a|)') # OR
>> p = re.compile(r'b*(abb*)*(a|\b)')

# BUT it still doesn't work
>>> p.match('aa')
<_sre.SRE_Match object at 0x7fd9ad028c68>

My question is two-fold:

What is the equivalent of epsilon in python to make the above example work?
Can someone explain to me why theoretical or standard way of doing regular expressions does not work in python? Might it have something to do with the longest vs shortest matching?

Clarification: For people asking what standard regex is - it is the formal language theory standard: http://en.wikipedia.org/wiki/Regular_expression#Formal_language_theory

PierreBdR · Accepted Answer

Actually, the example works just fine ... to a small details. I would write:

>>> p = re.compile('b*(abb*)*a?')
>>> m = p.match('aa')
>>> print m.group(0)
'a'
>>> m = p.match('abbabbabababbabbbbbaaaaa')
>>> print m.group(0)
abbabbabababbabbbbba

Note that the group 0 returns the part of the string matched by the regular expression.

As you can see, the expression matches a succession of a and b without repetition of a. If indeed, you want to check the whole string, you need to changed slightly:

>>> p = re.compile('^b*(abb*)*a?$')
>>> m = p.match('aa')
>>> print m
None

the ^ and $ force recognition of the beginning and end of the string.

At last, you can combine both methods by using the first regular expression, but testing at the end:

>>> len(m.group(0)) == len('aa')

Added: For the second part of the OT, it seems to me there is no discrepancy between the standard regex and the python implementation. Of course, the notation is slightly different, and the python implementation suggest some extensions (as most other packages).

Andriy Drozdyuk · Answer

Thanks for the answers. I feel each answer had part of the answer. Here is what I was looking for.

? symbol is just a shorthand for (something|ε). Thus (a|ε) can be rewritten as a?. So the example becomes:
```
b*(abb*)*a?
```
In python we would write:
```
p = re.compile(r'^b*(abb*)*a?$')
```
The reason straight translation of regular regular expression syntax into python (i.e. copy and paste) does not work is because python matches the shortest substring (if the symbols $ or ^ are absent) while the theoretical regular expressions match longest initial substring.
So for example if we had a string:
```
s = 'aa'
```
Our textbook regex b*(abb*)*a? would not match it because it has two a's. However if we copy it straight to python:
```
>> p = re.compile(r'b*(abb*)*a?')
>> bool(p.match(s))
True
```
This is because our regex matches only the substring 'a' of our string 'aa'.
In order to tell python to do a match on the whole string we have to tell it where the beginning and the end of the string is, with the ^ and $ symbols respectively:
```
>> p = re.compile(r'^b*(abb*)*a?$')
>> bool(p.match(s))
False
```
Note that python regex match() matches at the beginning of the string, so it automatically assumes the ^ at the start. However the search() function does not, and thus we keep the ^.
So for example:
```
>> s = 'aa'
>> p = re.compile(r'b*(abb*)*a?$')
>> bool(p.match(s))
False                 # Correct
>> bool(p.search(s))
True                  # Incorrect - search ignored the first 'a'
```

Standard Regex vs python regex discrepancy

Tags:

python

regex

theory

Andriy Drozdyuk

2 Answers

PierreBdR

Andriy Drozdyuk

Recent Activity

Donate For Us

Standard Regex vs python regex discrepancy

Tags:

python

regex

theory

Andriy Drozdyuk

2 Answers

PierreBdR

Andriy Drozdyuk

Related questions

Recent Activity

Donate For Us