Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Standard Regex vs python regex discrepancy

I am reading a book and they provide an example of how to match a given string with regular expressions. Here is their example:

b*(abb*)*(a|∊) - Strings of a's and b's with no consecutive a's.

Now I've tried converting it to python like so:

>> p = re.compile(r'b*(abb*)*(a|)') # OR
>> p = re.compile(r'b*(abb*)*(a|\b)')

# BUT it still doesn't work
>>> p.match('aa')
<_sre.SRE_Match object at 0x7fd9ad028c68>

My question is two-fold:

  1. What is the equivalent of epsilon in python to make the above example work?
  2. Can someone explain to me why theoretical or standard way of doing regular expressions does not work in python? Might it have something to do with the longest vs shortest matching?

Clarification: For people asking what standard regex is - it is the formal language theory standard: http://en.wikipedia.org/wiki/Regular_expression#Formal_language_theory

like image 387
Andriy Drozdyuk Avatar asked Jan 12 '10 14:01

Andriy Drozdyuk


2 Answers

Actually, the example works just fine ... to a small details. I would write:

>>> p = re.compile('b*(abb*)*a?')
>>> m = p.match('aa')
>>> print m.group(0)
'a'
>>> m = p.match('abbabbabababbabbbbbaaaaa')
>>> print m.group(0)
abbabbabababbabbbbba

Note that the group 0 returns the part of the string matched by the regular expression.

As you can see, the expression matches a succession of a and b without repetition of a. If indeed, you want to check the whole string, you need to changed slightly:

>>> p = re.compile('^b*(abb*)*a?$')
>>> m = p.match('aa')
>>> print m
None

the ^ and $ force recognition of the beginning and end of the string.

At last, you can combine both methods by using the first regular expression, but testing at the end:

>>> len(m.group(0)) == len('aa')

Added: For the second part of the OT, it seems to me there is no discrepancy between the standard regex and the python implementation. Of course, the notation is slightly different, and the python implementation suggest some extensions (as most other packages).

like image 160
PierreBdR Avatar answered Sep 22 '22 17:09

PierreBdR


Thanks for the answers. I feel each answer had part of the answer. Here is what I was looking for.

  1. ? symbol is just a shorthand for (something|ε). Thus (a|ε) can be rewritten as a?. So the example becomes:

    b*(abb*)*a?
    

    In python we would write:

    p = re.compile(r'^b*(abb*)*a?$')
    
  2. The reason straight translation of regular regular expression syntax into python (i.e. copy and paste) does not work is because python matches the shortest substring (if the symbols $ or ^ are absent) while the theoretical regular expressions match longest initial substring.
    So for example if we had a string:

    s = 'aa'
    

    Our textbook regex b*(abb*)*a? would not match it because it has two a's. However if we copy it straight to python:

    >> p = re.compile(r'b*(abb*)*a?')
    >> bool(p.match(s))
    True
    

    This is because our regex matches only the substring 'a' of our string 'aa'.
    In order to tell python to do a match on the whole string we have to tell it where the beginning and the end of the string is, with the ^ and $ symbols respectively:

    >> p = re.compile(r'^b*(abb*)*a?$')
    >> bool(p.match(s))
    False
    

    Note that python regex match() matches at the beginning of the string, so it automatically assumes the ^ at the start. However the search() function does not, and thus we keep the ^.
    So for example:

    >> s = 'aa'
    >> p = re.compile(r'b*(abb*)*a?$')
    >> bool(p.match(s))
    False                 # Correct
    >> bool(p.search(s))
    True                  # Incorrect - search ignored the first 'a'
    
like image 33
Andriy Drozdyuk Avatar answered Sep 26 '22 17:09

Andriy Drozdyuk