 

ply lexmatch regular expression has different groups than a usual re

I am using ply and have noticed a strange discrepancy between the token re match stored in t.lexer.lexmatch and a match produced by a pattern compiled in the usual way with the re module. The group numbers seem to be off by 1.

I have defined a simple lexer to illustrate the behavior I am seeing:

import ply.lex as lex

tokens = ('CHAR',)

def t_CHAR(t):
    r'.'
    t.value = t.lexer.lexmatch
    return t

l = lex.lex()

(I get a warning about t_error but ignore it for now.) Now I feed some input into the lexer and get a token:

l.input('hello')
l.token()

I get a LexToken(CHAR,<_sre.SRE_Match object at 0x100fb1eb8>,1,0). I want to look at the match object:

m = _.value

So now I look at the groups:

m.group() => 'h' as I expect.

m.group(0) => 'h' as I expect.

m.group(1) => 'h', yet I would expect it to not have such a group.

Compare this to creating such a regular expression manually:

import re
p = re.compile(r'.')
m2 = p.match('hello')

This gives different groups:

m2.group() => 'h' as I expect.

m2.group(0) => 'h' as I expect.

m2.group(1) gives IndexError: no such group as I expect.

Does anyone know why this discrepancy exists?

murftown asked Sep 17 '11 01:09


2 Answers

In version 3.4 of PLY, the reason this occurs is related to how the expressions are converted from docstrings to patterns.

Looking at the source really does help - line 746 of lex.py:

c = re.compile("(?P<%s>%s)" % (fname,f.__doc__), re.VERBOSE | self.reflags)

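Each rule's regex is wrapped in a named group, so the whole token match also shows up as a numbered group, which is why m.group(1) returns 'h' in your lexer. I wouldn't recommend relying on this numbering between versions; it is just part of the magic of how PLY works.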
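To see the effect without PLY, here is a minimal sketch using only the re module. It compiles the same pattern twice, once bare and once wrapped the way the lex.py line quoted above does it. The names fname and doc are stand-ins chosen for illustration, not anything taken from PLY itself.

```python
import re

# Stand-ins for the rule function's name and its docstring regex.
fname = "t_CHAR"
doc = r"."

# The pattern as you would normally compile it.
plain = re.compile(doc)

# The pattern wrapped in a named group, mirroring the lex.py line above.
wrapped = re.compile("(?P<%s>%s)" % (fname, doc), re.VERBOSE)

m_plain = plain.match("hello")
m_wrapped = wrapped.match("hello")

print(m_plain.groups())        # no capturing groups at all
print(m_wrapped.group(1))      # the named wrapper is also group 1
print(m_wrapped.group(fname))  # same text, looked up by name
```

The wrapper adds exactly one capturing group, so any group you write inside a rule's regex ends up numbered one higher than it would be in a bare re.compile call.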

Andrew Walker answered Sep 22 '22 11:09


It seems to me that the group number depends on the position of the token function in the file, as if groups were accumulated across the regexes of all declared tokens:

   def t_MYTOKEN1(t):
      r'matchit(\w+)'
      # lexmatch is an attribute of the lexer attached to the token
      t.value = t.lexer.lexmatch.group(1)
      return t

   def t_MYTOKEN2(t):
      r'matchit(\w+)'
      t.value = t.lexer.lexmatch.group(2)
      return t
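The cumulative numbering can be checked with plain re. This is only a sketch: it assumes the wrapped rule patterns are joined with | into a single master pattern, which matches my reading of lex.py but is not something to rely on across versions. The rule list is hypothetical, and the second rule's prefix is changed to other so that both alternatives can actually match.

```python
import re

# Hypothetical rules modelled on the two token functions above.
rules = [
    ("t_MYTOKEN1", r"matchit(\w+)"),
    ("t_MYTOKEN2", r"other(\w+)"),
]

# Wrap each rule in a named group and join the alternatives (assumed layout).
master = re.compile("|".join("(?P<%s>%s)" % (name, pat) for name, pat in rules))

m = master.match("otherfoo")

# Groups are numbered left to right across the whole master pattern:
#   1 = t_MYTOKEN1 wrapper, 2 = its inner (\w+)
#   3 = t_MYTOKEN2 wrapper, 4 = its inner (\w+)
print(m.group(1), m.group(2))  # groups of the alternative that did not match
print(m.group(3), m.group(4))  # wrapper and inner group of the second rule
```

Under that assumption the inner group of a rule is not at a fixed index like 1 or 2. Its number depends on how many groups all earlier rules contribute, so slicing the rule's own text (t.value) or matching it again with a separate pattern is less fragile than hard-coding a group number.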

lolo answered Sep 21 '22 11:09