I am using PLY and have noticed a strange discrepancy between the token match stored in t.lexer.lexmatch and a match object obtained from a pattern compiled in the usual way with the re module. The group numbers seem to be off by one.
I have defined a simple lexer to illustrate the behavior I am seeing:
import ply.lex as lex

tokens = ('CHAR',)

def t_CHAR(t):
    r'.'
    t.value = t.lexer.lexmatch
    return t

l = lex.lex()
(I get a warning about t_error but ignore it for now.) Now I feed some input into the lexer and get a token:
l.input('hello')
l.token()
I get LexToken(CHAR,<_sre.SRE_Match object at 0x100fb1eb8>,1,0). I want to look at the match object:
m = _.value
So now I look at the groups:
m.group()
=> 'h'
as I expect.
m.group(0)
=> 'h'
as I expect.
m.group(1)
=> 'h'
yet I would expect it not to have such a group.
Compare this to creating such a regular expression manually:
import re
p = re.compile(r'.')
m2 = p.match('hello')
This gives different groups:
m2.group()
=> 'h'
as I expect.
m2.group(0)
=> 'h'
as I expect.
m2.group(1)
raises IndexError: no such group
as I expect.
Does anyone know why this discrepancy exists?
In version 3.4 of PLY, the reason this occurs is related to how the expressions are converted from docstrings to patterns.
Looking at the source really does help - line 746 of lex.py:
c = re.compile("(?P<%s>%s)" % (fname,f.__doc__), re.VERBOSE | self.reflags)
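Each rule's docstring is wrapped in a named group, so the rule's whole match also becomes group 1, and every submatch inside your rule is shifted accordingly. You can reproduce the wrapping by hand to see the effect; this is just a sketch of the single-rule case, using the t_CHAR rule from the question:

import re

# Reproduce PLY's wrapping for one rule: the rule name and docstring
# are substituted into the named-group template quoted above.
fname, doc = 't_CHAR', r'.'
c = re.compile("(?P<%s>%s)" % (fname, doc), re.VERBOSE)
m = c.match('hello')
print(m.group(0))         # 'h'
print(m.group(1))         # 'h' -- the enclosing group PLY added
print(m.group('t_CHAR'))  # 'h' -- the same group, looked up by name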
I wouldn't recommend relying on behavior like this across versions - it's just part of the magic of how PLY works.
It seems to me that the matching group index depends on the position of the token function in the file, as if the groups were accumulated across all of the declared token regexes:
def t_MYTOKEN1(t):
    r'matchit(\w+)'
    t.value = t.lexer.lexmatch.group(1)
    return t

def t_MYTOKEN2(t):
    r'matchit(\w+)'
    t.value = t.lexer.lexmatch.group(2)
    return t
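This is consistent with the wrapping shown in the other answer: PLY joins every wrapped rule into one master regex with '|', so group numbers accumulate left to right across all rules. Here is a sketch of what such a master pattern looks like; the rule bodies here are illustrative, not PLY's actual output:

import re

# Illustrative master pattern in the style PLY assembles: each rule is a
# named group, so indices accumulate across rules.
# group 1 = all of t_MYTOKEN1, group 2 = its (\w+),
# group 3 = all of t_MYTOKEN2, group 4 = its (\w+).
master = re.compile(r"(?P<t_MYTOKEN1>foo(\w+))|(?P<t_MYTOKEN2>bar(\w+))")
m = master.match('barbaz')
print(m.lastgroup)  # 't_MYTOKEN2'
print(m.group(3))   # 'barbaz'
print(m.group(4))   # 'baz'

One way to sidestep the counting (my suggestion, not something from the PLY docs) is to put a named group inside the rule, e.g. r'matchit(?P<payload>\w+)', and read it with t.lexer.lexmatch.group('payload'). Note that group names must be unique across all of your rules, since they all end up in the same master regex.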