Some background:
I am writing a parser to retrieve information from sites with a markup language. Standard libraries as wikitools, ... do not work for me as I need to be more specific and adapting them to my needs puts a layer of complexity between me and the problem. Python + "simple" regex got me into difficulties identifying the dependencies between the different "tokens" in the markup language in a transparent manner - so obviously I needed to arrive at PLY at the end of this journey.
Now it seems that PLY identifies the tokens via regex differently compared to Python - but I can't find something on it. I don't want to move on in case I don't understand how PLY determines the tokens within its lexer (as otherwise I would have no control of the logic I am depending on and will fail in a later stage).
Here we go:
import ply.lex as lex
text = r'--- 123456 ---'
token1 = r'-- .* --'
tokens = (
'TEST',
)
t_TEST = token1
lexer = lex.lex(reflags=re.UNICODE, debug=1)
lexer.input(text)
for tok in lexer:
print tok.type, tok.value, tok.lineno, tok.lexpos
results in:
lex: tokens = ('TEST',)
lex: literals = ''
lex: states = {'INITIAL': 'inclusive'}
lex: Adding rule t_TEST -> '-- .* --' (state 'INITIAL')
lex: ==== MASTER REGEXS FOLLOW ====
lex: state 'INITIAL' : regex[0] = '(?P<t_TEST>-- .* --)'
TEST --- 123456 --- 1 0
The last line is surprising - I would have expected the first and the last -
to be missing in --- 123456 ---
in case it is comparable to "search" (and nothing in case it is comparable to "match"). Obviously this is important as then --
cannot be distinguished from ---
(or ===
from ===
), i.e. headlines, enumbering, ... cannot be differentiated.
So why does PLY behaves differently for standard Python/regex? (and how? - couldn't find something in the documentation, or here at stackoverflow).
I would guess it is more my understanding of PLY as the tool is around for quite some time already, i.e. this behavior is in there by intention I would guess. The only somehow related information I could find deals with different groups but does not explain a different behavior of identifying regexes itself. I found nothing in ply-hack as well.
Am I overlooking something stupid simple?
For comparison purposes here standard Python / regex:
import re
text = r'--- 123456 ---'
token1 = r'-- .* --'
p = re.compile(token1)
m = p.search(text)
if m:
print 'Match found: ', m.group()
else:
print 'No match'
m = p.match(text)
if m:
print 'Match found: ', m.group()
else:
print 'No match'
gives:
Match found: -- 123456 --
No match
(as expected, first is the result of "search", second of "match")
My settings: I am working with spyder - this is the terminal display at start:
Python 2.7.5+ (default, Sep 19 2013, 13:49:51)
[GCC 4.8.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Imported NumPy 1.7.1, SciPy 0.12.0, Matplotlib 1.2.1
Type "scientific" for more details.
Thanks for your time and help.
The answer in ply lexmatch regular expression has different groups than a usual re helps here too. In lex.py:
c = re.compile("(?P<%s>%s)" % (fname,f.__doc__), re.VERBOSE | self.reflags)
Notice the VERBOSE
flag. It means the re
engine ignores the whitespace characters in your regexps. So r'-- .* --'
really means r'--.*--'
, which indeed matches completely a string like '--- foobar ---'
. See the documentation of re.VERBOSE
for more details.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With