
Why does PLY treat regular expressions differently from Python/re?

Tags: python, regex, ply

Some background:

I am writing a parser to retrieve information from sites that use a markup language. Standard libraries such as wikitools, ... do not work for me, as I need to be more specific, and adapting them to my needs puts a layer of complexity between me and the problem. With Python + "simple" regex I had difficulty identifying the dependencies between the different "tokens" in the markup language in a transparent manner, so I ended up at PLY at the end of this journey.

Now it seems that PLY identifies tokens via regex differently from Python, but I can't find anything on it. I don't want to move on without understanding how PLY determines the tokens within its lexer (otherwise I would have no control over the logic I am depending on and would fail at a later stage).

Here we go:

import re  # needed for re.UNICODE below
import ply.lex as lex

text = r'--- 123456 ---'
token1 = r'-- .* --'
tokens = (
   'TEST',
)
t_TEST = token1

lexer = lex.lex(reflags=re.UNICODE, debug=1)
lexer.input(text)
for tok in lexer:
    print tok.type, tok.value, tok.lineno, tok.lexpos

results in:

lex: tokens   = ('TEST',)
lex: literals = ''
lex: states   = {'INITIAL': 'inclusive'}
lex: Adding rule t_TEST -> '-- .* --' (state 'INITIAL')
lex: ==== MASTER REGEXS FOLLOW ====
lex: state 'INITIAL' : regex[0] = '(?P<t_TEST>-- .* --)'
TEST --- 123456 --- 1 0

The last line is surprising. I would have expected the first and the last - to be missing from --- 123456 --- if PLY behaved like "search" (and no token at all if it behaved like "match"). This matters because -- then cannot be distinguished from --- (or == from ===), i.e. headlines, enumerations, ... cannot be differentiated.

So why does PLY behave differently from standard Python/regex? (And how? I couldn't find anything in the documentation, or here at stackoverflow.)

I would guess the problem is my understanding of PLY, as the tool has been around for quite some time, i.e. this behavior is presumably intentional. The only somewhat related information I could find deals with different groups but does not explain why regexes themselves would be matched differently. I found nothing in ply-hack either.

Am I overlooking something stupidly simple?

For comparison, here is the same pattern with standard Python / regex:

import re

text = r'--- 123456 ---'
token1 = r'-- .* --'

p = re.compile(token1)

m = p.search(text)
if m:
    print 'Match found: ', m.group()
else:
    print 'No match'

m = p.match(text)
if m:
    print 'Match found: ', m.group()
else:
    print 'No match'

gives:

Match found:  -- 123456 --
No match

(as expected, first is the result of "search", second of "match")

My settings: I am working with spyder - this is the terminal display at start:

Python 2.7.5+ (default, Sep 19 2013, 13:49:51) 
[GCC 4.8.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.

Imported NumPy 1.7.1, SciPy 0.12.0, Matplotlib 1.2.1
Type "scientific" for more details.

Thanks for your time and help.

programkai asked Oct 01 '22 12:10


1 Answer

The answer to "ply lexmatch regular expression has different groups than a usual re" helps here too. In lex.py:

c = re.compile("(?P<%s>%s)" % (fname,f.__doc__), re.VERBOSE | self.reflags)

Notice the VERBOSE flag. It means the re engine ignores unescaped whitespace characters in your regexps. So r'-- .* --' really means r'--.*--', which indeed fully matches a string like '--- foobar ---'. See the documentation of re.VERBOSE for more details.
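The effect can be reproduced without PLY, using only the standard re module. This is a minimal sketch, not PLY's own code: the variable names are made up for illustration, and the escaped-space variant is one possible way to keep the spaces significant under re.VERBOSE (a character class such as [ ] should work as well).

```python
import re

text = '--- 123456 ---'
pattern = r'-- .* --'

# Plain compile: the spaces in the pattern are literal characters.
plain = re.compile(pattern)
plain_match = plain.match(text)        # fails: third character is '-', not ' '

# Compiled the way lex.py does it, with re.VERBOSE: unescaped whitespace
# in the pattern is dropped, so this behaves like r'--.*--'.
verbose = re.compile(pattern, re.VERBOSE)
verbose_match = verbose.match(text)    # succeeds and covers the whole string

# Assumption for illustration: escaping the spaces keeps them significant
# even under re.VERBOSE.
escaped = re.compile(r'--\ .*\ --', re.VERBOSE)
escaped_match = escaped.match(text)    # fails again, like the plain pattern
escaped_search = escaped.search(text)  # finds '-- 123456 --'

print(plain_match is None)
print(verbose_match.group())
print(escaped_match is None)
print(escaped_search.group())
```

Under that assumption, writing the token in the question as t_TEST = r'--\ .*\ --' should make PLY treat the spaces as part of the pattern again.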

Armin Rigo answered Oct 05 '22 22:10