I am working on a simple SQL SELECT-like query parser, and I need to be able to capture subqueries that can occur at certain places, literally. I found that lexer states are the best solution, and was able to do a POC using curly braces to mark the start and end. However, the subqueries will be delimited by parentheses, not curly braces, and parentheses can occur in other places as well, so I can't begin the state on every open-paren. This information is readily available to the parser, so I was hoping to call begin and end at appropriate locations in the parser rules. This, however, didn't work, because the lexer seems to tokenize the stream all at once, ahead of the parser, and so the tokens get generated in the INITIAL state. Is there a workaround for this problem?
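For reference, the subquery lexer state used below is declared as an exclusive state, along these lines (the declaration isn't part of the outline):

    states = (
        ('subquery', 'exclusive'),   # exclusive: INITIAL rules don't apply inside a subquery
    )

Here is an outline of what I tried to do: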
    def p_value_subquery(p):
        """
        value : start_sub end_sub
        """
        # p[2] holds the subquery text returned by end_subquery()
        p[0] = "( " + p[2] + " )"

    def p_start_sub(p):
        """
        start_sub : OPAR
        """
        start_subquery(p.lexer)
        p[0] = p[1]

    def p_end_sub(p):
        """
        end_sub : CPAR
        """
        subquery = end_subquery(p.lexer)
        p[0] = subquery
The start_subquery() and end_subquery() are defined like this:
    def start_subquery(lexer):
        lexer.code_start = lexer.lexpos        # Record the starting position
        lexer.level = 1
        lexer.begin('subquery')

    def end_subquery(lexer):
        value = lexer.lexdata[lexer.code_start:lexer.lexpos-1]
        lexer.lineno += value.count('\n')
        lexer.begin('INITIAL')
        return value
The lexer rules are simply there to detect the matching close-paren:
@lex.TOKEN(r"\(")
def t_subquery_SUBQST(t):
lexer.level += 1
@lex.TOKEN(r"\)")
def t_subquery_SUBQEN(t):
lexer.level -= 1
@lex.TOKEN(r".")
def t_subquery_anychar(t):
pass
I would appreciate any help.
This answer may only be partially helpful, but I would also suggest looking at section "6.11 Embedded Actions" of the PLY documentation (http://www.dabeaz.com/ply/ply.html). In a nutshell, it is possible to write grammar rules in which actions occur mid-rule. It would look something like this:
    def p_somerule(p):
        '''somerule : A B possible_sub_query LBRACE sub_query RBRACE'''

    def p_possible_sub_query(p):
        '''possible_sub_query :'''
        ...
        # Check if the last token read was LBRACE. If so, flip lexer state.
        # Sadly, it doesn't seem that the token is easily accessible.
        # Would have to hack it.
        if last_token == 'LBRACE':
            p.lexer.begin('SUBQUERY')
Regarding the behavior of the lexer, there is only one token of lookahead being used. So, in any particular grammar rule, at most one extra token has already been read. If you're going to flip lexer states, the flip needs to happen after the triggering token has been consumed by the parser, but before the parser asks to read the next incoming token.
Also, if possible, I would try to stay away from the yacc() error-handling machinery as part of a solution. There is way too much black magic going on in error handling; the more you can avoid it, the better.
I'm a bit pressed for time at the moment, but this seems to be something that could be investigated for the next version of PLY. Will put it on my to-do list.
Based on the PLY author's response, I came up with this better solution. I have yet to figure out how to return the subquery as a token, but the rest looks much better and no longer needs to be considered a hack.
    def start_subquery(lexer):
        lexer.code_start = lexer.lexpos        # Record the starting position
        lexer.level = 1
        lexer.begin("subquery")

    def end_subquery(lexer):
        lexer.begin("INITIAL")

    def get_subquery(lexer):
        value = lexer.lexdata[lexer.code_start:lexer.code_end-1]
        lexer.lineno += value.count('\n')
        return value

    @lex.TOKEN(r"\(")
    def t_subquery_OPAR(t):
        lexer.level += 1

    @lex.TOKEN(r"\)")
    def t_subquery_CPAR(t):
        lexer.level -= 1
        if lexer.level == 0:
            lexer.code_end = lexer.lexpos        # Record the ending position
            return t

    @lex.TOKEN(r".")
    def t_subquery_anychar(t):
        pass
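One piece assumed but not shown here: PLY warns when an exclusive state has no error rule of its own, and the '.' rule above doesn't match newlines, so something along these lines is also needed:

    def t_subquery_error(t):
        # Skip anything the rules above don't match (e.g. newlines, since '.'
        # doesn't match them). Assumed here; not part of the original listing.
        t.lexer.skip(1)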
    def p_value_subquery(p):
        """
        value : check_subquery_start OPAR check_subquery_end CPAR
        """
        p[0] = "( " + get_subquery(p.lexer) + " )"

    def p_check_subquery_start(p):
        """
        check_subquery_start :
        """
        # Here last_token would be yacc's lookahead.
        if last_token.type == "OPAR":
            start_subquery(p.lexer)

    def p_check_subquery_end(p):
        """
        check_subquery_end :
        """
        # Here last_token would be yacc's lookahead.
        if last_token.type == "CPAR":
            end_subquery(p.lexer)
    last_token = None

    def p_error(p):
        global subquery_retry_pos
        if p is None:
            print >> sys.stderr, "ERROR: unexpected end of query"
        else:
            print >> sys.stderr, "ERROR: Skipping unrecognized token", p.type, "(" + \
                p.value + ") at line:", p.lineno, "and column:", find_column(p.lexer.lexdata, p)
            # Just discard the token and tell the parser it's okay.
            yacc.errok()

    def get_token():
        # Wrap lexer.token() so the parser rules above can see yacc's lookahead.
        global last_token
        last_token = lexer.token()
        return last_token

    def parse_query(input, debug=0):
        lexer.input(input)
        return parser.parse(input, tokenfunc=get_token, debug=debug)
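For illustration, a call would look something like this (the query text is just a made-up example; the rest of the grammar isn't shown here):

    result = parse_query("SELECT name FROM tab WHERE id IN (SELECT id FROM other)")

The tokenfunc hook is the key design point: yacc pulls every token through get_token(), so last_token always holds the parser's current lookahead by the time the embedded check_subquery_start/check_subquery_end rules run.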