PyParsing lookaheads and greedy expressions

I'm writing a parser for a query language using PyParsing, and I've gotten stuck on (what I believe to be) an issue with lookaheads. One clause type in the query is intended to split strings into 3 parts (fieldname, operator, value) such that fieldname is one word, operator is one or more words, and value is a word, a quoted string, or a parenthesized list of these.

My data look like:

author is william
author is 'william shakespeare'
author is not shakespeare
author is in (william,'the bard',shakespeare)

And my current parser for this clause is written as:

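from pyparsing import Word, alphas, OneOrMore, QuotedString, Literal, Group, delimitedList, originalTextFor

fieldname = Word(alphas)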

operator = OneOrMore(Word(alphas))

single_value = Word(alphas) ^ QuotedString(quoteChar="'")
list_value = Literal("(") + Group(delimitedList(single_value)) + Literal(")")
value = single_value ^ list_value

clause = fieldname + originalTextFor(operator) + value

Obviously this fails because the operator element is greedy and will gobble up the value if it can. From reading other similar questions and the docs, I've gathered that I need to manage that lookahead with a NotAny or FollowedBy, but I haven't been able to figure out how to make that work.

Michael C. O'Connor asked Feb 11 '12 17:02


1 Answer

This is a good place to Be The Parser. Or more accurately, Make The Parser Think Like You Do. Ask yourself, "In 'author is shakespeare', how do I know that 'shakespeare' is not part of the operator?" You know that 'shakespeare' is the value because it is at the end of the query; there is nothing more after it. So operator words aren't just words of alphas, they are words of alphas that are not followed by the end of the string. Now build that lookahead logic into your definition of operator:

operator = OneOrMore(Word(alphas) + ~FollowedBy(StringEnd()))

And I think this will start parsing better for you.
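To see the lookahead at work, here is a minimal sketch that drops the revised operator into the grammar from the question and runs it over the four sample queries. Everything except the operator line is copied from the question; the names samples and results are just for illustration, and the sketch assumes each clause is the whole input string (that is what makes the StringEnd test meaningful).

```python
from pyparsing import (
    Word, alphas, OneOrMore, QuotedString, Literal, Group,
    delimitedList, originalTextFor, StringEnd, FollowedBy,
)

fieldname = Word(alphas)

# An operator word is a word of alphas that is NOT the last thing in the string.
operator = OneOrMore(Word(alphas) + ~FollowedBy(StringEnd()))

# Value definitions are unchanged from the question.
single_value = Word(alphas) ^ QuotedString(quoteChar="'")
list_value = Literal("(") + Group(delimitedList(single_value)) + Literal(")")
value = single_value ^ list_value

clause = fieldname + originalTextFor(operator) + value

# Illustrative driver: the sample queries from the question.
samples = [
    "author is william",
    "author is 'william shakespeare'",
    "author is not shakespeare",
    "author is in (william,'the bard',shakespeare)",
]
results = [clause.parseString(s).asList() for s in samples]

for s, r in zip(samples, results):
    print(s, "->", r)
```

The operator stops before a bare word that ends the string, and it also stops on its own at a quote or an opening parenthesis, since neither can start a Word(alphas). So 'is not' and 'is in' come back as single operator strings, with the value left over for the last part of the clause.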

Some other tips:

  • I only use the '^' operator if there will be some possible ambiguity, like if I was going to parse a string with numbers that could be integers or hex. If I used Word(nums) | Word(hexnums), then I might misprocess "123ABC" as just the leading "123". By changing '|' to '^', all of the alternatives will be tested, and the longest match chosen. In my example of parsing decimal or hex integers, I could have gotten the same result by reversing the alternatives and testing for Word(hexnums) first. In your query language, there is no way to confuse a quoted string with a non-quoted single word value (one leads with ' or ", the other doesn't), so there is no reason to use '^'; '|' will suffice. The same goes for value = single_value ^ list_value.

  • Adding results names to the key components of your query string will make it easier to work with later:

    clause = fieldname("fieldname") + originalTextFor(operator)("operator") + value("value")

    Now you can access the parsed values by name instead of by parse position (which will get tricky and error-prone once you start getting more complicated with optional fields and such):

    queryParts = clause.parseString('author is william')
    print(queryParts.fieldname)
    print(queryParts.operator)
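The first tip, on '|' versus '^', is easy to check for yourself. Here is a small sketch using the decimal-or-hex example from that tip; the variable names are made up for illustration.

```python
from pyparsing import Word, nums, hexnums

# '|' builds a MatchFirst: the first alternative that matches wins.
first_match = Word(nums) | Word(hexnums)

# '^' builds an Or: every alternative is tried and the longest match wins.
longest_match = Word(nums) ^ Word(hexnums)

# Reordering the alternatives makes '|' behave well for this input too.
reordered = Word(hexnums) | Word(nums)

text = "123ABC"
first_result = first_match.parseString(text).asList()
longest_result = longest_match.parseString(text).asList()
reordered_result = reordered.parseString(text).asList()

print(first_result)      # ['123']
print(longest_result)    # ['123ABC']
print(reordered_result)  # ['123ABC']
```

Since parseString does not insist on consuming the whole input by default, the first version quietly returns just '123' instead of raising an error, which is exactly the kind of misprocessing to watch out for.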

PaulMcG answered Oct 01 '22 19:10