Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pyparsing nestedExpr and nested parentheses

I am working on a very simple "querying syntax" usable by people with reasonable technical skills (i.e., not coders per se, but able to touch on the subject)

A typical example of what they would enter on a form is:

address like street
AND
vote =  True
AND
(
  (
    age>=25
    AND
    gender = M
  )
  OR
  (
    age between [20,30]
    AND
    gender = F
  )
  OR
  (
    age >= 70
    AND
    eyes != blue
  )
)

With

  1. no quote required
  2. potentially infinite nesting of parentheses
  3. simple AND|OR linking

I am using pyparsing (well, trying to anyway) and reaching something:

from pyparsing import *

OPERATORS = [
    '<',
    '<=',
    '>',
    '>=',
    '=',
    '!=',
    'like'
    'regexp',
    'between'
]

unicode_printables = u''.join(unichr(c) for c in xrange(65536)
                              if not unichr(c).isspace())

# user_input is the text sent by the client form
user_input = ' '.join(user_input.split())
user_input = '(' + user_input + ')'

AND = Keyword("AND").setName('AND')
OR = Keyword("OR").setName('OR')

FIELD = Word(alphanums).setName('FIELD')
OPERATOR = oneOf(OPERATORS).setName('OPERATOR')
VALUE = Word(unicode_printables).setName('VALUE')
CRITERION = FIELD + OPERATOR + VALUE

QUERY = Forward()
NESTED_PARENTHESES = nestedExpr('(', ')')
QUERY << ( CRITERION | AND | OR | NESTED_PARENTHESES )

RESULT = QUERY.parseString(user_input)
RESULT.pprint()

The output is:

[['address',
  'like',
  'street',
  'AND',
  'vote',
  '=',
  'True',
  'AND',
  [['age>=25', 'AND', 'gender', '=', 'M'],
   'OR',
   ['age', 'between', '[20,30]', 'AND', 'gender', '=', 'F'],
   'OR',
   ['age', '>=', '70', 'AND', 'eyes', '!=', 'blue']]]]

Which I am only partially happy with - the main reason being that the desired final output would look like this:

[
  {
    "field" : "address",
    "operator" : "like",
    "value" : "street",
  },
  'AND',
  {
    "field" : "vote",
    "operator" : "=",
    "value" : True,
  },
  'AND',
  [
    [
      {
        "field" : "age",
        "operator" : ">=",
        "value" : 25,
      },
      'AND'
      {
        "field" : "gender",
        "operator" : "=",
        "value" : "M",
      }
    ],
    'OR',
    [
      {
        "field" : "age",
        "operator" : "between",
        "value" : [20,30],
      },
      'AND'
      {
        "field" : "gender",
        "operator" : "=",
        "value" : "F",
      }
    ],
    'OR',
    [
      {
        "field" : "age",
        "operator" : ">=",
        "value" : 70,
      },
      'AND'
      {
        "field" : "eyes",
        "operator" : "!=",
        "value" : "blue",
      }
    ],
  ]
]

Many thanks!

EDIT

After Paul's answer, this is what the code looks like. Obviously it works much more nicely :-)

unicode_printables = u''.join(unichr(c) for c in xrange(65536)
                              if not unichr(c).isspace())

user_input = ' '.join(user_input.split())

AND = oneOf(['AND', '&'])
OR = oneOf(['OR', '|'])
FIELD = Word(alphanums)
OPERATOR = oneOf(OPERATORS)
VALUE = Word(unicode_printables)
COMPARISON = FIELD + OPERATOR + VALUE

QUERY = infixNotation(
    COMPARISON,
    [
        (AND, 2, opAssoc.LEFT,),
        (OR, 2, opAssoc.LEFT,),
    ]
)

class ComparisonExpr:
    def __init__(self, tokens):
        self.tokens = tokens

    def __str__(self):
        return "Comparison:('field': {!r}, 'operator': {!r}, 'value': {!r})".format(*self.tokens.asList())

COMPARISON.addParseAction(ComparisonExpr)

RESULT = QUERY.parseString(user_input).asList()
print type(RESULT)
from pprint import pprint
pprint(RESULT)

The output is:

[
  [
    <[snip]ComparisonExpr instance at 0x043D0918>,
    'AND',
    <[snip]ComparisonExpr instance at 0x043D0F08>,
    'AND',
    [
      [
        <[snip]ComparisonExpr instance at 0x043D3878>,
        'AND',
        <[snip]ComparisonExpr instance at 0x043D3170>
      ],
      'OR',
      [
        [
          <[snip]ComparisonExpr instance at 0x043D3030>,
          'AND',
          <[snip]ComparisonExpr instance at 0x043D3620>
        ],
        'AND',
        [
          <[snip]ComparisonExpr instance at 0x043D3210>,
          'AND',
          <[snip]ComparisonExpr instance at 0x043D34E0>
        ]
      ]
    ]
  ]
]

Is there a way to return RESULT with dictionaries and not ComparisonExpr instances?

EDIT2

Came up with a naive and very specific solution, but which works for me so far:

[snip]
class ComparisonExpr:
    def __init__(self, tokens):
        self.tokens = tokens

    def __str__(self):
        return "Comparison:('field': {!r}, 'operator': {!r}, 'value': {!r})".format(*self.tokens.asList())

    def asDict(self):
        return {
            "field": self.tokens.asList()[0],
            "operator": self.tokens.asList()[1],
            "value": self.tokens.asList()[2]
        }

[snip]
RESULT = QUERY.parseString(user_input).asList()[0]
def convert(list):
    final = []
    for item in list:
        if item.__class__.__name__ == 'ComparisonExpr':
            final.append(item.asDict())
        elif item in ['AND', 'OR']:
            final.append(item)
        elif item.__class__.__name__ == 'list':
            final.append(convert(item))
        else:
            print 'ooops forgotten something maybe?'

    return final

FINAL = convert(RESULT)
pprint(FINAL)

Which outputs:

[{'field': 'address', 'operator': 'LIKE', 'value': 'street'},
   'AND',
   {'field': 'vote', 'operator': '=', 'value': 'true'},
   'AND',
   [[{'field': 'age', 'operator': '>=', 'value': '25'},
     'AND',
     {'field': 'gender', 'operator': '=', 'value': 'M'}],
    'OR',
    [[{'field': 'age', 'operator': 'BETWEEN', 'value': '[20,30]'},
      'AND',
      {'field': 'gender', 'operator': '=', 'value': 'F'}],
     'AND',
     [{'field': 'age', 'operator': '>=', 'value': '70'},
      'AND',
      {'field': 'eyes', 'operator': '!=', 'value': 'blue'}]]]]

Again thanks to Paul for pointing me if a right direction!

The only thing unknown left is for me to turn 'true' into True and '[20,30]' into [20, 30].

like image 695
Hal Avatar asked May 25 '17 00:05

Hal


1 Answers

nestedExpr is a convenience expression in pyparsing, to make it easy to define text with matched opening and closing characters. When you want to parse the nested contents, then nestedExpr is usually not well structured enough.

The query syntax you are trying to parse is better served using pyparsing's infixNotation method. You can see several examples at the pyparsing wiki's Examples page - SimpleBool is is very similar to what you are parsing.

"Infix notation" is a general parsing term for expressions where the operator is in between its related operands (vs. "postfix notation" where the operator follows the operands, as in "2 3 +" instead of "2 + 3"; or "prefix notation" which looks like "+ 2 3"). Operators can have an order of precedence in evaluation that can override left-to-right order - for instance, in "2 + 3 * 4", precedence of operations dictates that multiplication gets evaluated before addition. Infix notation also supports using parentheses or other grouping characters to override that precedence, as in "(2 + 3) * 4" to force the addition operation to be done first.

pyparsing's infixNotation method takes a base operand expression, and then a list of operator definition tuples, in order of precedence. For instance, 4-function integer arithmetic would look like:

parser = infixNotation(integer,
             [
             (oneOf('* /'), 2, opAssoc.LEFT),
             (oneOf('+ -'), 2, opAssoc.LEFT),
             ])

Meaning that we will parse integer operands, with '*' and '/' binary left-associative operations and '+' and '-' binary operations, in that order. Support for parentheses to override the order is built into infixNotation.

Query strings are often some combination of boolean operations NOT, AND, and OR, and typically evaluated in that order of precedence. In your case, the operands for these operators are comparison expressions, like "address = street" or "age between [20,30]". So if you define an expression for a comparison expression, of the form fieldname operator value, then you can use infixNotation to do the right grouping of AND's and OR's:

import pyparsing as pp
query_expr = pp.infixNotation(comparison_expr,
                [
                    (NOT, 1, pp.opAssoc.RIGHT,),
                    (AND, 2, pp.opAssoc.LEFT,),
                    (OR, 2, pp.opAssoc.LEFT,),
                ])

Finally, I suggest you define a class to take the comparison tokens as class init args, then you can attach behavior to that class to evaluate the comparisons and output debug strings, something like:

class ComparisonExpr:
    def __init__(self, tokens):
        self.tokens = tokens

    def __str__(self):
        return "Comparison:('field': {!r}, 'operator': {!r}, 'value': {!r})".format(
                            *self.tokens.asList())

# attach the class to the comparison expression
comparison_expr.addParseAction(ComparisonExpr)

Then you can get output like:

query_expr.parseString(sample).pprint()

[[Comparison:({'field': 'address', 'operator': 'like', 'value': 'street'}),
  'AND',
  Comparison:({'field': 'vote', 'operator': '=', 'value': True}),
  'AND',
  [[Comparison:({'field': 'age', 'operator': '>=', 'value': 25}),
    'AND',
    Comparison:({'field': 'gender', 'operator': '=', 'value': 'M'})],
   'OR',
   [Comparison:({'field': 'age', 'operator': 'between', 'value': [20, 30]}),
    'AND',
    Comparison:({'field': 'gender', 'operator': '=', 'value': 'F'})],
   'OR',
   [Comparison:({'field': 'age', 'operator': '>=', 'value': 70}),
    'AND',
    Comparison:({'field': 'eyes', 'operator': '!=', 'value': 'blue'})]]]]

The SimpleBool.py example has more details to show you how to create this class, and related classes for NOT, AND, and OR operators.

EDIT:

"Is there a way to return RESULT with dictionaries and not ComparisonExpr instances?" The __repr__ method on your ComparisonExpr class is being called instead of __str__. Easiest solution is to add to your class:

__repr__ = __str__

Or just rename __str__ to __repr__.

"The only thing unknown left is for me to turn 'true' into True and '[20,30]' into [20, 30]"

Try:

CK = CaselessKeyword  # 'cause I'm lazy
bool_literal = (CK('true') | CK('false')).setParseAction(lambda t: t[0] == 'true')
LBRACK,RBRACK = map(Suppress, "[]")
# parse numbers using pyparsing_common.number, which includes the str->int conversion parse action
num_list = Group(LBRACK + delimitedList(pyparsing_common.number) + RBRACK)

Then add these to your VALUE expression:

VALUE = bool_literal | num_list | Word(unicode_printables)

Lastly:

from pprint import pprint
pprint(RESULT)

I got so tired of importing pprint all the time to do just this, I just added it to the API for ParseResults. Try:

RESULT.pprint()  # no import required on your part

or

print(RESULT.dump()) # will also show indented list of named fields

EDIT2

LASTLY, results names are good to learn. If you make this change to COMPARISON, everything still works as you have it:

COMPARISON = FIELD('field') + OPERATOR('operator') + VALUE('value')

But now you can write:

def asDict(self):
    return self.tokens.asDict()

And you can access the parsed values by name instead of index position (either using result['field'] notation or result.field notation).

like image 198
PaulMcG Avatar answered Nov 15 '22 10:11

PaulMcG