I am working on a very simple "querying syntax" usable by people with reasonable technical skills (i.e., not coders per se, but able to touch on the subject)
A typical example of what they would enter on a form is:
address like street
AND
vote = True
AND
(
(
age>=25
AND
gender = M
)
OR
(
age between [20,30]
AND
gender = F
)
OR
(
age >= 70
AND
eyes != blue
)
)
With
I am using pyparsing (well, trying to anyway) and reaching something:
from pyparsing import *
OPERATORS = [
'<',
'<=',
'>',
'>=',
'=',
'!=',
'like'
'regexp',
'between'
]
unicode_printables = u''.join(unichr(c) for c in xrange(65536)
if not unichr(c).isspace())
# user_input is the text sent by the client form
user_input = ' '.join(user_input.split())
user_input = '(' + user_input + ')'
AND = Keyword("AND").setName('AND')
OR = Keyword("OR").setName('OR')
FIELD = Word(alphanums).setName('FIELD')
OPERATOR = oneOf(OPERATORS).setName('OPERATOR')
VALUE = Word(unicode_printables).setName('VALUE')
CRITERION = FIELD + OPERATOR + VALUE
QUERY = Forward()
NESTED_PARENTHESES = nestedExpr('(', ')')
QUERY << ( CRITERION | AND | OR | NESTED_PARENTHESES )
RESULT = QUERY.parseString(user_input)
RESULT.pprint()
The output is:
[['address',
'like',
'street',
'AND',
'vote',
'=',
'True',
'AND',
[['age>=25', 'AND', 'gender', '=', 'M'],
'OR',
['age', 'between', '[20,30]', 'AND', 'gender', '=', 'F'],
'OR',
['age', '>=', '70', 'AND', 'eyes', '!=', 'blue']]]]
Which I am only partially happy with - the main reason being that the desired final output would look like this:
[
{
"field" : "address",
"operator" : "like",
"value" : "street",
},
'AND',
{
"field" : "vote",
"operator" : "=",
"value" : True,
},
'AND',
[
[
{
"field" : "age",
"operator" : ">=",
"value" : 25,
},
'AND'
{
"field" : "gender",
"operator" : "=",
"value" : "M",
}
],
'OR',
[
{
"field" : "age",
"operator" : "between",
"value" : [20,30],
},
'AND'
{
"field" : "gender",
"operator" : "=",
"value" : "F",
}
],
'OR',
[
{
"field" : "age",
"operator" : ">=",
"value" : 70,
},
'AND'
{
"field" : "eyes",
"operator" : "!=",
"value" : "blue",
}
],
]
]
Many thanks!
EDIT
After Paul's answer, this is what the code looks like. Obviously it works much more nicely :-)
unicode_printables = u''.join(unichr(c) for c in xrange(65536)
if not unichr(c).isspace())
user_input = ' '.join(user_input.split())
AND = oneOf(['AND', '&'])
OR = oneOf(['OR', '|'])
FIELD = Word(alphanums)
OPERATOR = oneOf(OPERATORS)
VALUE = Word(unicode_printables)
COMPARISON = FIELD + OPERATOR + VALUE
QUERY = infixNotation(
COMPARISON,
[
(AND, 2, opAssoc.LEFT,),
(OR, 2, opAssoc.LEFT,),
]
)
class ComparisonExpr:
def __init__(self, tokens):
self.tokens = tokens
def __str__(self):
return "Comparison:('field': {!r}, 'operator': {!r}, 'value': {!r})".format(*self.tokens.asList())
COMPARISON.addParseAction(ComparisonExpr)
RESULT = QUERY.parseString(user_input).asList()
print type(RESULT)
from pprint import pprint
pprint(RESULT)
The output is:
[
[
<[snip]ComparisonExpr instance at 0x043D0918>,
'AND',
<[snip]ComparisonExpr instance at 0x043D0F08>,
'AND',
[
[
<[snip]ComparisonExpr instance at 0x043D3878>,
'AND',
<[snip]ComparisonExpr instance at 0x043D3170>
],
'OR',
[
[
<[snip]ComparisonExpr instance at 0x043D3030>,
'AND',
<[snip]ComparisonExpr instance at 0x043D3620>
],
'AND',
[
<[snip]ComparisonExpr instance at 0x043D3210>,
'AND',
<[snip]ComparisonExpr instance at 0x043D34E0>
]
]
]
]
]
Is there a way to return RESULT with dictionaries and not ComparisonExpr
instances?
EDIT2
Came up with a naive and very specific solution, but which works for me so far:
[snip]
class ComparisonExpr:
def __init__(self, tokens):
self.tokens = tokens
def __str__(self):
return "Comparison:('field': {!r}, 'operator': {!r}, 'value': {!r})".format(*self.tokens.asList())
def asDict(self):
return {
"field": self.tokens.asList()[0],
"operator": self.tokens.asList()[1],
"value": self.tokens.asList()[2]
}
[snip]
RESULT = QUERY.parseString(user_input).asList()[0]
def convert(list):
final = []
for item in list:
if item.__class__.__name__ == 'ComparisonExpr':
final.append(item.asDict())
elif item in ['AND', 'OR']:
final.append(item)
elif item.__class__.__name__ == 'list':
final.append(convert(item))
else:
print 'ooops forgotten something maybe?'
return final
FINAL = convert(RESULT)
pprint(FINAL)
Which outputs:
[{'field': 'address', 'operator': 'LIKE', 'value': 'street'},
'AND',
{'field': 'vote', 'operator': '=', 'value': 'true'},
'AND',
[[{'field': 'age', 'operator': '>=', 'value': '25'},
'AND',
{'field': 'gender', 'operator': '=', 'value': 'M'}],
'OR',
[[{'field': 'age', 'operator': 'BETWEEN', 'value': '[20,30]'},
'AND',
{'field': 'gender', 'operator': '=', 'value': 'F'}],
'AND',
[{'field': 'age', 'operator': '>=', 'value': '70'},
'AND',
{'field': 'eyes', 'operator': '!=', 'value': 'blue'}]]]]
Again thanks to Paul for pointing me if a right direction!
The only thing unknown left is for me to turn 'true'
into True
and '[20,30]'
into [20, 30]
.
nestedExpr
is a convenience expression in pyparsing, to make it easy to define text with matched opening and closing characters. When you want to parse the nested contents, then nestedExpr
is usually not well structured enough.
The query syntax you are trying to parse is better served using pyparsing's infixNotation
method. You can see several examples at the pyparsing wiki's Examples page - SimpleBool is is very similar to what you are parsing.
"Infix notation" is a general parsing term for expressions where the operator is in between its related operands (vs. "postfix notation" where the operator follows the operands, as in "2 3 +" instead of "2 + 3"; or "prefix notation" which looks like "+ 2 3"). Operators can have an order of precedence in evaluation that can override left-to-right order - for instance, in "2 + 3 * 4", precedence of operations dictates that multiplication gets evaluated before addition. Infix notation also supports using parentheses or other grouping characters to override that precedence, as in "(2 + 3) * 4" to force the addition operation to be done first.
pyparsing's infixNotation
method takes a base operand expression, and then a list of operator definition tuples, in order of precedence. For instance, 4-function integer arithmetic would look like:
parser = infixNotation(integer,
[
(oneOf('* /'), 2, opAssoc.LEFT),
(oneOf('+ -'), 2, opAssoc.LEFT),
])
Meaning that we will parse integer operands, with '*' and '/' binary left-associative operations and '+' and '-' binary operations, in that order. Support for parentheses to override the order is built into infixNotation
.
Query strings are often some combination of boolean operations NOT, AND, and OR, and typically evaluated in that order of precedence. In your case, the operands for these operators are comparison expressions, like "address = street" or "age between [20,30]". So if you define an expression for a comparison expression, of the form fieldname operator value
, then you can use infixNotation
to do the right grouping of AND's and OR's:
import pyparsing as pp
query_expr = pp.infixNotation(comparison_expr,
[
(NOT, 1, pp.opAssoc.RIGHT,),
(AND, 2, pp.opAssoc.LEFT,),
(OR, 2, pp.opAssoc.LEFT,),
])
Finally, I suggest you define a class to take the comparison tokens as class init args, then you can attach behavior to that class to evaluate the comparisons and output debug strings, something like:
class ComparisonExpr:
def __init__(self, tokens):
self.tokens = tokens
def __str__(self):
return "Comparison:('field': {!r}, 'operator': {!r}, 'value': {!r})".format(
*self.tokens.asList())
# attach the class to the comparison expression
comparison_expr.addParseAction(ComparisonExpr)
Then you can get output like:
query_expr.parseString(sample).pprint()
[[Comparison:({'field': 'address', 'operator': 'like', 'value': 'street'}),
'AND',
Comparison:({'field': 'vote', 'operator': '=', 'value': True}),
'AND',
[[Comparison:({'field': 'age', 'operator': '>=', 'value': 25}),
'AND',
Comparison:({'field': 'gender', 'operator': '=', 'value': 'M'})],
'OR',
[Comparison:({'field': 'age', 'operator': 'between', 'value': [20, 30]}),
'AND',
Comparison:({'field': 'gender', 'operator': '=', 'value': 'F'})],
'OR',
[Comparison:({'field': 'age', 'operator': '>=', 'value': 70}),
'AND',
Comparison:({'field': 'eyes', 'operator': '!=', 'value': 'blue'})]]]]
The SimpleBool.py example has more details to show you how to create this class, and related classes for NOT, AND, and OR operators.
EDIT:
"Is there a way to return RESULT with dictionaries and not ComparisonExpr instances?"
The __repr__
method on your ComparisonExpr
class is being called instead of __str__
. Easiest solution is to add to your class:
__repr__ = __str__
Or just rename __str__
to __repr__
.
"The only thing unknown left is for me to turn 'true' into True and '[20,30]' into [20, 30]"
Try:
CK = CaselessKeyword # 'cause I'm lazy
bool_literal = (CK('true') | CK('false')).setParseAction(lambda t: t[0] == 'true')
LBRACK,RBRACK = map(Suppress, "[]")
# parse numbers using pyparsing_common.number, which includes the str->int conversion parse action
num_list = Group(LBRACK + delimitedList(pyparsing_common.number) + RBRACK)
Then add these to your VALUE expression:
VALUE = bool_literal | num_list | Word(unicode_printables)
Lastly:
from pprint import pprint
pprint(RESULT)
I got so tired of importing pprint
all the time to do just this, I just added it to the API for ParseResults
. Try:
RESULT.pprint() # no import required on your part
or
print(RESULT.dump()) # will also show indented list of named fields
EDIT2
LASTLY, results names are good to learn. If you make this change to COMPARISON, everything still works as you have it:
COMPARISON = FIELD('field') + OPERATOR('operator') + VALUE('value')
But now you can write:
def asDict(self):
return self.tokens.asDict()
And you can access the parsed values by name instead of index position (either using result['field']
notation or result.field
notation).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With