Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

String expression parsing tips?

I got bored during the holiday season this year and randomly decided to write a simple list comprehension/filtering library for Java (I know there are some great ones out there, I just wanted to write it my self for the hell of it).

For this list:

LinkedList<Person> list = new LinkedList<Person>();
            list.add(new Person("Jack", 20));
            list.add(new Person("Liz", 58));
            list.add(new Person("Bob", 33));

Syntax is:

Iterable<Person> filtered = Query.from(list).where(
    Condition.ensure("Age", Op.GreaterEqual, 21)
    .and(Condition.ensure("Age", Op.LessEqual, 50));

I know its ugly, but if I use static imports and use shorter method names it becomes pretty concise.

The following syntax is the ultimate goal:

Iterable<Person> list2 = Query.from(list).where("x=> x.Age >= 21 & x.Age <= 50");

Apparently expression parsing is not my strongest area, im having trouble with parsing nested/multiple conditionals. Anyone know of some resources/literature I might find helpful?

I've only got single conditional expressions being sucessfully parsed from String lambda syntax at the moment: "x=> x.Name == Jack". My underlying Expression structure is fairly solid and can easily handle any amount of nesting, the issue is just the expression parsing from a string.

Thanks

Just for kicks, here is a little insight as to how the expression structure behind the scenes can work (obviously I could have specified 'op.GreaterEqual', etc... in the following example, but I wanted to demonstrate how it is flexible to any amount of nesting):

Condition minAge1 = Condition.ensure("Age", Op.Equal, 20);
Condition minAge2 = Condition.ensure("Age", Op.Greater, 20);
Expression minAge = new Expression(minAge1, Express.Or, minAge2);
Expression maxAge = Condition.ensure("Age", Op.Equal, 50).or(Condition.ensure("Age", Op.Less, 50));
Expression ageExpression = new Expression(minAge, Express.And, maxAge);

Condition randomException = Condition.ensure("Name", Op.Equal, "Liz");
Expression expressionFinal = new Expression(ageExpression, Express.Or, randomException);
like image 877
jdc0589 Avatar asked Jan 17 '10 07:01

jdc0589


3 Answers

Basically what you'll want is a recursive descent parser for expressions. It's a topic featured heavily in compiler theory so any book on compilers will cover the topic. In formal grammar terms it will look something like this:

condition  : orAtom ('||' orAtom)+ ;
orAtom     : atom ('&&' atom)+ ;
atom       : '(' condition ')'
           | expression ;
expression : value OPER value ;
value      : VARIABLE | LITERAL '
VARIABLE   : (LETTER | '_') (LETTER | DIGIT | '_')* ;
LITERAL    : NUMBER
           | STRING ;
NUMBER     : '-'? DIGIT+ ('.' DIGIT+)? ;
STRING     : '"' . CHAR* . '"' '
CHAR       : ('\\' | '\"' | .) + ;
LETTER     : 'a'..'z' | 'A'..'Z' ;
DIGIT      : '0'..'9' ;
OPER       : '>' | '>=' | '<' | '<=' | '=' | '!=' ;

The grammar above is (mostly) in ANTLR form as that what I'm most familiar with.

Parsing boolean or arithmetic expressions is a classic introductory topic when dealing with parsing so you should be able to find plenty of literature on it. If you want to pursue ANTLR (since you're using Java) I'd highly suggest reading The Definitive ANTLR Reference: Building Domain-Specific Languages.

If all this looks like overkill and all a bit much to take in, you might be right. It's a tough topic to get started in.

One alternative you have is not to create an arbitrary string expression but instead use a fluent interface (like you're doing):

List results = from(source)
  .where(var("x").greaterThan(25), var("x").lessThan(50))
  .select("field1", "field2");

as that is stating the expression tree in code and should be easier to implement.

like image 128
cletus Avatar answered Nov 02 '22 18:11

cletus


To add to cletus answer, you will first want to define your grammer.

The following expression grammer works pretty well for most cases, unfortunately, normal recursive descent doesn't allow you to define the recursive part first in each production. This would cause you to call the production method recursively until you get a stack overflow.

        orexpr  ::=  orexpr '|' andexpr  
                  |  andexpr  

        andexpr  ::=  andexpr '&' comparison
                   |  comparison

        comparison  ::= addexpr compareOp addexpr
                     |  addexpr

        addexpr  ::=  addexpr '+' mulexpr
                   |  addexpr '-' mulexpr
                   |  mulexpr

        mulexpr  ::=  mulexpr '*' value
                   |  mulexpr '/' value
                   |  mulexpr '%' value
                   |  value

        value    ::=  integer
                   |  float
                   |  variable
                   |  quotation
                   |  '(' orexpr ')'

Normal recursive descent would require you to define mulexpr, for example as:

         mulexpr ::= value '*' mulexpr  
                    | value '/' mulexpr
                    | value '%' mulexpr

But the problem with this grammer is that the expression tree will be built in such a way that your order of operations will all be in reverse.

Compromise: Use recursive descent in reverse on the original grammer written above. Parse the expression from right to left. Build your tree from right to left. It will preserve the order of operations.

In recursive descent you usually write a parse method for each production. The parseOr() method might look as follows:

 private MyExpression parseOr(MyScanner scanner) {
        MyExpression expression = null;

        MyExpression rightExpr = parseAnd(scanner);
        Token token = scanner.getCurrentToken();
        if (token.hasValue("|") {
            expression = new MyExpression();
            expression.setOperator(OR);
            Token nextToken = scanner.getNextToken(); // remember, this is scanning in reverse
            MyExpression leftExpression = parseOr(scanner);
            expression.setLeft(leftExpression);
            expression.setRight(rightExpression);
        }
        else {
            expression = rightExpression;
        }
        return expression;
    } 

like image 35
rayd09 Avatar answered Nov 02 '22 19:11

rayd09


Thanks for all the tips guys. I decided most of this was way more than I needed, so I ended up regexing the hell out it to get things in to manageable groups which I could easily parse in 20-30 lines of code.

Ive got the string LambdaExpression interface working almost as well as the fluent interface, just one or two small bug.

I will probably keep developing it a little just for fun, but it's obviously too inefficient to really use since its about 90% reflection based.

like image 23
jdc0589 Avatar answered Nov 02 '22 18:11

jdc0589