Confusion in jparsec

Tags:

java

jparsec

I'm attempting to use jparsec to define and utilize my fairly simple grammar, but am completely confused about how to go about it. I don't know at this point whether it's my inadequate understanding of the problem space, or jparsec's sparse and uninformative documentation. Or both.

I have a grammar something like this:

foo='abc' AND bar<>'def' OR (biz IN ['a', 'b', 'c'] AND NOT baz = 'foo')

So you can see it supports operators such as AND, OR, NOT, IN, =, <>. It also supports arbitrarily nested parentheses to dictate precedence.

I think I got fairly far with tokenizing. Here's what I have:

public final class NewParser {
    // lexing
    private static final Terminals OPERATORS = Terminals.operators("=", "OR", "AND", "NOT", "(", ")", "IN", "[", "]", ",", "<>");
    private static final Parser<?> WHITESPACE = Scanners.WHITESPACES;
    private static final Parser<?> FIELD_NAME_TOKENIZER = Terminals.Identifier.TOKENIZER;
    private static final Parser<?> QUOTED_STRING_TOKENIZER = Terminals.StringLiteral.SINGLE_QUOTE_TOKENIZER.or(Terminals.StringLiteral.DOUBLE_QUOTE_TOKENIZER);
    private static final Parser<?> IGNORED = Parsers.or(Scanners.WHITESPACES).skipMany();
    private static final Parser<?> TOKENIZER = Parsers.or(OPERATORS.tokenizer(), WHITESPACE, FIELD_NAME_TOKENIZER, QUOTED_STRING_TOKENIZER).many();

    @Test
    public void test_tokenizer() {
        Object result = TOKENIZER.parse("foo='abc' AND bar<>'def' OR (biz IN ['a', 'b', 'c'] AND NOT baz = 'foo')");
        Assert.assertEquals("[foo, =, abc, null, AND, null, bar, <>, def, null, OR, null, (, biz, null, IN, null, [, a, ,, null, b, ,, null, c, ], null, AND, null, NOT, null, baz, null, =, null, foo, )]", result.toString());
    }
}

test_tokenizer passes, so I think it's working OK.

Now, I already have a type hierarchy that represents the syntax. For example, I have classes called Node, BinaryNode, FieldNode, LogicalAndNode, ConstantNode et cetera. And what I'm trying to do is create a Parser that takes my tokens and spits out a Node. And this is where I keep getting stuck.
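For reference, the hierarchy is shaped roughly like this (simplified and illustrative only; not my exact fields or methods):

abstract class Node {}

class FieldNode extends Node {
    private final String name;   // e.g. "foo" or "bar"
    FieldNode(String name) { this.name = name; }
    String name() { return name; }
}

class ConstantNode extends Node {
    private final String value;  // e.g. "abc"
    ConstantNode(String value) { this.value = value; }
    String value() { return value; }
}

abstract class BinaryNode extends Node {
    final Node left, right;
    BinaryNode(Node left, Node right) { this.left = left; this.right = right; }
}

class LogicalAndNode extends BinaryNode {
    LogicalAndNode(Node left, Node right) { super(left, right); }
}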

I thought I'd start with something really simple like this:

private static Parser<FieldNode> fieldNodeParser =
    Parsers.sequence(FIELD_NAME_TOKENIZER)
    .map(new Map<Object, FieldNode>() {
        @Override
        public FieldNode map(Object from) {
            Fragment fragment = (Fragment)from;
            return new FieldNode(fragment.text());
        }
    });

I thought I'd be able to do this:

public static Parser<Node> parser = fieldNodeParser.from(TOKENIZER);

But that gives me a compile error:

The method from(Parser<? extends Collection<Token>>) in the type Parser<FieldNode> is not applicable for the arguments (Parser<capture#6-of ?>)

So it looks like my generics are skewed somewhere, but I have no idea where or how to fix this. I'm not even certain I'm going about this in the right fashion. Can anyone enlighten me?


1 Answer

You are mixing two different levels of "parsers": string-level parsers (also known as scanners or lexers) and token-level parsers. This is how JParsec implements the traditional separation of lexical and syntactic analysis.

To make your code compile cleanly, you can add a call to the .cast() method at the end of the parser's definition, but that won't fix your problem: the next error will be something like "cannot run a character-level parser at token level". The problem comes from the use of .from() to define your top-level parser, which implicitly sets the boundary between the two worlds.
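Before the full code, here is a tiny toy illustration of that layering (a minimal sketch on a comma-separated list of identifiers, separate from your grammar), just to show where .from() sits:

@Test
public void toy_two_level_example() {
    // Character level: a tokenizer built from operators and keywords (no keywords here).
    Terminals commaTerms = Terminals.caseSensitive(new String[] { "," }, new String[0]);
    Parser<?> toyTokenizer = commaTerms.tokenizer();

    // Token level: consume IDENTIFIER tokens separated by "," tokens.
    Parser<List<String>> idents =
            Terminals.fragment(Tokens.Tag.IDENTIFIER).sepBy(commaTerms.token(","));

    // from() is the boundary: it runs the token-level parser over the tokenizer's output,
    // skipping whitespace between tokens.
    List<String> names = idents.from(toyTokenizer, Scanners.WHITESPACES).parse("a, b, c");
    Assert.assertEquals("[a, b, c]", names.toString());
}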

Here is a working implementation (and unit tests) for your parser:

public class SampleTest {

    // Token-level parser: turns an IDENTIFIER token into a FieldNode.
    private static Parser<FieldNode> fieldNodeParser =
            Parsers.sequence(Terminals.fragment(Tokens.Tag.IDENTIFIER).map(new Map<String, FieldNode>() {
                @Override
                public FieldNode map(String from) {
                    return new FieldNode(from);
                }
            })).cast();

    // Binds the token-level parser to the tokenizer, with whitespace as the delimiter.
    public static Parser<FieldNode> parser = fieldNodeParser.from(NewParser.TOKENIZER, Scanners.WHITESPACES);

    @Test
    public void test_tokenizer() {
        Object result = Parsers.or(NewParser.TOKENIZER, Scanners.WHITESPACES.cast()).many().parse("foo='abc' AND bar<>'def' OR (biz IN ['a', 'b', 'c'] AND NOT baz = 'foo')");
        Assert.assertEquals("[foo, =, abc, null, AND, null, bar, <>, def, null, OR, null, (, biz, null, IN, null, [, a, ,, null, b, ,, null, c, ], null, AND, null, NOT, null, baz, null, =, null, foo, )]", result.toString());
    }

    @Test
    public void test_parser() throws Exception {
        FieldNode foo = parser.parse("foo");
        assertEquals(foo.text, "foo");
    }

    public static final class NewParser {
        // lexing
        static final Terminals OPERATORS = Terminals.operators("=", "OR", "AND", "NOT", "(", ")", "IN", "[", "]", ",", "<>");
        static final Parser<String> FIELD_NAME_TOKENIZER = Terminals.Identifier.TOKENIZER.source();
        static final Parser<?> QUOTED_STRING_TOKENIZER = Terminals.StringLiteral.SINGLE_QUOTE_TOKENIZER.or(Terminals.StringLiteral.DOUBLE_QUOTE_TOKENIZER);
        static final Terminals TERMINALS = Terminals.caseSensitive(new String[] { "=", "(", ")", "[", "]", ",", "<>" }, new String[] { "OR", "AND", "NOT", "IN" });
        static final Parser<?> TOKENIZER = Parsers.or(TERMINALS.tokenizer(), QUOTED_STRING_TOKENIZER);
    }

    private static class FieldNode {
        final String text;

        public FieldNode(String text) {
            this.text = text;
        }
    }
}

What I changed is:

  • I use the Terminals.caseSensitive method to create a lexer for terminals only (keywords, operators and identifiers). The identifier lexer used is implicitly the one provided natively by jParsec (e.g. Terminals.IDENTIFIER),
  • I use the .from() method with TOKENIZER and WHITESPACES as the separator,
  • The fieldNodeParser uses Terminals.fragment(...) to parse tokens, not characters.
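If you want to go further toward the full grammar (AND, OR, NOT and parentheses), one possible direction is jparsec's OperatorTable together with the Map2/Map3 functors. The sketch below is untested, the Node subclasses and their constructors (ComparisonNode, NotNode, LogicalAndNode, LogicalOrNode) are only assumed from the class names in the question, it reuses TERMINALS and TOKENIZER from the NewParser class above, and it leaves the IN [...] construct out for brevity:

// Untested sketch: the Node constructors below are assumed, not taken from the code above.
static final Terminals TERMS = NewParser.TERMINALS;

// field <comparison-operator> 'constant', e.g. foo='abc' or bar<>'def'
static final Parser<Node> COMPARISON = Parsers.sequence(
        Terminals.fragment(Tokens.Tag.IDENTIFIER),
        TERMS.token("=").retn("=").or(TERMS.token("<>").retn("<>")),
        Terminals.StringLiteral.PARSER,
        new Map3<String, String, String, Node>() {
            @Override
            public Node map(String field, String op, String value) {
                return new ComparisonNode(field, op, value); // assumed constructor
            }
        });

static final Parser<Node> EXPRESSION;
static {
    Parser.Reference<Node> ref = Parser.newReference();
    // An operand is either a parenthesized sub-expression or a single comparison.
    Parser<Node> unit = ref.lazy()
            .between(TERMS.token("("), TERMS.token(")"))
            .or(COMPARISON);
    // Higher precedence numbers bind tighter: NOT > AND > OR.
    Parser<Node> expr = new OperatorTable<Node>()
            .prefix(TERMS.token("NOT").retn(new Map<Node, Node>() {
                @Override public Node map(Node n) { return new NotNode(n); } // assumed
            }), 30)
            .infixl(TERMS.token("AND").retn(new Map2<Node, Node, Node>() {
                @Override public Node map(Node a, Node b) { return new LogicalAndNode(a, b); } // assumed
            }), 20)
            .infixl(TERMS.token("OR").retn(new Map2<Node, Node, Node>() {
                @Override public Node map(Node a, Node b) { return new LogicalOrNode(a, b); } // assumed
            }), 10)
            .build(unit);
    ref.set(expr);
    // Bind the token-level grammar to the tokenizer, exactly as for fieldNodeParser above.
    EXPRESSION = expr.from(NewParser.TOKENIZER, Scanners.WHITESPACES);
}

The precedence integers only need to be ordered so that NOT binds tighter than AND, and AND tighter than OR; OperatorTable then builds the expression parser from the operand parser passed to build().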

Hope that helps, Arnaud
