Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parsing a simple text grammar with Superpower

I'm trying to create a parser with Superpower. I have already taken a look to the samples I've found in the repo, but they are a bit difficult to understand, at least for a beginner like me :) So I came with this little challenge.

I have invented a very basic grammar just to learn. I have thought of an elevator that follows a list of instructions to go up, down and wait.

Example:

(UP 100),
(DOWN 200),
(DOWN 100),
(DOWN @1),
(UP @3),
(WAIT),
(UP 300)

As you see, it consists of a list of comma-separated verbs to move, for example, an elevator.

  • The verbs are UP, DOWN or WAIT.
  • Every verb is enclosed by parentheses: ( )
  • UP and DOWN require either an absolute number or a relative number, that indicates the floor to which the elevator should move. Relative floor numbers come with a @ before the number.
  • WAIT doesn't accept any number, because it stops the elevator for a while.

I really would like to learn how to create a token based parser for this grammar as a start in order to understand how to use SuperPower.

like image 378
SuperJMN Avatar asked Dec 10 '17 16:12

SuperJMN


2 Answers

Step 1 of writing any Superpower parser is to figure out what the token kinds are. You have something like:

// ECL - Elevator Control Language ;-)
enum EclToken {
    LParen,
    RParen,
    UpKeyword,
    DownKeyword,
    WaitKeyword,
    AtSymbol,
    Number,
    Comma
}

Step 2, write a Tokenizer<EclToken>. This is left as a direct programming task by Superpower v1 - there aren't many helpers to lean on, you just need to write the code as in the examples.

The tokenizer takes the input string, strips out the whitespace, and figures out what the sequence of tokens is.

For your example input, the first line will be:

// (UP 100),
LParen, UpKeyword, Number, RParen, Comma

For tokens like Number that contain content, the span associated with the Result<EclToken> will point to the portion of the input string corresponding to the token. In this line, the number will be a TextSpan covering 100.

Step 3 is to figure out what you want to parse the input into. For programming languages with nested expressions, this is usually an AST. In the case of the ECL sample, it's pretty simple so you might cut it down to:

struct ElevatorCommand {        
    public int Distance; // + or -
    public bool IsRelative;
}

Step 4, the parser. This is usually embedded in a static class. The job of the parser is to build up more complex results (an ElevatorCommand[], here), from simpler results (number, movement).

This is where Superpower does the heavy lifting, particularly with respect to expectations and errors.

static class EclParser 
{
    static TokenListParser<EclToken, int> Number =
        Token.EqualTo(EclToken.Number).Apply(Numerics.IntegerInt32);
}

The first thing we do is define the parser for numbers; this one applies a built-in TextParser<int> to the content of an EclToken.Number span.

You can see more parsing machinery in this example.

A few more clues to help you find the way (not syntax checked, let alone compiled/tested):

    static TokenListParser<EclToken, ElevatorCommand> Up =
        from _ in Token.EqualTo(EclToken.UpKeyword)
        from distance in Number
        select new ElevatorCommand {
            Distance = distance,
            IsRelative = false
        };

    static TokenListParser<EclToken, ElevatorCommand> Command =
        from lp in Token.EqualTo(EclToken.LParen)
        from command in Up // .Or(Down).Or(Wait)
        from rp in Token.EqualTo(EclToken.RParen)
        select command;

    static TokenListParser<EclToken, ElevatorCommand[]> Commands =
        Command.ManyDelimitedBy(Token.EqualTo(EclToken.Comma));
}

Commands is the completed parser that you can apply to the input.

It's best to build up the parser incrementally, testing each smaller parser on chunks of input they're expected to parse.

like image 95
Nicholas Blumhardt Avatar answered Nov 02 '22 03:11

Nicholas Blumhardt


OK, I have finally managed to get it. It wasn't so difficult with @Nicholas Blumhardt's guidance :)

I have created a project in GitHub to illustrate the scenario. Since the classes are big for a post, I'm linking to the files:

  • This is the tokenizer
  • This is the class with the parsers.
like image 37
SuperJMN Avatar answered Nov 02 '22 02:11

SuperJMN