Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is my recursive descent parser right-associative

I'm writing my own programming language, and I have the tokenizer (lexer) all done. But for parsing, I'm having trouble writing a recursive descent parser. It seems to be right associative, when it should be left, and I don’t know why. For example, it parses 1-2-3 as 1-(2-3), not the correct (1-2)-3.

I've trimmed off most of the other code leaving just what's reproducible:

using System.Collections.Generic;

namespace Phi
{
    public enum TokenType
    {
        Plus, // '+'
        Minus, // '-'
        IntegerLiteral,
    }

    public interface INode
    {
        // Commented out as they aren't relevant
        //NodeType GetNodeType();
        //void Print(string indent, bool last);
    }

    class Program
    {
        static void Main(string[] args)
        {
            List<Token> tokens = new List<Token>()
            {
                new Token(TokenType.IntegerLiteral, "1"),
                new Token(TokenType.Minus, ""),
                new Token(TokenType.IntegerLiteral, "2"),
                new Token(TokenType.Minus, ""),
                new Token(TokenType.IntegerLiteral, "3"),
            };

            int consumed = ParseAdditiveExpression(tokens, out INode root);
        }

        private static int ParseAdditiveExpression(List<Token> block, out INode node)
        {
            // <additiveExpr> ::= <multiplicativeExpr> <additiveExprPrime>
            int consumed = ParseMultiplicataveExpression(block, out INode left);
            consumed += ParseAdditiveExpressionPrime(GetListSubset(block, consumed), out INode right);

            if (block[1].Type == TokenType.Plus)
                node = (right == null) ? left : new AdditionNode(left, right);
            else
                node = (right == null) ? left : new SubtractionNode(left, right);
            return consumed;
        }
        private static int ParseAdditiveExpressionPrime(List<Token> block, out INode node)
        {
            // <additiveExprPrime> ::= "+" <multiplicataveExpr> <additiveExprPrime>
            //                     ::= "-" <multiplicativeExpr> <additiveExprPrime>
            //                     ::= epsilon
            node = null;
            if (block.Count == 0)
                return 0;
            if (block[0].Type != TokenType.Plus && block[0].Type != TokenType.Minus)
                return 0;

            int consumed = 1 + ParseMultiplicataveExpression(GetListSubset(block, 1), out INode left);
            consumed += ParseAdditiveExpressionPrime(GetListSubset(block, consumed), out INode right);

            if (block[0].Type == TokenType.Plus)
                node = (right == null) ? left : new AdditionNode(left, right);
            else
                node = (right == null) ? left : new SubtractionNode(left, right);
            return consumed;
        }

        private static int ParseMultiplicataveExpression(List<Token> block, out INode node)
        {
            // <multiplicativeExpr> ::= <castExpr> <multiplicativeExprPrime>
            // unimplemented; all blocks are `Count == 1` with an integer
            node = new IntegerLiteralNode(block[0].Value);
            return 1;
        }

        private static List<T> GetListSubset<T>(List<T> list, int start)
        {
            return list.GetRange(start, list.Count - start);
        }
    }
}

As for the definition of AdditionNode, SubtractionNode, and MultiplicationNode, they are all the same and only serve semantic purposes. For brevity, here's just AdditionNode:

namespace Phi
{
    public class AdditionNode : INode
    {
        public AdditionNode(INode left, INode right)
        {
            Left = left;
            Right = right;
        }

        public INode Left { get; }
        public INode Right { get; }

        // Print and GetNodeType have been removed as they aren't relevant
    }
}

As for my problem, when I run Phi.Program, as said at the beginning, it is parsing with the wrong associativity. Here is root after ParseAdditiveExpression completes:

enter image description hereenter image description hereenter image description here

As you can see, it groups the 2 with the 3, not the 1. Why is it doing this?

like image 620
Cole Tobin Avatar asked Nov 30 '22 08:11

Cole Tobin


1 Answers

As I noted in a comment, the problem is that you have confused the rightmost child of a binary operator with the rightmost child of an additiveprime. The rightmost child of a binary operator is an expression. The rightmost side of an additiveprime is an additiveprime, so on "tree node type" grounds alone we must conclude that you have built a wrong parse tree.

Keeping track of the "logical type" of every parse artifact is a powerful technique for finding bugs in parsers. Another that I like, which is tragically underused, is attribute every token in the program to exactly one parse tree node. If you did that then you would quickly realize that the token for the operator is logically in two places: in the binary operator, and in its rightmost child. This also tells us that something is wrong.

What doesn't help is that your infrastructure for parsing is a godawful mess of passing around numbers and out parameters. Your parser lacks discipline. Your parser code looks like counting tokens is the most important thing that a parser does, and that everything else is incidental.

Parsing is a very crisp problem, and parser methods should do one thing, one thing only, and do it perfectly. The structure of the parser, and the structure of each method, should directly reflect the grammar being parsed. There should be almost no arithmetic on integers in the parser, since parsing is about building a parse tree, not about counting tokens.

I build recursive descent parsers for a living. Let me show you how I would build this parser, were I building it quickly for my own purposes. (Were I building it for a production application it would be different in many respects, but we're going for easy to understand here.)


All right, let's get started. First thing is: when you are stuck on a problem, solve a simpler problem. Let's simplify the problem in the following manner:

  • Assume that the token stream is a well-formed program. No error detection.
  • Tokens are strings.
  • The grammar is : E ::= T E', E' ::= + T E' | nil, and T is a term consisting of a single token.

All right. Start by creating types that represent each of those things.

sealed class Term : ParseTree 
{
    public string Value { get; private set; }
    public Term(string value) { this.Value = value; }
    public override string ToString() { return this.Value; }
}
sealed class Additive : ParseTree 
{ 
    public ParseTree Term { get; private set; }
    public ParseTree Prime { get; private set; }
    public Additive(ParseTree term, ParseTree prime) {
        this.Term = term;
        this.Prime = prime;
    }
    public override string ToString() { return "" + this.Term + this.Prime; }
}
sealed class AdditivePrime : ParseTree 
{ 
    public string Operator { get; private set; }
    public ParseTree Term { get; private set; }
    public ParseTree Prime { get; private set; }
    public AdditivePrime(string op, ParseTree term, ParseTree prime) {
        this.Operator = op;
        this.Term = term;
        this.Prime = prime;
    }
    public override string ToString() { return this.Operator + this.Term + this.Prime; }
}
sealed class Nil : ParseTree 
{
    public override string ToString() { return ""; }
}

Notice a few things:

  • Abstract classes are abstract.
  • Concrete classes are sealed.
  • Everything is immutable.
  • Everything knows how to print itself.
  • No nulls! NO NULLS. Nulls cause crashes. You have a production called nil, so make a type called Nil to represent it.

Next: What do we want the parser to look like from the user's perspective? We want a sequence of tokens to go in, and we want a parse tree to come out. Great. So the public surface should be:

sealed class Parser
{
    public Parser(List<string> tokens) { ... }
    public ParseTree Parse() { ... }
}

And if we've done everything right then the call site looks like this:

public static void Main()
{
    var tokens = new List<string>() { "1" , "+" , "2" , "+" , "3" , "+" , "4"};
    var parser = new Parser(tokens);
    var result = parser.Parse();
    System.Console.WriteLine(result);
}

Super. Now all we have to do is implement it.

A parser keeps track of the list of tokens and the current token under consideration. Do not tramp that information around from method to method. It is logically part of the parser, so keep it in the parser.

public sealed class Parser
{
    private List<string> tokens;
    private int current;    
    public Parser(List<string> tokens)
    {
        this.tokens = tokens;
        this.current = 0;
    }

The language consists right now only of additive expressions, so:

    public ParseTree Parse()
    {
        return ParseAdditive();
    }

Great, we are done two members of the parser already. Now, what does ParseAdditive do? It does what it says on the tin. It parses an additive expression, which has the grammar E:: T E', so that is what it does and ALL that it does, for now.

private ParseTree ParseAdditive()
{
    var term = ParseTerm();
    var prime = ParseAdditivePrime();
    return new Additive(term, prime);
}

If your parser methods do not look incredibly simple then you are doing something wrong. The whole point of recursive descent parsers is that they are easy to understand and easy to implement.

Now we can see how to implement ParseTerm(); it just consumes a token:

private string Consume() 
{
  var t = this.tokens[this.current];
  this.current += 1;
  return t;
}
private ParseTree ParseTerm() {
  return new Term(Consume());
}

Again, we are assuming that the token stream is well formed. Of course this would crash if it was ill-formed, but that's a problem for another day.

And finally, the last one is a little harder because there are two cases.

private bool OutOfTokens() 
{
  return this.current >= this.tokens.Count;
}
private ParseTree ParseAdditivePrime()
{
    if (OutOfTokens())
        return new Nil();
    var op = Consume();
    var term = ParseTerm();
    var prime = ParseAdditivePrime();
    return new AdditivePrime(op, term, prime);
}

So simple. Again, all your methods should look like exactly what they do.

Notice that I did NOT write

private ParseTree ParseAdditivePrime()
{
    if (this.current >= this.tokens.Count)
        return new Nil();

Keep the text of the program reading like the operations it is implementing. What do we want to know? Are we out of tokens? So say that. Don't make the reader -- yourself -- have to think about oh, did I mean > or < or <= or... just don't. Solve the problem once, put the solution in a method with a good name, and then call the method. Your future self will thank your past self for taking that care.

Notice also that I did not write the C# 7 super slick:

private ParseTree ParseAdditivePrime() => 
  OutOfTokens() ? new Nil() : new AdditivePrime(Consume(), ParseTerm(), ParseAdditivePrime());

That you can write your parser methods as one-liners is a sign that you've designed a good parser, but that doesn't mean that you should. It's often easier to understand and debug a parser if you keep it in sequential statement form rather than slick little one-liners. Exercise good judgment.


OK, we have solved the easier problem! Now let's solve a slightly harder problem. We have solved parsing the grammar E ::= T E', E' ::= + T E' | nil, but the grammar we want to parse is B :: T | B + T.

Notice that I am not confusing E, which is a term-and-suffix, with B, which is either a T or a B, a +, and a T. Since B and E are different, represent them by different types.

Let's make a type for B:

sealed class Binary : ParseTree 
{
    public ParseTree Left { get; private set; }
    public string Operator { get; private set; }
    public ParseTree Right { get; private set; }
    public Binary(ParseTree left, string op, ParseTree right) 
    {
        this.Left = left; 
        this.Operator = op;
        this.Right = right;
    }
    public override string ToString() 
    {
        return "(" + Left + Operator + Right + ")";
    }
}

Note that I have added parentheses in the output as a visual aid to help us see that it is left associative.

Now, suppose we have an Additive in hand and we need a Binary. How are we going to do that?

An additive is always a term and a prime. So there are two cases. Either the prime is nil, or it is not.

If the prime is nil then we're done: a Term is acceptable where a Binary is required, so we can just pass back the term.

If the prime is not nil then the prime is op, term, prime. Somehow we have to get a Binary out of that. A binary needs three things. Remember, we are attributing every token to exactly one node, so this will help us figure it out.

  • We have our left term from the additive.
  • We have our op from the prime.
  • We have our right term from the prime.

But that leaves the prime's prime behind! We need to do something with that. Let's reason it out. Again, there are two cases:

  • If the prime's prime is nil, then the result is the binary.
  • If the prime's prime is not nil, then the result is a new binary with the old binary on the left, and... wait a minute, this is the same algorithm that we just described for transforming an additive into a binary.

We have just discovered that this algorithm is recursive, and once you realize that it is trivial to write:

private static ParseTree AdditiveToBinary(ParseTree left, ParseTree prime) 
{
    if (prime is Nil) return left;
    var reallyprime = (AdditivePrime) prime;
    var binary = new Binary(left, reallyprime.Operator, reallyprime.Term);
    return AdditiveToBinary(binary, reallyprime.Prime);
}

And now we modify our ParseAdditive:

private ParseTree ParseAdditive()
{
    var term = ParseTerm();
    var prime = ParseAdditivePrime();
    return AdditiveToBinary(term, prime);       
}     

And run it:

(((1+2)+3)+4)

And we're done.

Well, not quite. ParseAdditive no longer does what it says on the tin! It says ParseAdditive but it returns a Binary.

In fact... do we need Additive at all? Could we eliminate it entirely from the parser? In fact we could; we now never create an instance of Additive, so it can be deleted, and ParseAdditive can be renamed ParseBinary.

This often happens when constructing programs using the technique of "solve a simpler problem". You end up being able to throw away your earlier work, which is great. Deleted code has no bugs.

  • Exercise: Representing operators as strings is gross. Make an Operator class analogous to Term, and a method that parses operators. Keep raising the level of abstraction of the parser away from concrete strings and into the business domain of the parser. Similarly with tokens; they ought not to be strings.
  • Exercise: We've solved a slightly harder problem, so keep going. Can you now add multiplication?
  • Exercise: Can you parse a language that has a mix of left and right associative operators?

Some additional thoughts:

  • I presume you are doing this either for your own fun, or for a school assignment. Do not copy paste my work into your assignment. That's academic fraud. Remember to properly attribute all work when you submit it if that work is not entirely your own.

  • If you're doing it for fun, have fun designing a language! It's a nice hobby, and if you're really lucky, someday someone will pay you to do it.

  • You're designing your own language, so you do not have to repeat the mistakes of the past. I notice for example that your comments suggest that you're going to add cast expressions. Welcome to a world of pain if you do it like C, C++, C#, Java, and so on. All those languages must have their parsers disambiguate between (x)+y meaning "apply unary plus to y and cast the thing to x", and "add quantity (x) to y". It is a major pain. Consider a better syntax for casts, like an as operator. Also, examine the meaning of a cast; in C# a cast means both "produce an instance of a different type that represents the same value" and "I am asserting that the runtime type of this thing differs from its compile time type; throw if I am wrong". Those operations are completely different but they have the same syntax. All languages are responses to previous languages, so think hard about what you like because it is familiar vs because it is good.

like image 172
Eric Lippert Avatar answered Dec 05 '22 07:12

Eric Lippert