Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Class hierarchy of tokens and checking their type in the parser

I'm attempting to write a reusable parsing library (for fun).

I wrote a Lexer class which generates a sequence of Tokens. Token is a base class for a hierarchy of subclasses, each representing different token type, with its own specific properties. For example, there is a subclass LiteralNumber (deriving from Literal and through it from Token), which has its own specific methods for dealing with numeric value of its lexeme. Methods for dealing with lexemes in general (retrieving their character string representation, position in the source etc.) are in the base class, Token, because they're general to all token types. Users of this class hierarchy can derive their own classes for specific token types not predicted by me.

Now I have a Parser class which reads the stream of tokens and tries to match them with its syntax definition. For example, it has a method matchExpression, which in turn calls matchTerm and this one calls matchFactor, which has to test if the current token is Literal or Name (both derived from Token base class).

The problem is:
I need to check now what is the type of the current token in the stream and whether it matches the syntax or not. If not, throw an EParseError exception. If yes, act accordingly to get its value in the expression, generate machine code, or do whatever the parser needs to do when the syntax matches.

But I've read much about that checking the type in runtime, and deciding from it, is a bad design™, and it should be refactored as polymorphic virtual methods. Of course, I agree with that.

So my first attempt was to put some type virtual method in the Token base class, which would be overrided by the derived classes and return some enum with type id.

But I already see a shortcomings of this approach: Users deriving from Token their own classes of tokens won't be able to add additional id's to the enum, which is in the library source! :-/ And the goal was to allow them for extending the hierarchy for new types of tokens when they'll need it.

I could also return some string from the type method, which would allow for easy defining new types.

But still, in both these cases, the information about base types is lost (only leaf type is returned from type method) and the Parser class wouldn't be able to detect the Literal derived type when someone would derive from it and override the type to return something other than "Literal".

And of course the Parser class, which also is meant for extending by users (that is, writing their own parsers, recognizing their own tokens and syntax) doesn't know what descendants of the Token class will be there in the future.

Many FAQs and books on design recommend in this scenario to take the behavior from the code which needs to decide by type, and put it into the virtual method overriden in derived classes. But I cannot imagine how could I put this behavior into the Token descendants, because it's not their busines, for example, to generate machine code, or evaluate expressions. Moreover, there are parts of the syntax which need to match more than one token, so there is no one particular token which I could put that behavior into. It's rather the responsibility of particular syntax rules, which could match more than one token as their terminal symbols.

Any ideas how to improve this design?

like image 441
SasQ Avatar asked Sep 09 '11 13:09

SasQ


Video Answer


1 Answers

RTTI is well supported by all major C++ compilers. This includes at least GCC, Intel and MSVC. The portability concerns are really a thing of the past.

If it is the syntax you don't like then here is a nice solution to pretty up RTTI:

class Base {
public:
  // Shared virtual functions
  // ...

  template <typename T>
  T *instance() {return dynamic_cast<T *>(this);}
};

class Derived : public Base {
  // ...
};

// Somewhere in your code
Base *x = f();

if (x->instance<Derived>()) ;// Do something

// or
Derived *d = x->instance<Derived>();

A common alternative to RTTI for a parser AST using virtual function overloading, without maintaining your own type enumeration, is to use the visitor pattern but in my experience that quickly becomes a PITA. You still have to maintain the visitor class but this can be sub-classed and extended. You will end up with a lot of boilerplate code all for the sake of avoiding RTTI.

Another option is just to create virtual functions for the syntactic types you are interested in. Such as isNumeric() which returns false in the Token base class but is overridden ONLY in Numeric classes to return true. If you provide default implementations for your virtual functions and let the subclasses override only when they need to then much of your problems will disappear.

RTTI is not as bad TM as it once was. Check the dates on the articles you are reading. One could also argue that pointers are a very bad idea but then you end up with languages like Java.

like image 159
jcoffland Avatar answered Sep 16 '22 22:09

jcoffland