I'm attempting to write a reusable parsing library (for fun).
I wrote a Lexer
class which generates a sequence of Tokens
. Token
is a base class for a hierarchy of subclasses, each representing different token type, with its own specific properties. For example, there is a subclass LiteralNumber
(deriving from Literal
and through it from Token
), which has its own specific methods for dealing with numeric value of its lexeme. Methods for dealing with lexemes in general (retrieving their character string representation, position in the source etc.) are in the base class, Token
, because they're general to all token types. Users of this class hierarchy can derive their own classes for specific token types not predicted by me.
Now I have a Parser
class which reads the stream of tokens and tries to match them with its syntax definition. For example, it has a method matchExpression
, which in turn calls matchTerm
and this one calls matchFactor
, which has to test if the current token is Literal
or Name
(both derived from Token
base class).
The problem is:
I need to check now what is the type of the current token in the stream and whether it matches the syntax or not. If not, throw an EParseError
exception. If yes, act accordingly to get its value in the expression, generate machine code, or do whatever the parser needs to do when the syntax matches.
But I've read much about that checking the type in runtime, and deciding from it, is a bad design™, and it should be refactored as polymorphic virtual methods. Of course, I agree with that.
So my first attempt was to put some type
virtual method in the Token
base class, which would be overrided by the derived classes and return some enum
with type id.
But I already see a shortcomings of this approach: Users deriving from Token
their own classes of tokens won't be able to add additional id's to the enum
, which is in the library source! :-/ And the goal was to allow them for extending the hierarchy for new types of tokens when they'll need it.
I could also return some string
from the type
method, which would allow for easy defining new types.
But still, in both these cases, the information about base types is lost (only leaf type is returned from type
method) and the Parser
class wouldn't be able to detect the Literal
derived type when someone would derive from it and override the type
to return something other than "Literal"
.
And of course the Parser
class, which also is meant for extending by users (that is, writing their own parsers, recognizing their own tokens and syntax) doesn't know what descendants of the Token
class will be there in the future.
Many FAQs and books on design recommend in this scenario to take the behavior from the code which needs to decide by type, and put it into the virtual method overriden in derived classes. But I cannot imagine how could I put this behavior into the Token
descendants, because it's not their busines, for example, to generate machine code, or evaluate expressions. Moreover, there are parts of the syntax which need to match more than one token, so there is no one particular token which I could put that behavior into. It's rather the responsibility of particular syntax rules, which could match more than one token as their terminal symbols.
Any ideas how to improve this design?
RTTI is well supported by all major C++ compilers. This includes at least GCC, Intel and MSVC. The portability concerns are really a thing of the past.
If it is the syntax you don't like then here is a nice solution to pretty up RTTI:
class Base {
public:
// Shared virtual functions
// ...
template <typename T>
T *instance() {return dynamic_cast<T *>(this);}
};
class Derived : public Base {
// ...
};
// Somewhere in your code
Base *x = f();
if (x->instance<Derived>()) ;// Do something
// or
Derived *d = x->instance<Derived>();
A common alternative to RTTI for a parser AST using virtual function overloading, without maintaining your own type enumeration, is to use the visitor pattern but in my experience that quickly becomes a PITA. You still have to maintain the visitor class but this can be sub-classed and extended. You will end up with a lot of boilerplate code all for the sake of avoiding RTTI.
Another option is just to create virtual functions for the syntactic types you are interested in. Such as isNumeric() which returns false in the Token base class but is overridden ONLY in Numeric classes to return true. If you provide default implementations for your virtual functions and let the subclasses override only when they need to then much of your problems will disappear.
RTTI is not as bad TM as it once was. Check the dates on the articles you are reading. One could also argue that pointers are a very bad idea but then you end up with languages like Java.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With