I'm writing a compiler front end for a project and I'm trying to understand the best way to tokenize the source code. I can't choose between two approaches:
1) the tokenizer reads all tokens up front:
bool Parser::ReadAllTokens()
{
    Token token;
    while( m_Lexer->ReadToken( &token ) )
    {
        m_Tokens->push_back( token );
        token.Reset(); // reset the token values before reading the next one
    }
    return !m_Tokens->empty();
}
and then the parsing phase begins, operating on the m_Tokens list. With this approach the methods getNextToken(), peekNextToken() and ungetToken() are relatively easy to implement on top of an iterator (a sketch of such an implementation follows the example below), and the parsing code stays clear, not interrupted by lexing concerns, e.g.:
getNextToken();
useToken();
getNextToken();
peekNextToken();
if( peeked is something )
    ungetToken();
...
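For reference, here is a minimal sketch of how those helpers could be implemented over the collected token list. The Token type is a stand-in for the one produced by the lexer, and the index-based cursor is an assumption, not part of the original code:

#include <cstddef>
#include <vector>

// Hypothetical Token type; the real one comes from the lexer.
struct Token { /* kind, text, position, ... */ };

class TokenStream
{
public:
    explicit TokenStream( std::vector<Token> tokens )
        : m_Tokens( std::move( tokens ) ), m_Pos( 0 ) {}

    // Return the current token and advance the cursor.
    const Token& getNextToken() { return m_Tokens[ m_Pos++ ]; }

    // Look at the upcoming token without consuming it.
    const Token& peekNextToken() const { return m_Tokens[ m_Pos ]; }

    // Step back one token (undo the last getNextToken()).
    void ungetToken() { if( m_Pos > 0 ) --m_Pos; }

    bool atEnd() const { return m_Pos >= m_Tokens.size(); }

private:
    std::vector<Token> m_Tokens;
    std::size_t        m_Pos;
};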
2) the parsing phase begins, and each token is created and used on demand, only when the parser needs it (but the code seems less clear).
Which method is best, and why? And what about efficiency? Thanks in advance for the answers.
Traditionally, compiler construction classes teach you to read tokens one by one, as you parse. The reason is that, back in the day, memory was scarce: you had kilobytes at your disposal, not gigabytes as you do today.
Having said that, I still don't recommend reading all tokens in advance and then parsing from your list of tokens. Input is of arbitrary size, and if you hog too much memory, the system will become slow. Since it looks like you only need one token of lookahead, I'd read one token at a time from the input stream. The operating system will buffer and cache the input stream for you, so it will be fast enough for most purposes.
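To illustrate, a parser can keep a single lookahead token and pull the next one from the lexer only when the current one is consumed. This is just a sketch: Lexer::ReadToken( Token* ) is taken from the question, everything else is assumed:

// Single-token-lookahead wrapper around the lexer (sketch only).
class TokenCursor
{
public:
    explicit TokenCursor( Lexer* lexer ) : m_Lexer( lexer )
    {
        m_HasToken = m_Lexer->ReadToken( &m_Current ); // prime the lookahead
    }

    // Look at the next token without consuming it.
    const Token& peek() const { return m_Current; }

    // Consume the current token and fetch the next one from the lexer.
    Token next()
    {
        Token consumed = m_Current;
        m_HasToken = m_Lexer->ReadToken( &m_Current );
        return consumed;
    }

    bool atEnd() const { return !m_HasToken; }

private:
    Lexer* m_Lexer;
    Token  m_Current;
    bool   m_HasToken;
};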
It would be better to use something like Boost.Spirit to tokenise. Why reinvent the wheel?
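For example, here is a minimal sketch using Spirit X3 (assuming a reasonably recent Boost). It parses a comma-separated list of integers directly from the input string, with no hand-written tokenizer at all; the grammar and input are purely illustrative:

#include <boost/spirit/home/x3.hpp>
#include <iostream>
#include <string>
#include <vector>

namespace x3 = boost::spirit::x3;

int main()
{
    std::string input = "1, 2, 3, 42";
    std::vector<int> values;

    auto first = input.begin();
    bool ok = x3::phrase_parse( first, input.end(),
                                x3::int_ % ',',   // integers separated by commas
                                x3::space,        // skip whitespace between tokens
                                values );

    if( ok && first == input.end() )
        std::cout << "parsed " << values.size() << " integers\n";
    else
        std::cout << "parse failed\n";
}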