Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Tokenizer efficiency question

I'm writing a compiler front end for a project and I'm trying to understand what's the best method of tokenize the source code. I can't choose between two ways:

1) the tokenizer read all tokens:

bool Parser::ReadAllTokens()
{
  Token token;
  while( m_Lexer->ReadToken( &token ) )
  {
    m_Tokens->push_back( token );
    token.Reset(); // reset the token values..
  }

  return !m_Tokens->empty();
}

and then the parsing phase begins, operating on the m_Tokens list. In this way the methods getNextToken(), peekNextToken() and ungetToken() are relatively easy to implement by iterator, and the parsing code is well written and clear ( not broken by getNextToken() i.e. :

 getNextToken();
 useToken();
 getNextToken();
 peekNextToken();
 if( peeked is something )
  ungetToken();
 ..
 ..

)

2) the parsing phase begins and when needed, the token is created and used ( the code seems not so clear )

What's the best method??and why??and the efficiency?? thanks in advance for the answers

like image 745
Salv0 Avatar asked Feb 26 '23 08:02

Salv0


2 Answers

Traditionally, compiler construction classes teach you to read tokens, one by one, as you parse. The reason for that, is that back in the days, memory resources were scarce. You had kilobytes to your disposal, and not gigabytes as you do today.

Having said that, I don't mean to recommend you to read all tokens in advance, and then parse from your list of tokens. Input is of arbitrary size. If you hog too much memory, the system will become slow. Since it looks like you only need one token in the lookahead, I'd read one at a time from the input stream. The operating system will buffer and cache the input stream for you, so it'll be fast enough for most purposes.

like image 108
Jörgen Sigvardsson Avatar answered Mar 07 '23 23:03

Jörgen Sigvardsson


It would be better to use something like Boost::Spirit to tokenise. Why reinvent the wheel?

like image 37
T33C Avatar answered Mar 08 '23 00:03

T33C