I'm not sure how am I gonna tokenize source for lexer. For now, I only can think of using regex to parse string into array with given rule (identifier, symbols such as +,-, etc).
For instance,
begin x:=1;y:=2;
then I want to tokenize word, variable (x, y in this case) and each symbol (:,=,;).
Using regexes is a common way of implementing a lexer. If you don't want to use them then you'll sort of end up implementing some regex parts yourself anyway. Although performance-wise it can be more efficient if you do it yourself, it isn't a must.
A RegexpTokenizer splits a string into substrings using a regular expression. For example, the following tokenizer forms tokens out of alphabetic sequences, money expressions, and any other non-whitespace sequences: >>> from nltk.
Overview. The lexer is contained in the file lex.cc . It is a hand-coded lexer, and not implemented as a state machine. It can understand C, C++ and Objective-C source code, and has been extended to allow reasonably successful preprocessing of assembly language.
Using regexes is a common way of implementing a lexer. If you don't want to use them then you'll sort of end up implementing some regex parts yourself anyway.
Although performance-wise it can be more efficient if you do it yourself, it isn't a must.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With