Is it bad idea using regex to tokenize string for lexer?

I'm not sure how I'm going to tokenize source code for a lexer. So far the only approach I can think of is using regex to split the string into an array of tokens according to a set of rules (identifiers, symbols such as + and -, etc.).

For instance,

begin x:=1;y:=2;

then I want to tokenize the keyword, the variables (x and y in this case), and each symbol (:, =, ;).
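In other words, for that line I'd expect a token stream roughly like this (the token names are just illustrative):

begin  ->  keyword
x      ->  identifier
:=     ->  assignment operator
1      ->  number literal
;      ->  statement separator
y      ->  identifier
:=     ->  assignment operator
2      ->  number literal
;      ->  statement separator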

asked Feb 07 '13 by REALFREE


1 Answer

Using regexes is a common way of implementing a lexer. If you don't want to use them, you'll end up implementing parts of a regex engine by hand anyway.

A hand-written lexer can be more efficient performance-wise, but it isn't a must.

answered Nov 29 '22 by Oak
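By way of illustration, here is a minimal sketch of the regex approach in Python, tokenizing the example from the question with a single regex built from named alternatives and re.finditer. The token names and the keyword set are assumptions made for this example, not part of any particular lexer's design.

import re

# Token specification as (NAME, regex) pairs. Order matters: earlier
# alternatives win, so the keyword rule must come before the general
# identifier rule, or "begin" would be tokenized as an identifier.
TOKEN_SPEC = [
    ("KEYWORD",  r"\b(?:begin|end)\b"),  # assumed keyword set
    ("IDENT",    r"[A-Za-z_]\w*"),       # variables such as x, y
    ("NUMBER",   r"\d+"),
    ("ASSIGN",   r":="),
    ("SEMI",     r";"),
    ("SKIP",     r"\s+"),                # whitespace, discarded
    ("MISMATCH", r"."),                  # anything else is an error
]

MASTER_RE = re.compile("|".join(
    f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Yield (token_name, lexeme) pairs for the given source string."""
    for match in MASTER_RE.finditer(source):
        kind = match.lastgroup
        text = match.group()
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise SyntaxError(f"unexpected character {text!r}")
        yield kind, text

print(list(tokenize("begin x:=1;y:=2;")))
# [('KEYWORD', 'begin'), ('IDENT', 'x'), ('ASSIGN', ':='),
#  ('NUMBER', '1'), ('SEMI', ';'), ('IDENT', 'y'), ('ASSIGN', ':='),
#  ('NUMBER', '2'), ('SEMI', ';')]

Putting the catch-all MISMATCH rule last turns any character the rules don't cover into an explicit error rather than a silent skip, which makes a small lexer like this much easier to debug.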