Lexers/tokenizers and character sets

Question

When constructing a lexer/tokenizer, is it a mistake to rely on C functions such as isdigit/isalpha/...? As far as I know, they depend on the locale. Should I pick one character set, concentrate on it, and build my own character mapping from which I look up classifications? Then the problem becomes lexing multiple character sets. Do I produce one lexer/tokenizer per character set, or do I write the lexer so that the only thing I have to change is the character mapping? What are common practices?

t0mm13b · Accepted Answer

For now, I would concentrate on getting the lexer working with the plain ASCII character set. Once it works, add mapping support for other encodings such as UTF-16, plus locale support.

And no, it is not a mistake to rely on the ctype.h functions such as isdigit, isalpha and so on...
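One thing to watch when using the ctype.h functions in a lexer: their argument must be representable as an unsigned char (or be EOF), so a plain char should be cast before the call. A minimal sketch of a digit scanner follows; the function name scan_digits is made up for illustration and is not from any library.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: return the length of the run of decimal digits
   at the start of s. The cast keeps negative char values (bytes >= 0x80
   where plain char is signed) from being passed to isdigit. */
static size_t scan_digits(const char *s)
{
    size_t n = 0;

    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}
```

A lexer would call this at the current input position and treat the returned length as the extent of an integer literal; a result of 0 means the input does not start with a digit.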

At a later stage, you could switch to the wide-character equivalents of ctype, declared in <wctype.h> (part of standard C since the 1995 amendment, not only POSIX). It might be in your best interests to route the classification through a macro, so that you will be able to transparently change the code to handle wide characters:

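#ifdef LEX_WIDECHARS
#include <wctype.h>
/* wide build: the argument is a wint_t */
#define LEX_ISDIGIT(c) iswdigit(c)
#else
#include <ctype.h>
/* narrow build: cast so negative char values are never passed to isdigit */
#define LEX_ISDIGIT(c) isdigit((unsigned char)(c))
#endif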

It would be defined something like that in that context...

Hope this helps, Best regards, Tom.

Tronic · Answer

The ctype.h functions are not very usable for chars that contain anything but ASCII. A program starts in the "C" locale (essentially the same as ASCII on most machines), no matter what the system locale is. Even if you call setlocale to change the locale, chances are that the system uses a multi-byte encoding in which one character can span several chars (e.g. UTF-8), in which case you cannot tell anything useful from a single char.
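To illustrate why a single char is not enough with UTF-8, here is a rough sketch of decoding one UTF-8 sequence into a code point before classifying it. The function name utf8_decode and its return convention are assumptions made for this example, not part of any standard library.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence from s (n bytes available).
   Stores the code point in *cp and returns the number of bytes consumed,
   or 0 if the input is truncated or malformed. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t c, min;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {                      /* plain ASCII, one byte */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        len = 2; c = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3; c = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4; c = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;                           /* stray continuation or invalid lead byte */
    }

    if (n < len)
        return 0;                           /* sequence is cut off */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                       /* not a continuation byte */
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates and values past U+10FFFF. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return len;
}
```

For instance, the two bytes 0xC3 0xA9 decode to U+00E9, while the byte 0xA9 on its own is rejected, which is exactly the case where a per-char isalpha call has nothing meaningful to look at.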

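Wide chars handle more cases properly, but even they fail too often. For example, where wchar_t is 16 bits wide (as on Windows), characters outside the Basic Multilingual Plane take two wchar_t units, so a single wchar_t again tells you nothing useful.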

So, if you want to support non-ASCII isspace reliably, you have to do it yourself (or possibly use an existing library).
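As a starting point for doing it yourself, a lookup table indexed by byte value gives classifications that do not depend on setlocale at all. The sketch below covers ASCII only; the names LEX_SPACE, LEX_DIGIT, LEX_ALPHA, lex_table, lex_table_init and lex_class are all hypothetical, and it assumes an ASCII-compatible execution character set.

```c
#include <stddef.h>

/* Hypothetical classification flags; the names are illustrative only. */
enum { LEX_SPACE = 1, LEX_DIGIT = 2, LEX_ALPHA = 4 };

/* One entry per byte value. Static storage is zero-initialized, so any
   byte not set below (including everything >= 0x80) has no class. */
static unsigned char lex_table[256];

/* Fill the table with fixed ASCII rules. Assumes 'a'..'z' and 'A'..'Z'
   are contiguous, which holds for ASCII-compatible character sets. */
static void lex_table_init(void)
{
    static const char spaces[] = " \t\n\v\f\r";
    size_t i;
    int c;

    for (i = 0; spaces[i] != '\0'; i++)
        lex_table[(unsigned char)spaces[i]] |= LEX_SPACE;
    for (c = '0'; c <= '9'; c++)
        lex_table[c] |= LEX_DIGIT;
    for (c = 'a'; c <= 'z'; c++)
        lex_table[c] |= LEX_ALPHA;
    for (c = 'A'; c <= 'Z'; c++)
        lex_table[c] |= LEX_ALPHA;
}

/* Look up the class bits of one char. The cast avoids a negative index
   on platforms where plain char is signed. */
static unsigned lex_class(char ch)
{
    return lex_table[(unsigned char)ch];
}
```

After one call to lex_table_init(), the lexer tests characters with expressions such as lex_class(ch) & LEX_DIGIT. Supporting another single-byte character set would then mean filling the table differently, while a multi-byte encoding still needs decoding before any lookup.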

Note: ASCII only defines character codes 0-127 (of which 32-126 are printable). What some call "8 bit ASCII" is actually some other character set (commonly CP437, CP1252 or ISO-8859-1, but often something else).

Tags: c, character-encoding, tokenize, lexical-analysis
