When constructing a lexer/tokenizer, is it a mistake to rely on C functions such as isdigit/isalpha/...? As far as I know, they are locale-dependent. Should I instead pick a character set, build my own character-classification mapping, and look up classifications there? Then the problem becomes lexing multiple character sets: do I write one lexer/tokenizer per character set, or do I structure the one I have so that the only thing I need to change is the character mapping? What are common practices?
For now, I would concentrate on getting the lexer working with the plain ASCII character set first; once it works, add a mapping layer to support other encodings such as UTF-16, and locale support.
And no, it is not a mistake to rely on the ctype functions such as isdigit, isalpha, and so on.
Actually, for a later stage: there is a standard wide-character equivalent of ctype, 'wctype.h', so it might be in your best interest to define a macro layer now, so that you can later switch the code transparently between different locales and character sets.
#ifdef LEX_WIDECHARS
#include <wctype.h>
#define lex_isdigit iswdigit
#else
#include <ctype.h>
#define lex_isdigit isdigit
#endif
It would be defined something like that in that context...
Hope this helps, Best regards, Tom.
The ctype.h functions are not very usable for chars that contain anything but ASCII. The default locale is "C" (essentially the same as ASCII on most machines), no matter what the system locale is. Even if you call setlocale to change the locale, the chances are that the system uses a multi-byte encoding such as UTF-8, in which case you cannot tell anything useful from a single char.
Wide chars handle more cases properly, but even they fail too often.
So, if you want to support non-ASCII isspace reliably, you have to do it yourself (or possibly use an existing library).
Note: ASCII only has character codes 0-127 (with printable characters in 32-126), and what some call "8-bit ASCII" is actually some other character set (commonly CP437, CP1252, or ISO-8859-1, and often something else entirely).