I have a simple question about two Unicode characters that I want to use in my programming language. For assignment I want to use the old APL symbols ← and →.
My flex-file (snazzle.l) looks like the following:
/** [email protected] 2017 **/
/** parser for omni programming language. **/
%{
#include <iostream>
using namespace std;
#define YY_DECL extern "C" int yylex()
int linenum = 0;
%}
%%
[\n] {++linenum;}
[ \t] ;
[0-9]+\.[0-9]+([eE][+-]?[0-9]+)? { cout << linenum << ". Found a floating-point number: " << yytext << endl; }
\"[^\"]*\" { cout << linenum << ". Found string: " << yytext << endl; }
[0-9]+ { cout << linenum << ". Found an integer: " << yytext << endl; }
[a-zA-Z0-9]+ { cout << linenum << ". Found an identifier: " << yytext << endl; }
([\←])|([\→])|(:=)|(=:) { cout << linenum << ". Found assignment operator: " << yytext <<endl; }
[\;] { cout << linenum << ". Found statement delimiter: " << yytext <<endl; }
[\[\]\(\)\{\}] { cout << linenum << ". Found parentheses: " << yytext << endl; }
%%
int main() {
// lex through the input:
yylex();
}
When I "snazzle" the following input:
x → y;
I get the assignment character a) wrong and b) three (3) times:
0. Found an identifier: x
0. Found assignment operator: �
0. Found assignment operator: �
0. Found assignment operator: �
0. Found an identifier: y
0. Found statement delimiter: ;
How can I add ← and → as possible flex-characters?
Flex produces eight-bit clean scanners; that is, it can handle any input consisting of arbitrary octets. It knows nothing about UTF-8 or Unicode codepoints, but that doesn't stop it from recognizing a Unicode input character as a sequence of octets (not a single character). Which sequence it will be depends on which Unicode encoding you are using, but assuming that your files are UTF-8, → will be the three bytes e2 86 92 and ← will be e2 86 90.
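If you want to check those byte values yourself, here is a minimal C++ sketch that encodes a code point as UTF-8 and prints the bytes in hex. The helper names `utf8_encode` and `hex_bytes` are my own inventions for illustration, not part of flex or any library, and the encoder does no validation.

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8.
// Sketch only: no validation of surrogates or out-of-range values.
std::string utf8_encode(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Render a byte string as space-separated lowercase hex.
std::string hex_bytes(const std::string& s) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (unsigned char b : s) {
        if (!out.empty()) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}
```

Calling `hex_bytes(utf8_encode(0x2192))` gives the bytes for → (U+2192), and `hex_bytes(utf8_encode(0x2190))` gives the bytes for ← (U+2190).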
You don't actually have to know that, however; you can just put the UTF-8 sequence into your flex pattern. You don't even need to quote it, although it is probably a good idea because it will prove less confusing if you end up using regular expression operators. Here I really mean quote it, as in "←". \← will not do what you expect, because the \ only applies to the next octet (as I said, flex knows nothing about Unicode encodings), which is only the first of the three bytes in that symbol. In other words, "←"? really means "an optional left-arrow", while \←? means "the two octets \xE2 \x86 optionally followed by \x90". I hope that's clear.
Flex character classes are not useful for Unicode sequences (or any other multi-byte sequence) because a character class is a set of octets. So if you write [←], flex will interpret that as "one of the octets \xE2, \x86 or \x90".
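That is why the scanner in the question reports the operator three times. The following C++ sketch is not flex itself, only a model of a character class as a set of bytes (the function names are hypothetical). It counts how many single-byte matches a class built from ← and → produces on the input from the question. The arrows are written as hex escapes so the example does not depend on the source file's encoding.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Model a flex character class as what it really is: a set of octets.
std::set<unsigned char> byte_class(const std::string& members) {
    return std::set<unsigned char>(members.begin(), members.end());
}

// Count the single-byte matches the class produces over the input,
// imitating a rule whose whole pattern is that one class.
std::size_t count_class_matches(const std::set<unsigned char>& cls,
                                const std::string& input) {
    std::size_t n = 0;
    for (unsigned char b : input) {
        if (cls.count(b) != 0) ++n;
    }
    return n;
}
```

With `byte_class("\xE2\x86\x90" "\xE2\x86\x92")` (the bytes of ← followed by the bytes of →) the class holds four distinct octets, and `count_class_matches` on `"x \xE2\x86\x92 y;"` reports three separate matches, one per byte of the arrow.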
It is rarely necessary to backslash-escape characters inside flex character classes; the only character which must be backslash-escaped is the backslash itself. It is not an error to escape characters which don't need escaping, so flex won't complain about it, but it makes the character classes hard for humans to read (at least, for this human to read). So [\←] means exactly the same as [←] and you could write [\[\]\(\)\{\}] as [][)(}{]. (] does not close a character class if it is the first character in the class, so it is conventional to write parentheses "face-to-face").
It is also not necessary to parenthesize character sequences inside alternatives, so you could write ([\←])|([\→])|(:=)|(=:) as ←|→|:=|=:. Or, if you prefer, "←"|"→"|":="|"=:". Of course, you wouldn't usually do that, since the scanner normally informs the parser about each individual operator. If your intention is to make ← a synonym of :=, then you would probably end up with:
←|:= { return LEFT_ARROW; }
→|=: { return RIGHT_ARROW; }
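To make the contrast with the byte-at-a-time behaviour concrete, here is a rough C++ sketch of what those two rules amount to: each spelling is matched as a whole byte sequence and mapped to a token. This is only an illustration and not what flex generates. The `Token` enum, the `Match` struct and `match_operator` are assumptions of mine; in a real project the token values would normally come from your parser generator.

```cpp
#include <cstddef>
#include <string>

// Hypothetical token codes, standing in for the parser's definitions.
enum Token { LEFT_ARROW, RIGHT_ARROW, OTHER };

struct Match {
    Token token;
    std::size_t length;  // number of bytes consumed
};

// Try each operator spelling as a complete byte sequence at position pos.
// pos must not exceed input.size().
Match match_operator(const std::string& input, std::size_t pos) {
    struct Spelling { const char* bytes; Token token; };
    static const Spelling spellings[] = {
        {"\xE2\x86\x90", LEFT_ARROW},   // left arrow, U+2190, in UTF-8
        {":=",           LEFT_ARROW},
        {"\xE2\x86\x92", RIGHT_ARROW},  // right arrow, U+2192, in UTF-8
        {"=:",           RIGHT_ARROW},
    };
    for (const Spelling& s : spellings) {
        const std::string bytes(s.bytes);
        if (input.compare(pos, bytes.size(), bytes) == 0) {
            return {s.token, bytes.size()};
        }
    }
    return {OTHER, 1};  // not an operator: consume a single byte
}
```

On the input `"x \xE2\x86\x92 y;"` at offset 2, this yields a single `RIGHT_ARROW` match that is three bytes long, rather than three one-byte matches.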
Rather than inserting print statements into your scanner's actions, you would be better off asking flex to put your scanner in debug mode. That is as simple as adding -d to the flex command line when you are building your scanner. See the flex manual section on debugging for more details.