I am working on a parser for a language that has
identifiers (say, a letter followed by a number of alphanumeric characters or an underscore),
integers (any number of digits and possibly carets ^
),
some operators,
filename (a number of alphanumeric characters and possibly slashes, and dots)
Apparently filename overlaps integers and identifiers, so in general I cannot decide if I have a filename or, say, an identifier unless the filename contains a slash or a dot.
But filename can only follow a specific operator.
My question is how this situation is usually handled during tokenization? I have a table driven tokenizer (lexer), but I am not sure how to tell a filename from either an integer or an identifier. How is this done?
If filename was a superset of integers and identifiers then I probably could have grammar productions that could handle that, but the tokens overlap...
Flex and other lexers have the concept of start conditions. Essentially the lexer is a state machine and its exact behaviour will depend on its current state.
In your example, when your lexer encounters the operator preceding a filename it should switch to a FilenameMode
state (or whatever) and then switch back once it has produced the filename token it was expecting.
EDIT:
Just to give some concrete code this side of the hyperlink:
You would trigger your FILENAME_MODE
when you encounter the operator...
{FILENAME_PREFIX} { BEGIN(FILENAME_MODE); }
You would define your rule to parse a filename:
<FILENAME_MODE>{FILENAME_CHARS}+ { BEGIN(INITIAL); }
...switching back to the INITIAL
state in the action.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With