Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

context sensitive tokenization of code

I am working on a parser for a language that has

  • identifiers (say, a letter followed by a number of alphanumeric characters or an underscore),

  • integers (any number of digits and possibly carets ^),

  • some operators,

  • filename (a number of alphanumeric characters and possibly slashes, and dots)

Apparently filename overlaps integers and identifiers, so in general I cannot decide if I have a filename or, say, an identifier unless the filename contains a slash or a dot.

But filename can only follow a specific operator.

My question is how this situation is usually handled during tokenization? I have a table driven tokenizer (lexer), but I am not sure how to tell a filename from either an integer or an identifier. How is this done?

If filename was a superset of integers and identifiers then I probably could have grammar productions that could handle that, but the tokens overlap...

like image 599
akonsu Avatar asked Oct 20 '22 04:10

akonsu


1 Answers

Flex and other lexers have the concept of start conditions. Essentially the lexer is a state machine and its exact behaviour will depend on its current state.

In your example, when your lexer encounters the operator preceding a filename it should switch to a FilenameMode state (or whatever) and then switch back once it has produced the filename token it was expecting.

EDIT:

Just to give some concrete code this side of the hyperlink:

You would trigger your FILENAME_MODE when you encounter the operator...

{FILENAME_PREFIX} { BEGIN(FILENAME_MODE); }

You would define your rule to parse a filename:

<FILENAME_MODE>{FILENAME_CHARS}+ { BEGIN(INITIAL); }

...switching back to the INITIAL state in the action.

like image 118
geoff_h Avatar answered Oct 22 '22 23:10

geoff_h