context sensitive tokenization of code

Question

I am working on a parser for a language that has

identifiers (say, a letter followed by a number of alphanumeric characters or an underscore),
integers (any number of digits and possibly carets ^),
some operators,
filename (a number of alphanumeric characters and possibly slashes, and dots)

Apparently filename overlaps integers and identifiers, so in general I cannot decide if I have a filename or, say, an identifier unless the filename contains a slash or a dot.

But filename can only follow a specific operator.

My question is how this situation is usually handled during tokenization? I have a table driven tokenizer (lexer), but I am not sure how to tell a filename from either an integer or an identifier. How is this done?

If filename was a superset of integers and identifiers then I probably could have grammar productions that could handle that, but the tokens overlap...

geoff_h · Accepted Answer

Flex and other lexers have the concept of start conditions. Essentially the lexer is a state machine and its exact behaviour will depend on its current state.

In your example, when your lexer encounters the operator preceding a filename it should switch to a FilenameMode state (or whatever) and then switch back once it has produced the filename token it was expecting.

EDIT:

Just to give some concrete code this side of the hyperlink:

You would trigger your FILENAME_MODE when you encounter the operator...

{FILENAME_PREFIX} { BEGIN(FILENAME_MODE); }

You would define your rule to parse a filename:

<FILENAME_MODE>{FILENAME_CHARS}+ { BEGIN(INITIAL); }

...switching back to the INITIAL state in the action.

context sensitive tokenization of code

Tags:

parsing

token

tokenize

formal-languages

akonsu

1 Answers

geoff_h

Recent Activity

Donate For Us

context sensitive tokenization of code

Tags:

parsing

token

tokenize

formal-languages

akonsu

1 Answers

geoff_h

Related questions

Recent Activity

Donate For Us