Is there a way to define a custom lexer for a Raku grammar, i.e. one that converts a string into a stream of (token id, value) pairs? I have been playing around with the grammar construct.
Rules seem intuitive, as they are presumably converted into functions in a recursive descent parser. For tokens and regexes, however, I would expect to be able to break them out with explicit token ids, plus an interface mapping those ids to names, so that I can write my own lexer. Is that possible?
Raku grammars are a form of scannerless parsing, where the lexical structure and parse structure are specified together.
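As a minimal sketch of what "specified together" means, here is a toy grammar (names like `Calc`, `expr`, and `number` are illustrative, not from the question) where the lexical pieces are just `token` declarations living alongside the `rule`s, with no separate lexer:

```raku
grammar Calc {
    rule  TOP    { <expr> }
    rule  expr   { <term>+ % '+' }          # terms separated by '+'
    rule  term   { <number> | '(' <expr> ')' }
    token number { \d+ }                    # the "lexical" part, inline
}

say so Calc.parse('1 + (2 + 3)');           # should print True
```

The `rule`s handle whitespace implicitly (sigspace), while `token number` plays the role a lexer token would in a traditional two-phase design.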
While it's true that rules form a recursive descent parser, that's only half of the story. When protoregexes or alternations (the `|` kind, not the `||` kind) are used, the declarative prefixes of these are gathered and an NFA is formed. It is then used to determine which of the alternation branches should be explored, if any; if there are multiple, they are ranked longest first, with longest literal and inheritance depth used as tie-breakers.
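A small sketch of this longest-token behaviour (the grammar and input here are illustrative):

```raku
grammar Keywords {
    token TOP  { <word> }
    token word { 'for' | 'foreach' }        # | uses longest-token matching
}

say so Keywords.parse('foreach');           # should print True
```

With `|`, the NFA ranks `'foreach'` ahead of `'for'` on this input even though `'for'` is written first. With sequential `||` the `'for'` branch would be tried first, and since tokens do not backtrack, parsing the full string `'foreach'` would then fail.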
Forming the declarative prefix involves looking down through the subrule calls to find lexical elements - effectively, then, the tokens. Thus, we could say that Raku grammars derive the tokenizer (in fact, many tokenizers) for us. These are typically generated at compile time. However, for things like custom operators, which are done by mixing in to the grammar, further NFAs have to be produced at runtime too, in order to account for the new tokens.
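Protoregexes are the usual way such extensible token sets are written; each candidate contributes its declarative prefix to the NFA. A hedged sketch (the grammar name and operators are made up for illustration):

```raku
grammar Ops {
    proto token op {*}
    token op:sym<*>  { <sym> }              # each candidate adds a prefix
    token op:sym<**> { <sym> }              # to the derived NFA
    token TOP { <op> }
}

say so Ops.parse('**');                     # should print True: '**' wins over '*'
```

Adding another `token op:sym<...>` candidate (for instance via a role mixed in at runtime) is what forces a fresh NFA to be computed for the `op` protoregex.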
There's currently no way to hook into grammar compilation and do things differently (at least, not without playing with compiler internals). However, there probably will be in the next major language release, where the AST of a Raku program will be made available to the language user, and it will thus be possible to write modules that affect the compilation of different program constructs.