I'm playing with this longest token match in 2018.04 but I don't think the longest token is matching:
say 'aaaaaaaaa' ~~ m/
    | a+?
    | a+
    /; # 「a」
I expected the second alternative to have the longest token because it has a greedy quantifier. It looks like the non-greedy quantifier was counted as part of the longest token although Synopsis 5 indicates that it shouldn't be included.
If I reverse the order I get the output that I expect:
say 'aaaaaaaaa' ~~ m/
    | a+
    | a+?
    /;  # 「aaaaaaaaa」
Is this supposed to happen like this? What does the engine think the length of these tokens are? The official docs are quite vague so I'm drawing on Synopsis 5 to suss out how this is supposed to work.
I went for a dig in the compiler to see what's happening there. The + quantifier action method calls backmod which in turn sets the backtrack property to "f".
However, the code to compile an NFA for a quantifier doesn't look at the backtrack property at all, and is thus treating every quantifier the same, regardless of its backtrack mode. Thus it acts as if the ? were not there, meaning it will consider the two branches of equal length. It then uses declaration order as a tie-breaker, leading to it picking the first branch. That then applies the frugal quantifier once selected, and so matches a single "a". (This also explains why swapping the order changes things.)
This does seem out of line with what S05 envisions, which would be that the a+? should simply be considered "fate" (in this case meaning the a+? alternative would have a zero-length longest token). The specification (that is, test suite specifying the language) is silent on the matter, however, making it undefined behavior at present.
The proposed behavior in S05 makes sense to me, so I'd argue for specifying and implementing it that way. I've opened this issue to track it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With