I am using Raku 2020.10.
According to this page, https://docs.raku.org/language/regexes#Longest_alternation:_| , "|" or quoted lists are longest matches.
> say "youtube" ~~ / < you tube > /
「you」 # expected "tube" to win the match
> say "youtube" ~~ / you | tube /
「you」 # expected "tube" to win the match
> say "youtube" ~~ / tube | you /
「you」 # expected "tube" to win the match
Now trying "||" instead of "|":
> say "tubeyou" ~~ / you || tube /
「tube」 # longest match or first match?
> say "youtube" ~~ / you || tube /
「you」 # first match?
Now trying web page example:
> say 'food' ~~ / f | fo | foo | food /
「food」 # works as expected
> say 'foodtubes' ~~ / f | fo | foo | food | tubes /
「food」 # expected "tubes" (5 chars) to win
> say 'foodtubes' ~~ / tubes | f | fo | foo | food /
「food」
> say 'foodtubes' ~~ / dt /
「dt」
> say 'foodtubes' ~~ / dt | food /
「food」
> say 'foodtubes' ~~ / dt | food | tubes /
「food」
Seems like the matching engine with "|" quits after first somewhat longish successful match. Or what did I do wrong?
Thanks !!!
(This answer builds on what @donaldh already said in a comment).
This is a really good question, because it gets at something that often trips people up about how a regex searches a string: a regex fundamentally searches one character at a time and returns the first match it finds. You can modify this behavior (e.g., look-arounds consider other characters; several flags make the regex return more than one result). But if you start from a basic understanding of how the regex behaves by default, a lot of these issues become clearer.
So, let's apply that to a slight variant of your example:
> 'youtube' ~~ / you | .. | tube /
「you」
Here's how the regex engine looks at it (in high-level/simplified terms), character by character:
pos:0 youtube
      ^
branch 1 wants 'y'. Match!
branch 2 wants . (aka, anything). Match!
branch 3 wants 't'. No match :(
pos:1 youtube
       ^
branch 1 wants 'o'. Match!
branch 2 wants . Match!
branch 2 completed with a length of 2
pos:2 youtube
        ^
branch 1 wants 'u'. Match!
branch 1 completed with a length of 3
...all branches completed, and 2 matches found. Return the longest match found.
「you」
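The walkthrough above can be approximated in a few lines of Raku. This is only a rough sketch, not how Rakudo actually implements alternation: first-longest is a hypothetical helper invented for illustration, and it assumes every alternative is a literal string (so the .. branch is left out).

```raku
# Hypothetical helper: a toy model of how / a | b | c / scans a string.
# Assumes every alternative is a literal string, not a general regex.
sub first-longest(Str $text, @alternatives) {
    # Try each start position from left to right.
    for 0 ..^ $text.chars -> $pos {
        # Collect every alternative that matches at this position.
        my @hits = @alternatives.grep({ $text.substr($pos).starts-with($_) });
        # The earliest position with any hit wins; among those hits, take the longest.
        return $pos => @hits.max(*.chars) if @hits;
    }
    return Nil;
}

say first-longest('youtube',   <you tube>);             # 0 => you
say first-longest('foodtubes', <tubes f fo foo food>);  # 0 => food
say first-longest('tubeyou',   <you tube>);             # 0 => tube
```

Note that length is only compared among alternatives that match at the same position; a longer alternative further right is never considered.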
The consequence of this logic is that, as always, the regex returns the first match in the string (or, even more specifically, the match that starts at the earliest position in the string). The behavior of | kicks in when there are multiple matches that start at the same place. When that happens, | means that we get the longest match.
Conversely, with 'youtube' ~~ / you | tube /, we never have multiple matches that start at the same place, so we never need to rely on the behavior of |. (We do have multiple matches in the string, as you can see with a global search: 'youtube' ~~ m:g/ you | tube /)
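A short script makes the two rules easy to check side by side. It reuses the strings from the question; the variable name $m is only for illustration.

```raku
# Alternatives start at different positions: the earliest start wins,
# even though 'tubes' is longer than 'food'.
my $m = 'foodtubes' ~~ / tubes | food /;
say $m;        # 「food」
say $m.from;   # 0

# Alternatives all start at position 0: | picks the longest one.
say 'food' ~~ / f | fo | foo | food /;      # 「food」

# || instead picks the first alternative that matches at that position.
say 'food' ~~ / f || fo || foo || food /;   # 「f」
```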
If you want the longest of all matches in the string (rather than the longest option for the first match) then you can do so with something like the following:
('youtube' ~~ m:g/ you | tube /).sort(*.chars).tail
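As a variation on the same idea, here is a sketch that collects the global matches with .match and the :g adverb and then picks the longest with .max, which also lets you see where that match started. The names @matches and $longest are made up for this example.

```raku
# All non-overlapping matches, found by scanning left to right.
my @matches = 'foodtubes'.match(/ f | fo | foo | food | tubes /, :g);
say @matches.map(*.Str);   # (food tubes)

# Pick the longest of those matches by character count.
my $longest = @matches.max(*.chars);
say $longest;        # 「tubes」
say $longest.from;   # 4
```

Because :g only returns non-overlapping matches, this finds the longest of the matches the engine would report, not the longest of every substring the alternatives could possibly match.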
This is not a question of longest match.
This is a question of earliest match.
'abcd' ~~ / bcd | . /; # 「a」
Imagine that the above regex is actually surrounded by this:
/^ .*? <([ … ])> .* $/
So then we have:
/^ .*? <([ bcd | . ])> .* $/
Note that the first .*? is non-greedy. It prefers to not capture anything.
'abcd' ~~ /^ .*? <([ bcd | . ])> .* $/; # 「a」
It will if it has to, though:
'abcd' ~~ /^ .*? <([ bcd | b ])> .* $/; # 「bcd」
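To see that the unanchored regex and the wrapped one agree, you can compare where each match starts and ends. This is just a small check built from the examples above; the variable names are illustrative.

```raku
my $plain   = 'abcd' ~~ / bcd | . /;
my $wrapped = 'abcd' ~~ /^ .*? <([ bcd | . ])> .* $/;
my $forced  = 'abcd' ~~ /^ .*? <([ bcd | b ])> .* $/;

# Both forms settle on the earliest possible start, position 0.
say $plain.from;     # 0
say $wrapped.from;   # 0
say ~$wrapped;       # a

# Nothing can match at position 0 here, so the leading .*? consumes one character.
say $forced.from;    # 1
say ~$forced;        # bcd
```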