I wanted to generate a regex from an existing list of values, but when I tried to use a capture within it, the capture was not present in the match. Is it not possible to capture through an interpolated pattern, or am I doing something wrong?
my @keys = <foo bar baz>;
my $test-pattern = @keys.map({ "<$_>" }).join(' || ');

grammar Demo1 {
    token TOP {
        [
            || <foo>
            || <bar>
            || <baz>
        ] ** 1..* % \s+
    }
    token foo { 1 }
    token bar { 2 }
    token baz { 3 }
}

grammar Demo2 {
    token TOP {
        [ <$test-pattern> ] ** 1..* % \s+
    }
    token foo { 1 }
    token bar { 2 }
    token baz { 3 }
}

say $test-pattern, "\n" x 2, Demo1.parse('1 2 3'), "\n" x 2, Demo2.parse('1 2 3');
<foo> || <bar> || <baz>

「1 2 3」
 foo => 「1」
 bar => 「2」
 baz => 「3」

「1 2 3」
The rule for determining whether an atom of the form <...> captures without further ado is whether or not it starts with a letter or underscore.
If an assertion starts with a letter or underscore, then an identifier is expected/parsed and a match is captured using that identifier as the key in the enclosing match object. For example, <foo::baz-bar qux> begins with a letter and captures under the key foo::baz-bar.
If an assertion does not begin with a letter or underscore, then by default it does not capture.
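A small sketch of this rule, reusing the question's token names (the grammar name `RuleDemo` is made up for illustration). It assumes the standard behaviour that an assertion starting with a dot, such as `<.bar>`, calls the token without capturing:

```raku
# Grammar name `RuleDemo` is hypothetical; the tokens mirror the question's.
grammar RuleDemo {
    token TOP {
        <foo>  \s+   # starts with a letter: captured under the key foo
        <.bar> \s+   # starts with a dot: matched but not captured
        <baz>        # starts with a letter: captured under the key baz
    }
    token foo { 1 }
    token bar { 2 }
    token baz { 3 }
}

my $m = RuleDemo.parse('1 2 3');
say $m.hash.keys.sort;   # lists only the keys of the capturing assertions
```

The whole input still has to match all three tokens; only the bookkeeping in the match object differs.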
To capture the results of an assertion whose first character is not a letter or underscore you can either put it in parens or name it:
( <$test-pattern> ) ** 1..* % \s+
or, to name the assertion:
<test-pattern=$test-pattern> ** 1..* % \s+
or (just another way to have the same naming effect):
$<test-pattern>=<$test-pattern> ** 1..* % \s+
If all you do is put an otherwise non-capturing assertion in parens, then you have not switched capturing on for that assertion. Instead, you've merely wrapped it in an outer capture. The assertion remains non-capturing, and any sub-capture data of the non-capturing assertion is thrown away.
Thus the output of the first solution shown above (wrapping the <$test-pattern> assertion in parens) is:
「1 2 3」
 0 => 「1」
 0 => 「2」
 0 => 「3」
Sometimes that's exactly what you want, whether to simplify the parse tree or to save memory.
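One way to check the parens behaviour is to inspect the positional captures directly. This is a minimal sketch built from the question's code; the grammar name `Wrapped` and the variable `$wrapped` are invented for the example:

```raku
my @keys = <foo bar baz>;
my $test-pattern = @keys.map({ "<$_>" }).join(' || ');

# Hypothetical grammar: the question's Demo2 with parens instead of brackets.
grammar Wrapped {
    token TOP { ( <$test-pattern> ) ** 1..* % \s+ }
    token foo { 1 }
    token bar { 2 }
    token baz { 3 }
}

my $wrapped = Wrapped.parse('1 2 3');

# The quantified group is stored under positional key 0, one entry per repetition.
say $wrapped[0].map(*.Str);

# Each entry is a leaf capture: no foo/bar/baz keys are kept inside it.
say $wrapped[0].map(*.hash.elems);
```

The text of each repetition is available, but nothing records which of foo, bar, or baz matched it.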
In contrast, if you name an otherwise non-capturing assertion with either of the named forms shown above, you convert it into a capturing assertion, which means any sub-capture detail is retained. Thus the named solutions produce:
「1 2 3」
 test-pattern => 「1」
  foo => 「1」
 test-pattern => 「2」
  bar => 「2」
 test-pattern => 「3」
  baz => 「3」
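For completeness, a sketch of how the nested captures from a named form might be read back in code. It uses the `<test-pattern=$test-pattern>` form with the question's tokens; the grammar name `Named` and the variable `$named` are hypothetical:

```raku
my @keys = <foo bar baz>;
my $test-pattern = @keys.map({ "<$_>" }).join(' || ');

# Hypothetical grammar: the question's Demo2 with the assertion named.
grammar Named {
    token TOP { <test-pattern=$test-pattern> ** 1..* % \s+ }
    token foo { 1 }
    token bar { 2 }
    token baz { 3 }
}

my $named = Named.parse('1 2 3');

# Each repetition is kept under the key test-pattern, and each of those
# sub-matches keeps the key of whichever token matched.
for $named<test-pattern>.list -> $item {
    say $item.hash.keys, ' matched ', $item.Str;
}
```

Because the inner key survives, an action method or a later pass can tell which alternative from the generated list actually matched.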