Anti-matching against an infinite family of <!before> patterns in Raku

Question

I am trying to avoid matching whitespace at the end of a string while still matching whitespace in the middle of words.

Here is an example of a regex that matches underscores between the x characters but does not match a short run of trailing underscores (up to four, as written).

say 'x_x___x________' ~~ /
[
| 'x'
| '_' <!before [
        | $ 
        | '_' <?before $> 
        | '_' <?before ['_' <?before $>]>
        | '_' <?before ['_' <?before ['_' <?before $>]>]>
        # ...
    ]>
]+
/;

Is there a way to construct the rest of the pattern implied by the ...?

Brad Gilbert · Accepted Answer

It is a little difficult to discern what you are asking for.

You could be looking for something as simple as this:

say 'x_x___x________' ~~ / 'x'+ % '_' ** 1..3 /
# ｢x_x___x｣

or

say 'x_x___x________' ~~ / 'x'+ % '_' ** 1..2 /
# ｢x_x｣

or

say 'x_x___x________' ~~ / 'x'+ % '_'+ /
# ｢x_x___x｣
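
In all three patterns the % modifier attaches a separator to the preceding quantifier, and the separator is itself a quantified atom, so the ** 1..3 and the final + apply to '_' rather than to the whole expression. A minimal sketch that makes the grouping explicit with brackets (the variable names are only for illustration):

```raku
my $s = 'x_x___x________';

# Separator is exactly one underscore, so the match stops before the run of three.
my $single = ~($s ~~ / 'x'+ % '_' /);

# Separator is one or more underscores; the brackets only make the grouping visible.
my $runs = ~($s ~~ / 'x'+ % [ '_'+ ] /);

say $single;  # x_x
say $runs;    # x_x___x
```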

p6steve · Answer

I would suggest using a Capture..., thusly:

'x_x___x________' ~~ /(.*?) _* $/; 
say $0;     #｢x_x___x｣

(The ? modifier makes the * 'non-greedy'.) Please let me know if I have missed the point!
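
To see what the ? changes, here is a small side-by-side sketch of the greedy and frugal forms on the same string (the variable names are illustrative only):

```raku
my $s = 'x_x___x________';

# Greedy: .* swallows the whole string, leaving _* to match nothing.
$s ~~ / (.*) _* $ /;
my $greedy = ~$0;

# Frugal: .*? stops at the first point where _* $ can finish the match.
$s ~~ / (.*?) _* $ /;
my $frugal = ~$0;

say $greedy;  # x_x___x________
say $frugal;  # x_x___x
```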

raiph · Answer

"avoid matching whitespace at the end of a string while still matching whitespace in the middle of words"

Per Brad's answer, and your comment on it, something like this:

/ \w+ % \s+ /

"what I'm looking for is a way to match arbitrarily long streams that end with a known pattern"

Per @user0721090601's comment on your Q, and as a variant of @p6steve's answer, something like this:

/ \w+ % \s+ )> \s* $ /

The )> capture marker marks where capture is to end.

You can use arbitrary patterns on the left and right of that marker.
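
A minimal sketch of the marker in use. Note that % separates repetitions of the single atom in front of the quantifier, so \w+ % \s+ repeats one \w at a time; the second pattern below is an assumption about what you would want for multi-character words, with \w+ grouped so the separator sits between whole words.

```raku
my $s = 'x x  x   ';

# Everything after )> must still match, but is left out of the result.
my $m = $s ~~ / \w+ % \s+ )> \s* $ /;
say $m.Str;  # x x  x

# Assumption: words longer than one character, so \w+ is grouped before applying + % \s+.
my $words = 'foo bar  ' ~~ / [ \w+ ]+ % \s+ )> \s* $ /;
say $words.Str;  # foo bar
```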

"an infinite family of <!before> patterns"

Generalizing to an infinite family of patterns of any type, whether they are zero-width or not, the most natural solution in a regex is iteration using any of the standard quantifiers that are open ended. For example, \s+ for one or more whitespace characters.^[1] ^[2]
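
Applied to the pattern in the question, this could look like the sketch below: a single '_'* inside the lookahead stands in for the whole nested family of <?before> alternatives. This is one possible reading of what the question asks for, not the only one.

```raku
my $s = 'x_x___x________';

# '_'* covers zero, one, two, ... further underscores before the end of the string.
my $m = $s ~~ / [ 'x' | '_' <!before '_'* $> ]+ /;
say $m.Str;  # x_x___x
```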

"Is there a way to construct the rest of the pattern implied by the ...?"

I'll generalize that to "Is there a way in a Raku regex to match some arbitrary pattern that could in theory be recognized by a computer program?"

The answer is always "Yes":

While Raku rules/regexes might look like traditional regexes they are in fact arbitrary functions embedded in an arbitrary program over which you ultimately have full control.
Rules have arbitrary read access to capture state.^[3]
Rules can do arbitrary Turing-complete computation.^[4]
A collection of rules/regexes can arbitrarily consume input and drive the parse/match state, i.e. can implement any parser.

In short, if it can be matched/parsed by any program written in any programming language, it can be matched/parsed using Raku rules/regexes.

Footnotes

^[1] If you use an open ended quantifier you do need to make sure that each match iteration/recursion either consumes at least one character, or fails, so that you avoid an infinite loop. For example, the * quantifier will succeed even if the pattern it quantifies does not match, so be careful that that won't lead to an infinite loop.

^[2] Given the way you wrote your example, perhaps you are curious about recursion rather than iteration. Suffice it to say, it's easy to do that too.^[1]
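
As a rough sketch of the recursive route, the nested lookaheads from the question can be folded into a token that calls itself. The grammar name Underscores and the token name tail are hypothetical, chosen only for this illustration; each call consumes one underscore before recursing, which satisfies the caution in footnote 1.

```raku
grammar Underscores {
    # Hypothetical helper: an underscore followed by another tail, or by the end of the string.
    token tail { '_' [ <.tail> || $ ] }

    # An underscore counts only when it does not start such a tail.
    token TOP { [ 'x' | '_' <!tail> ]+ }
}

# subparse matches from the start without requiring the whole string to be consumed.
my $m = Underscores.subparse('x_x___x________');
say $m.Str;  # x_x___x
```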

^[3] In Raku rules, captures form a hierarchy. There are two special variables that track the capture state of two key levels of this hierarchy:

$¢ is the capture state of the innermost enclosing overall capture. Think of it as something analogous to a return value being constructed by the current function call in a stack of function calls.
$/ is the capture state of the innermost enclosing capture. Think of it as something analogous to a value being constructed by a particular block of code inside a function.

For example:

'123' ~~ / 1* ( 2* { print "$¢ $/" } ) 3* { print "$¢ $/" } / ; # prints "1 2", then "123 123"; run together: 1 2123 123

The overall / ... / is analogous to an ordinary function call. The first 1 and first 123 of the output show what has been captured by that overall regex.
The ( ... ) sets up an inner capture for a part of the regex. The 2* { print "$¢ $/" } within it is analogous to a block of code. The 2 shows what it has captured.
The final 123 shows that, at the top level of the regex, $/ and $¢ have the same value.

^[4] For example, the code in footnote 3 above includes arbitrary code inside the { ... } blocks. More generally:

Rules can be invoked recursively;
Rules can have full signatures and pass arguments;
Rules can contain arbitrary code;
Rules can use multiple dispatch semantics for resolution. Notably, this can include resolution based on longest match length.

jubilatious1 · Answer

I’m wondering if Raku’s trim() routines might suit your purpose, for example: .trim, .trim-trailing or even .trim-leading. In the Raku REPL:

> say 'x x  x   ' ~~ m:g/ 'x'+  \s* /;    
(｢x ｣ ｢x  ｣ ｢x   ｣)    

> say 'x x  x   '.trim-trailing ~~ m:g/ 'x'+  \s* /;    
(｢x ｣ ｢x  ｣ ｢x｣)

HTH.

https://docs.raku.org/routine/trim
https://docs.raku.org/routine/trim-trailing
https://docs.raku.org/routine/trim-leading
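
The trim routines only strip whitespace. If the trailing run is made of underscores, as in the question's sample string, a comparable sketch (an assumption about the intent, not part of the trim API) is to remove the run with .subst before matching:

```raku
my $s = 'x_x___x________';

# Assumption: the unwanted tail consists of underscores, which trim-trailing leaves alone.
my $stripped = $s.subst(/ '_'+ $ /, '');
say $stripped;  # x_x___x
```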

Tags: grammar, raku

littlebenlittle
