
Capturing what's inside a nested structure in a regex or grammar token

I'd like to capture the interior of a nested structure.

my $str = "(a)";
say $str ~~ /"(" ~ ")" (\w) /;
say $str ~~ /"(" ~ ")" <(\w)> /;
say $str ~~ /"(" <(~)> ")" \w /;
say $str ~~ /"(" <(~ ")" \w /;

The first one works; the last one works but also captures the closing parenthesis. The other two fail, so it's not possible to use capture markers in this case. But the problem is more complicated in the context of a grammar, since capturing groups do not seem to work either, like here:

# Please paste this together with the code above so that it compiles.
grammar G {
    token TOP {
              '(' ~ ')' $<content> = .+?
    }
}

grammar H {
    token TOP {
              '(' ~ ')' (.+?)
    }
}

grammar I {
    token TOP {
              '(' ~ ')' <( .+? )>
    }
}

$str = "(one of us)";
for G,H,I -> $grammar {
    say $grammar.parse( $str );
}

Neither capturing groups nor capture markers seem to work here; the only thing that works is assigning the match, on the fly, to a variable. This, however, creates an additional token I'd really like to avoid. So there are two questions:

  • What is the right way to make capture markers work in nested structures?
  • Is there a way to use either capturing groups or capturing markers in tokens to get the interior of a nested structure?
asked Jul 04 '20 11:07 by jjmerelo



1 Answer

One solution to two issues

  • Per ugexe's comment, the [...] grouping construct works for all your use cases.

  • The <( and )> capture markers are not grouping constructs so they don't work with the regex ~ operation unless they're grouped.

  • The (...) capture/grouping construct clamps frugal matching to its minimum match when ratchet is in effect. A pattern like :r (.+?) never matches more than one character.

The behaviors described in the last two bullet points above aren't obvious, aren't in the docs, may not be per the design docs, may be holes in roast, may be figments of my imagination, etc. The rest of this answer explains what I've found out about the above three cases, and discusses some things that could be done.

Glib explanation, as if it's all perfectly cromulent

<( and )> are capture markers.

They behave as zero width assertions. Each asserts "this marks where I want capturing to start/end for the regex that contains this marker".
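A minimal sketch of what the markers do to the resulting match object. The string 'key=value' and the variable name $m are made up for illustration; the point is that the markers move .from and .to without adding a sub-capture.

```raku
# 'key=' must match, but <( moves the start of the overall match past it.
my $m = 'key=value' ~~ / 'key=' <( \w+ )> /;

say ~$m;            # value
say $m.from;        # 4
say $m.to;          # 9
say $m.list.elems;  # 0, because the markers add no positional capture
```

Contrast this with (\w+), which would leave the overall match as 'key=value' and put 'value' in $0.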


Per the doc for the regex ~ operator:

it mostly ignores the left argument, and operates on the next two [arguments]

(The doc says "atoms" where I've written "arguments". In reality it operates on the next two atoms or groups.)

In the regex pattern "(" ~ ")" <(\w)>:

  • ")" is the first atom/group after ~.

  • <( is the second atom/group after ~.

  • ~ ignores \w and )>; they are just ordinary atoms that get matched after the goal ")".


The solution is to use [...]:

say '(a)' ~~ / '(' ~ ')' [ <( \w )> ] /; # 「a」

Similarly, in a grammar:

token TOP { '(' ~ ')' [ <( .+? )> ] }
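For completeness, here is that token dropped into a full grammar and run against the string from the question. The grammar name Parenthesized is hypothetical; this is a sketch of the approach, and the outputs in the comments are what I would expect from the behavior described above.

```raku
# Hypothetical grammar name; the token body is the one suggested above.
grammar Parenthesized {
    token TOP { '(' ~ ')' [ <( .+? )> ] }
}

my $match = Parenthesized.parse("(one of us)");

say ~$match;             # expected: one of us
say $match.hash.elems;   # expected: 0 (no named sub-captures)
say $match.list.elems;   # expected: 0 (no positional sub-captures)
```

The [...] group makes the whole `<( .+? )>` sequence a single argument to ~, so the markers end up around the interior and no extra capture is created.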

(...) grouping isn't what you want for two reasons:

  • It couldn't be what you want. It would create an additional token capture. And you wrote you'd like to avoid that.

  • Even if you wanted the additional capture, using (...) when ratchet is in effect clamps frugal matching within the parens.
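A small sketch of the clamping, assuming the behavior described above holds on your Rakudo version. The string 'abc' and the variable names are made up; the ^ anchor is my addition so that the unanchored match cannot simply restart at a later position and hide the effect.

```raku
my $str = 'abc';

# Capturing group under ratchet: (.+?) settles on 'a' and is not re-entered,
# so the following c has nothing to match against.
my $captured = $str ~~ / :r ^ (.+?) c /;

# Non-capturing group under ratchet: the frugal quantifier can still grow.
my $grouped  = $str ~~ / :r ^ [.+?] c /;

say so $captured;   # expected: False
say ~$grouped;      # expected: abc
```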

What could be done about capture markers "not working"?

I think a doc update is the likely best thing to do. But imo whoever thinks of filing an issue about one, or preparing a PR, would be well advised to make use of the following.

Is it known to be intended behavior or a bug?

Searches of GH repos for "capture markers":

  • raku/old-design-docs

  • raku/roast

  • raku/old-issue-tracker and rakudo/rakudo

  • raku/docs

The term "capture markers" comes from the doc, not the old design docs which just say:

A <( token indicates the start of the match's overall capture, while the corresponding )> token indicates its endpoint. When matched, these behave as assertions that are always true, but have the side effect of setting the .from and .to attributes of the match object.

(Maybe you can figure out from that what strings to search for among issues etc...)

At the time of writing, all GH searches for <( or )> draw blanks but that's due to a weakness of the current built in GH search, not because there aren't any in those repos, eg this.


I was curious and tried this:

my $str = "aaa";
say $str ~~ / <(...)>* /;

It infinitely loops. The * is acting on just the )>. This corroborates the sense that capture markers are treated as atoms.


The regex ~ operator works for [...] and some other grouped atom constructions. Parsing any of them has a start and end within a regex pattern.

The capture markers are different in that they aren't necessarily paired -- the start or end can be implicit.
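To illustrate the unpaired case, a sketch with made-up input in which each marker appears on its own and the other end of the match is left implicit:

```raku
# Only a start marker: the match ends wherever the regex ends.
my $start-only = 'key=value' ~~ / 'key=' <( \w+ /;

# Only an end marker: the match starts wherever the regex starts.
my $end-only   = 'key=value' ~~ / \w+ )> '=' /;

say ~$start-only;   # value
say ~$end-only;     # key
```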

Perhaps this makes treating them as we might wish unreasonably difficult for Raku given that start (/ or {) and end (/ or }) occur at a slang boundary and Raku is a single-pass parsing braid?


I think that a doc fix is probably the appropriate response to the capture marker aspect of your SO question.

If regex ~ were the only regex construct that cared that left and right capture markers are each an individual atom then perhaps the best place to mention this wrinkle would be in the regex ~ section.

But given that multiple regex constructs care (quantifiers do per the above infinite loop example), then perhaps the best place is the capture markers section.

Or perhaps it would be best if it's mentioned in both. (Though that's a slippery slope...)

What could be done about :r (.*?) "not working"?

I think a doc update is the likely best thing to do. But imo whoever thinks of filing an issue about one, or preparing a PR, would be well advised to make use of the following.

Is it known to be intended behavior or a bug?

Searches of GH repos for ratchet frugal:

  • raku/old-design-docs

  • raku/roast

  • raku/old-issue-tracker and rakudo/rakudo

  • raku/docs

The terms "ratchet" and "frugal" both come from the old design docs and are still used in the latest doc and don't seem to have aliases. So searches for them should hopefully match all relevant mentions.

The above searches are for both words. Searching for one at a time may reveal important relevant mentions that happen to not mention the other.

At the time of writing, all GH searches for .*? or similar draw blanks but that's due to a weakness of the current built in GH search, not because there aren't any in those repos.


Perhaps the issue here is broader than the combination of ratchet, frugal, and capture?

Perhaps file an issue using the words "ratchet", "frugal" and "capture"?

answered Nov 04 '22 06:11 by raiph
