Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to access groups captured by recursive perl regexes?

I'm trying to munge a simple grammar with a perl regex (note this isn’t intended for production use, just a quick analysis for providing editor hints/completions). For instance,

my $GRAMMAR = qr{(?(DEFINE)
  (?<expr> \( (?&expr) \) | (?&number) | (?&var) | (?&expr) (?&op) (?&expr) )
  (?<number> \d++ )
  (?<var> [a-z]++ )
  (?<op> [-+*/] )
)}x;

I would like to be able to run this as

$expr =~ /$GRAMMAR(?&expr)/;

and then access all the variable names. However, according to perlre,

Note that capture groups matched inside of recursion are not accessible after the recursion returns, so the extra layer of capturing groups is necessary. Thus $+{NAME_PAT} would not be defined even though $+{NAME} would be.

So apparently this is not possible. I could try using a (?{ code }) block to save variable names to a hash, but this doesn't respect backtracking (i.e. the assignment’s side effect persists even if the variable is backtracked past).

Is there any way to get everything captured by a given named capture group, including recursive matches? Or do I need to manually dig through the individual pieces (and thus duplicate all the patterns)?

like image 268
Steve Avatar asked Jul 11 '13 20:07

Steve


People also ask

What is S in Perl regex?

The substitution operator, s///, is really just an extension of the match operator that allows you to replace the text matched with some new text. The basic form of the operator is − s/PATTERN/REPLACEMENT/;

What is $1 Perl?

$1 equals the text " brown ".

What is K in regex?

\K resets the starting point of the reported match. Any previously consumed characters are no longer included in the final match. To make the explanation short, consider the following simple Regex: a\Kb. When "b" is matched, \K tells the Regex engine to pretend that the match attempt started at this position.


2 Answers

The necessity of having to add capturing and backtracking machinery is one of the shortcomings that Regexp::Grammars addresses.

However, the grammar in your question is left-recursive, which neither Perl regexes nor a recursive-descent parser will parse.

Adapting your grammar to Regexp::Grammars and factoring out left-recursion produces

my $EXPR = do {
  use Regexp::Grammars;
  qr{
    ^ <Expr> $

    <rule: Expr>        <Term> <ExprTail>
               |        <Term>

    <rule: Term>        <Number>
               |        <Var>
               |        \( <MATCH=Expr> \)

    <rule: ExprTail>    <Op> <Expr>

    <token: Op>         \+ | \- | \* | \/

    <token: Number>     \d++

    <token: Var>        [a-z]++
  }x;
};

Note that this simple grammar gives all operators equal precedence rather than Please Excuse My Dear Aunt Sally.

You want to extract all variable names, so you could walk the AST as in

sub all_variables {
  my($root,$var) = @_;

  $var ||= {};
  ++$var->{ $root->{Var} } if exists $root->{Var};
  all_variables($_, $var) for grep ref $_, values %$root;

  wantarray ? keys %$var : [ keys %$var ];
}

and print the result with

if ("(a + (b - c))" =~ $EXPR) {
  print "[$_]\n" for sort +all_variables \%/;
}
else {
  print "no match\n";
}

Another approach is to install an autoaction for the Var rule that records names of variables as they are successfully parsed.

package JustTheVarsMaam;

sub new { bless {}, shift }

sub Var {
  my($self,$result) = @_;
  ++$self->{VARS}{$result};
  $result;
}

sub all_variables { keys %{ $_[0]->{VARS} } }

1;

Call this one as in

my $vars = JustTheVarsMaam->new;
if ("(a + (b - c))" =~ $EXPR->with_actions($vars)) {
  print "[$_]\n" for sort $vars->all_variables;
}
else {
  print "no match\n";
}

Either way, the output is

[a]
[b]
[c]
like image 69
Greg Bacon Avatar answered Oct 04 '22 21:10

Greg Bacon


Recursivity is native with Marpa::R2 using the BNF in the __DATA__ section below:

#!env perl
use strict;
use diagnostics;
use Marpa::R2;

my $input = shift || '(a + (b - c))';

my $grammar_source = do {local $/; <DATA>};
my $recognizer = Marpa::R2::Scanless::R->new
  (
   {
    grammar => Marpa::R2::Scanless::G->new
    (
     {
      source => \$grammar_source,
      action_object => __PACKAGE__,
     }
    )
   },
  );
my %vars = ();
sub new { return bless {}, shift;}
sub varAction { ++$vars{$_[1]}};

$recognizer->read(\$input);
$recognizer->value() || die "No parse";

print join(', ', sort keys %vars)  . "\n";

__DATA__
:start ::= expr
expr ::= NUMBER
       | VAR action => varAction
       | expr OP expr
       | '(' expr ')'
NUMBER ~ [\d]+
VAR ~ [a-z]+
OP ~ [-+*/]
WS ~ [\s]+
:discard ~ WS

The output is:

a, b, c

Your question was adressing only how to get the variable names, so no notion of operator associativity and so on in this answer. Just note that Marpa has no problem with that, if needed.

like image 21
Jean-Damien Durand Avatar answered Oct 04 '22 22:10

Jean-Damien Durand