Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parsing nested tuples using perl regexp with $^R

Tags:

regex

perl

I want to learn how to create an abstract syntax tree for nested tuples using a Perl regexp with embedded code execution. I can easily program that using a Perl 6 grammar and I'm aware that using parsing modules would simplify the task in Perl 5, but I think for such simple tasks I should be able to do it without modules by learning how to mechanically translate from grammar definitions. I couldn't find a way to dereference $^R, so I try to undo the involuntary nesting at the end of the TUPLE rule definition, but the output is incorrect, e.g. some substrings appear twice.

use v5.10;
use Data::Dumper;

while (<DATA>) {
    chomp;
    /(?&TUPLE)(?{$a = $^R})
    (?(DEFINE)
        (?<TUPLE>
            T \s (?&ELEM) \s (?&ELEM)
            (?{ [$^R->[0][0],[$^R->[0][1],$^R[1]]] })
        )
        (?<ELEM>
            (?: (a) (?{ [$^R,$^N] }) | \( (?&TUPLE) \) )
        )
    )/x;
    say Dumper $a;
}

__DATA__
T a a
T (T a a) a
T a (T a a)
T (T a a) (T a a)
T (T (T a a) a) (T a (T a a))

Expected output data structure is a nested list:

['a','a'];
['a',['a','a']];
[['a','a'],'a'];
[['a','a'],['a','a']];
[[['a','a'],'a'],['a',['a','a']]]

For reference I'll also share my working Perl 6 code:

grammar Tuple {
  token TOP { 'T ' <elem> ' ' <elem> }
  token elem { 'a' | '(' <TOP> ')'}
}
class Actions {
  method TOP($/) {make ($<elem>[0].made, $<elem>[1].made)}
  method elem($/) {make $<TOP> ?? $<TOP>.made !! 'a'}
}
like image 400
rubystallion Avatar asked Nov 05 '17 10:11

rubystallion


1 Answers

Trying to figure out how to use (?{ ... }) constructs is almost always not worth the effort. In particular, this can have unexpected behaviour together with backtracking. It is also very difficult to debug such regexes since the control flow tends to be non-obvious.

Instead, it tends to be easier to do write an ad-hoc recursive descent parser with m//gc-style lexing: Each Perl string stores its last match offset. When applying a regex with m/\G ... /gc in scalar context, it can anchor at the last offset and advances the offset iff the match succeeds.

Here:

use strict;
use warnings;
use Test::More;

sub parse {
  my ($str) = @_;
  pos($str) = 0;  # set match position to beginning
  return parse_tuple(\$str);
}

sub parse_tuple {
  my ($ref) = @_;
  $$ref =~ /\G T \s/gcx or die error($ref, "expected tuple start T");
  my $car = parse_element($ref);
  $$ref =~ /\G \s /gcx or die error($ref, "expected space between tuple elements");
  my $cdr = parse_element($ref);
  return [$car, $cdr];
}

sub parse_element {
  my ($ref) = @_;
  return 'a' if $$ref =~ /\G a /gcx;

  $$ref =~ /\G \( /gcx or die error($ref, "expected opening paren for nested tuple");
  my $tuple = parse_tuple($ref);
  $$ref =~ /\G \) /gcx or die error($ref, "expected closing paren after nested tuple");
  return $tuple;
}

sub error {
  my ($ref, $msg) = @_;
  my $snippet = substr $$ref, pos($$ref), 20;
  return "$msg just before '$snippet...'";
}

is_deeply parse('T a a'), ['a','a'];
is_deeply parse('T (T a a) a'), [['a','a'],'a'];
is_deeply parse('T a (T a a)'), ['a',['a','a']];
is_deeply parse('T (T a a) (T a a)'), [['a','a'],['a','a']];
is_deeply parse('T (T (T a a) a) (T a (T a a))'), [[['a','a'],'a'],['a',['a','a']]];
done_testing;
like image 103
amon Avatar answered Nov 15 '22 09:11

amon