Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extract tokens from grammar

I have been working through the Advent of Code problems in Perl6 this year and was attempting to use a grammar to parse the Day 3's input.

Given input in this form: #1 @ 1,3: 4x4 and this grammar that I created:

grammar Claim {
  token TOP {
    '#' <id> \s* '@' \s* <coordinates> ':' \s* <dimensions>
  }

  token digits {
    <digit>+
  }

  token id {
    <digits>
  }

  token coordinates {
    <digits> ',' <digits>
  }

  token dimensions {
    <digits> 'x' <digits>
  }
}

say Claim.parse('#1 @ 1,3: 4x4');

I am interested in extracting the actual tokens that were matched i.e. id, x + y from coordinates, and height + width from the dimensions from the resulting parse. I understand that I can pull them from the resulting Match object of Claim.parse(<input>), but I have to dig down through each grammar production to get the value I need e.g.

say $match<id>.hash<digits>.<digit>;

this seems a little messy, is there a better way?

like image 536
Hunter McMillen Avatar asked Dec 10 '18 18:12

Hunter McMillen


2 Answers

For the particular challenge you're solving, using a grammar is like using a sledgehammer to crack a nut.

Like @Scimon says, a single regex would be fine. You can keep it nicely readable by laying it out appropriately. You can name the captures and keep them all at the top level:

/ ^
  '#' $<id>=(\d+) ' '
  '@ ' $<x>=(\d+) ',' $<y>=(\d+)
  ': ' $<w>=(\d+)  x  $<d>=(\d+)
  $
/;

say ~$<id x y w d>; # 1 1 3 4 4

(The prefix ~ calls .Str on the value on its right hand side. Called on a Match object it stringifies to the matched strings.)

With that out the way, your question remains perfectly cromulent as it is because it's important to know how P6 scales in this regard from simple regexes like the one above to the largest and most complex parsing tasks. So that's what the rest of this answer covers, using your example as the starting point.

Digging less messily

say $match<id>.hash<digits>.<digit>; # [「1」]

this seems a little messy, is there a better way?

Your say includes unnecessary code and output nesting. You could just simplify to something like:

say ~$match<id> # 1

Digging a little deeper less messily

I am interested in extracting the actual tokens that were matched i.e. id, x + y from coordinates, and height + width from the dimensions from the resulting parse.

For matches of multiple tokens you no longer have the luxury of relying on Perl 6 guessing which one you mean. (When there's only one, guess which one it guesses you mean. :))

One way to write your say to get the y coordinate:

say ~$match<coordinates><digits>[1] # 3

If you want to drop the <digits> you can mark which parts of a pattern should be stored in a list of numbered captures. One way to do so is to put parentheses around those parts:

token coordinates { (<digits>) ',' (<digits>) }

Now you've eliminated the need to mention <digits>:

say ~$match<coordinates>[1] # 3

You could also name the new parenthesized captures:

token coordinates { $<x>=(<digits>) ',' $<y>=(<digits>) }

say ~$match<coordinates><y> # 3

Pre-digging

I have to dig down through each grammar production to get the value I need

The above techniques still all dig down into the automatically generated parse tree which by default precisely corresponds to the tree implicit in the grammar's hierarchy of rule calls. The above techniques just make the way you dig into it seem a little shallower.

Another step is to do the digging work as part of the parsing process so that the say is simple.

You could inline some code right into the TOP token to store just the interesting data you've made. Just insert a {...} block in the appropriate spot (for this sort of thing that means the end of the token given that you need the token pattern to have already done its matching work):

my $made;
grammar Claim {
  token TOP {
    '#' <id> \s* '@' \s* <coordinates> ':' \s* <dimensions>
     { $made = ~($<id>, $<coordinatess><x y>, $<dimensions><digits>[0,1]) }
  }
...

Now you can write just:

say $made # 1 1 3 4 4

This illustrates that you can just write arbitrary code at any point in any rule -- something that's not possible with most parsing formalisms and their related tools -- and the code can access the parse state as it is at that point.

Pre-digging less messily

Inlining code is quick and dirty. So is using a variable.

The normal thing to do for storing data is to instead use the make function. This hangs data off the match object that's being constructed corresponding to a given rule. This can then be retrieved using the .made method. So instead of $make = you'd have:

{ make ~($<id>, $<coordinatess><x y>, $<dimensions><digits>[0,1]) }

And now you can write:

say $match.made # 1 1 3 4 4

That's much tidier. But there's more.

A sparse subtree of a parse tree

.oO ( 🎶 On the first day of an imagined 2019 Perl 6 Christmas Advent calendar 🎶 a StackOverflow title said to me ... )

In the above example I constructed a .made payload for just the TOP node. For larger grammars it's common to form a sparse subtree (a term I coined for this because I couldn't find a standard existing term).

This sparse subtree consists of the .made payload for the TOP that's a data structure referring to .made payloads of lower level rules which in turn refer to lower level rules and so on, skipping uninteresting intermediate rules.

The canonical use case for this is to form an Abstract Syntax Tree after parsing some programming code.

In fact there's an alias for .made, namely .ast:

say $match.ast # 1 1 3 4 4

While this is trivial to use, it's also fully general. P6 uses a P6 grammar to parse P6 code -- and then builds an AST using this mechanism.

Making it all elegant

For maintainability and reusability you can and typically should not insert code inline at the end of rules but should instead use Action objects.

In summary

There are a range of general mechanisms that scale from simple to complex scenarios and can be combined as best fits any given use case.

Add parentheses as I explained above, naming the capture that those parentheses zero in on, if that is a nice simplification for digging into the parse tree.

Inline any action you wish to take during parsing of a rule. You get full access to the parse state at that point. This is great for making it easy to extract just the data you want from a parse because you can use the make convenience function. And you can abstract all actions that are to be taken at the end of successfully matching rules out of a grammar, ensuring this is a clean solution code-wise and that a single grammar remains reusable for multiple actions.

One final thing. You may wish to prune the parse tree to omit unnecessary leaf detail (to reduce memory consumption and/or simplify parse tree displays). To do so, write <.foo>, with a dot preceding the rule name, to switch the default automatic capturing off for that rule.

like image 110
raiph Avatar answered Nov 18 '22 13:11

raiph


You can refer to each of you named portions directly. So to get the cordinates you can access :

say $match.<coordinates>.<digits>

this will return you the Array of digits matches. Ig you just want the values the easiest way is probably :

say $match.<coordinates>.<digits>.map( *.Int) or say $match.<coordinates>.<digits>>>.Int or even say $match.<coordinates>.<digits>».Int

to cast them to Ints

For the id field it's even easier you can just cast the <id> match to an Int :

say $match.<id>.Int

like image 2
Scimon Proctor Avatar answered Nov 18 '22 15:11

Scimon Proctor