
Perl 6 Grammar doesn't match like I think it should

I'm doing Advent of Code day 9:

You sit for a while and record part of the stream (your puzzle input). The characters represent groups - sequences that begin with { and end with }. Within a group, there are zero or more other things, separated by commas: either another group or garbage. Since groups can contain other groups, a } only closes the most-recently-opened unclosed group - that is, they are nestable. Your puzzle input represents a single, large group which itself contains many smaller ones.

Sometimes, instead of a group, you will find garbage. Garbage begins with < and ends with >. Between those angle brackets, almost any character can appear, including { and }. Within garbage, < has no special meaning.

In a futile attempt to clean up the garbage, some program has canceled some of the characters within it using !: inside garbage, any character that comes after ! should be ignored, including <, >, and even another !.

Of course, this screams out for a Perl 6 Grammar...

grammar Stream
{
    rule TOP { ^ <group> $ }

    rule group { '{' [ <group> || <garbage> ]* % ',' '}' }
    rule garbage { '<' [ <garbchar> | <garbignore> ]* '>' }

    token garbignore { '!' . }
    token garbchar { <-[ !> ]> }
}

This seems to work fine on simple examples, but it goes wrong with two garbchars in a row:

say Stream.parse('{<aa>}');

gives Nil.

Grammar::Tracer is no help:

TOP
|  group
|  |  group
|  |  * FAIL
|  |  garbage
|  |  |  garbchar
|  |  |  * MATCH "a"
|  |  * FAIL
|  * FAIL
* FAIL
Nil

Multiple garbignores are no problem:

say Stream.parse('{<!!a!a>}');

gives:

「{<!!a!a>}」
 group => 「{<!!a!a>}」
  garbage => 「<!!a!a>」
   garbignore => 「!!」
   garbchar => 「a」
   garbignore => 「!a」

Any ideas?

mscha asked Dec 09 '17 12:12


1 Answer

UPD: Given that the Advent of Code problem doesn't mention whitespace, you shouldn't be using the rule construct at all. Just switch all the rules to tokens and you should be set. In general, follow Brad's advice: use token unless you know you need a rule (discussed below) or a regex (if you need backtracking).
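A minimal sketch of that change, assuming the grammar from the question is otherwise kept exactly as written. The name StreamTokens is hypothetical and only used here to avoid clashing with the original Stream:

```raku
# Same grammar as in the question, with every `rule` replaced by `token`.
# In a token, whitespace inside the pattern is not significant, so no
# implicit <.ws> calls are inserted between atoms.
grammar StreamTokens {
    token TOP        { ^ <group> $ }

    token group      { '{' [ <group> || <garbage> ]* % ',' '}' }
    token garbage    { '<' [ <garbchar> | <garbignore> ]* '>' }

    token garbignore { '!' . }
    token garbchar   { <-[ !> ]> }
}

say so StreamTokens.parse('{<aa>}');            # True
say so StreamTokens.parse('{{<ab>},{<ab>}}');   # True
```

The `so` prefix just collapses the match object to a Boolean; drop it to see the full parse tree.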


My original answer below explored why the rules didn't work. I'll leave it in for now.


TL;DR <garbchar> | contains a space. Whitespace that directly follows any atom in a rule indicates a tokenizing break. You can simply remove this inappropriate space, i.e. write <garbchar>| instead (or better still, <.garbchar>| if you don't need to capture the garbage) to get the result you seek.
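To see the tokenizing break in isolation, here is a small sketch that spells out the implicit <.ws> call a rule inserts after an atom. It assumes the default ws definition, which refuses to match between two word characters:

```raku
# In a rule, `<garbchar> ` (atom followed by whitespace) behaves like
# `<garbchar> <.ws>`. The plain regexes below write that <.ws> explicitly.
say so 'aa'  ~~ / ^ a <.ws> a $ /;     # False: no break allowed inside "aa"
say so 'a a' ~~ / ^ a <.ws> a $ /;     # True:  the space is a valid break
say so 'a!'  ~~ / ^ a <.ws> '!' $ /;   # True:  "a" then "!" is a word boundary
```

The third line is why the `{<!!a!a>}` example from the question parsed: every `a` there is followed by punctuation, never by another word character.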


As your original question allowed, this isn't a bug, it's just that your mental model is off.

Your answer correctly identifies the issue: tokenization.

So what we're left with is your follow up question, which is about your mental model of tokenization, or at least how Perl 6 tokenizes by default:

why ... my second example ... goes wrong with two garbchars in a row:

'{<aa>}'

Simplifying, the issue is how to tokenize this:

aa

The simple high level answer is that, in parsing vernacular, aa will ordinarily be treated as one token, not two, and, by default, Perl 6 assumes this ordinary definition. This is the issue you're encountering.

You can overrule this ordinary definition to get any tokenizing result you care to achieve. But it's seldom necessary to do so and it certainly isn't in simple cases like this.

I'll provide two redundant paths that I hope might lead folk to the correct mental model:

  • For those who prefer diving straight into nitty gritty detail, there's a reddit comment I wrote recently about tokenization in Perl 6.

  • The rest of this SO answer provides a high level discussion that complements the low level explanation in my reddit comment.

Excerpting from the "Obstacles" section of the Wikipedia page on tokenization, and interleaving the excerpts with P6-specific discussion:

Typically, tokenization occurs at the word level. However, it is sometimes difficult to define what is meant by a "word". Often a tokenizer relies on simple heuristics, for example:

  • Punctuation and whitespace may or may not be included in the resulting list of tokens.

In Perl 6 you control what gets included or not in the parse tree using capturing features that are orthogonal to tokenizing.

  • All contiguous strings of alphabetic characters are part of one token; likewise with numbers.

  • Tokens are separated by whitespace characters, such as a space or line break, or by punctuation characters.

By default, the Perl 6 design embodies an equivalent of these two heuristics.

The key thing to get is that it's the rule construct that handles a string of tokens, plural. The token construct is used to define a single token per call.
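As a rough illustration of that split, the sketch below parses the same pattern once as a rule and once as a token. The grammar names AsRule and AsToken are hypothetical and not part of the question's code:

```raku
# A rule treats the whitespace between its atoms as a token break (<.ws>).
grammar AsRule {
    rule  TOP  { <word> <word> }
    token word { \w+ }
}

# A token ignores that whitespace: the two <word> calls must be adjacent.
grammar AsToken {
    token TOP  { <word> <word> }
    token word { \w+ }
}

say so AsRule.parse('foo bar');    # True:  two tokens separated by a space
say so AsRule.parse('foobar');     # False: "foobar" is a single token
say so AsToken.parse('foo bar');   # False: nothing in the token matches the space
```

In both grammars `word` is the single-token definition; only the construct used for TOP decides whether breaks between tokens are expected.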

I think I'll end my answer here because it's already getting pretty long. Please use the comments to help us improve this answer. I hope what I've written so far helps.

raiph answered Sep 22 '22 04:09