Stopping Raku grammar at EOS (End of String)

In the process of writing a translator of one music language to another (ABC to Alda) as an excuse to learn Raku DSL-ability, I noticed that there doesn't seem to be a way to terminate a .parse! Here is my shortened demo code:

#!/home/hsmyers/rakudo741/bin/perl6
use v6d;

# use Grammar::Debugger;
use Grammar::Tracer;

my $test-n01 = q:to/EOS/;
a b c d e f g
A B C D E F G
EOS

grammar test {
  token TOP { <score>+ }
  token score {
      <.ws>?
      [
          | <uc>
          | <lc>
      ]+
      <.ws>?
  }
  token uc { <[A..G]> }
  token lc { <[a..g]> }
}

test.parse($test-n01).say;

And it is the last part of the Grammar::Tracer display that demonstrates my problem.

|  score
|  |  uc
|  |  * MATCH "G"
|  * MATCH "G\n"
|  score
|  * FAIL
* MATCH "a b c d e f g\nA B C D E F G\n"
「a b c d e f g
A B C D E F G
」

On the second to last line, the word FAIL tells me that the .parse run has no way of quitting. I wonder if this is correct? The .say displays everything as it should be, so I'm not clear on how real the FAIL is? The question remains, "How do I correctly write a grammar that parses multiple lines without error?"

hsmyers asked Dec 26 '19 01:12


1 Answer

When you use the grammar debugger, it lets you see exactly how the engine is parsing the string, and fails are normal and expected. Consider, for example, matching a+b* against the string aab. You should get two matches for 'a', followed by a fail (because b is not a), but then it will retry with b and successfully match.

This might be more easily seen if you do an alternation with || (which enforces order). If you have

rule  TOP   { I have a <fruit> }
token fruit { apple || orange || kiwi }

and you parse the sentence "I have a kiwi", you'll see it first match "I have a", followed by two fails with "apple" and "orange", and finally a match with "kiwi".
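A runnable sketch of that fruit example follows. The grammar name Fruit and the separate apple/orange/kiwi tokens are assumptions added for illustration: giving each alternative its own named token should let a tracer report them individually. TOP is declared as a rule so that the spaces in "I have a" are significant.

```raku
# Hypothetical grammar name and sub-tokens, for illustration only.
grammar Fruit {
    rule  TOP    { I have a <fruit> }              # rule: whitespace in the pattern matters
    token fruit  { <apple> || <orange> || <kiwi> } # || tries alternatives left to right
    token apple  { apple }
    token orange { orange }
    token kiwi   { kiwi }
}

my $m = Fruit.parse('I have a kiwi');
say $m.so;          # True
say $m<fruit>.Str;  # kiwi
```

With Grammar::Tracer loaded (it is not loaded here), I would expect apple and orange to be reported as FAIL before kiwi is reported as a MATCH, even though the overall parse succeeds.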

Now let's look at your case:

TOP                  # Trying to match top (need >=1 match of score)
|  score             #   Trying to match score (need >=1 match of lc/uc)
|  |  lc             #     Trying to match lc
|  |  * MATCH "a"    #     lc had a successful match! ("a")
|  * MATCH "a "      #   and as a result so did score! ("a ")
|  score             #   Trying to match score again (because <score>+)
|  |  lc             #     Trying to match lc 
|  |  * MATCH "b"    #     lc had a successful match! ("b")
|  * MATCH "b "      #   and as a result so did score! ("b ")
……………                #     …so forth and so on until…
|  score             #   Trying to match score again (because <score>+)
|  |  uc             #     Trying to match uc
|  |  * MATCH "G"    #     uc had a successful match! ("G")
|  * MATCH "G\n"     #   and as a result, so did score! ("G\n")
|  score             #   Trying to match *score* again (because <score>+)
|  * FAIL            #   failed to match score, because no lc/uc.
|
|  # <--------------   At this point, the question is, did TOP match?
|  #                     Remember, TOP is <score>+, so we match TOP if there 
|  #                     was at least one <score> token that matched, there was so...
|
* MATCH "a b c d e f g\nA B C D E F G\n" # this is the TOP match

The fail here is normal: at some point we will run out of <score> tokens, so a fail is inevitable. When that happens, the grammar engine moves on to whatever comes after the <score>+ in your grammar. Since there's nothing, and the whole string has already been consumed, the parse succeeds (.parse anchors TOP to both ends of the string, as if it were wrapped in /^…$/).
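To see that the final FAIL only ends the repetition and does not fail the parse, here is a small sketch that reuses the question's grammar under the hypothetical name Demo, with the Tracer left out. It compares .parse, which must consume the whole string, with .subparse, which only has to match from the start.

```raku
# Same tokens as in the question; the name Demo is an assumption.
grammar Demo {
    token TOP   { <score>+ }
    token score { <.ws>? [ <uc> | <lc> ]+ <.ws>? }
    token uc    { <[A..G]> }
    token lc    { <[a..g]> }
}

# <score>+ stops at end of input, and .parse is satisfied because nothing is left.
say Demo.parse("a b c\nA B C\n").so;   # True

# Here <score>+ stops at "1"; input remains, so .parse as a whole fails.
say Demo.parse("a b c 1").so;          # False

# .subparse does not require reaching the end, so the prefix still matches.
say Demo.subparse("a b c 1").Str.raku; # "a b c "
```

In both successful cases the last attempt at <score> fails; what decides the overall result is whether anything is left unparsed afterwards.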

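Also, you might consider rewriting your grammar with a rule, which inserts <.ws> automatically wherever the pattern contains whitespace (unless it's important for it to be a single space only). Note the space before the +: in a rule, <score> + allows whitespace between repetitions, whereas <score>+ would only allow it after the last one.

grammar test {
  rule TOP { <score> + }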
  token score {
      [
          | <uc>
          | <lc>
      ]+
  }
  token uc { <[A..G]> }
  token lc { <[a..g]> }
}
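In a rule, where the whitespace sits relative to the quantifier changes what is matched. The sketch below uses two hypothetical grammars, Tight and Loose, that differ only in that spacing; the single score token is a simplification of the uc/lc pair.

```raku
# Hypothetical grammars, differing only in the space before the quantifier.
grammar Tight {
    rule  TOP   { <score>+ }         # whitespace allowed only after the last <score>
    token score { <[a..g A..G]>+ }
}

grammar Loose {
    rule  TOP   { <score> + }        # whitespace allowed between repetitions too
    token score { <[a..g A..G]>+ }
}

say Tight.parse("a b c").so;   # False
say Loose.parse("a b c").so;   # True
say Tight.parse("abc ").so;    # True
```

Tight still accepts "abc " because the letters are contiguous and the trailing space is covered by the whitespace after the quantifier.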

Further, in my experience, you might also want to add a proto token for the uc/lc, because when you have [ <foo> | <bar> ] one of them will always be undefined, which can make processing them in an actions class a bit annoying. You could try:

grammar test {
  rule  TOP   { <score>  + }
  token score { <letter> + }

  proto token letter    {     *    }
        token letter:uc { <[A..G]> }
        token letter:lc { <[a..g]> }
}

$<letter> will always be defined this way.
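As a sketch of why that helps in an actions class, the example below pairs a proto-token grammar with a small actions class. The names Notes and NotesActions, the separators, and the sample input are assumptions for illustration, and the candidates are spelled with :sym<…>, the form shown in the Raku documentation.

```raku
# Hypothetical grammar and actions class, for illustration only.
grammar Notes {
    rule  TOP   { <score> + }
    token score { <letter> + }

    proto token letter         {*}
          token letter:sym<uc> { <[A..G]> }
          token letter:sym<lc> { <[a..g]> }
}

class NotesActions {
    # Every letter lands in $<letter>, whichever candidate matched,
    # so there is no need to test $<uc> and $<lc> separately.
    method score($/) { make $<letter>.map(*.Str).join('-') }
    method TOP($/)   { make $<score>.map(*.made).join(' | ') }
}

my $m = Notes.parse("ab C\nd\n", actions => NotesActions.new);
say $m.made;   # a-b | C | d
```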

user0721090601 answered Nov 07 '22 07:11