Grammar a bit too greedy in Perl6

Question

I am having problems with this mini-grammar, which tries to match markdown-like header constructs.

role Like-a-word {
    regex like-a-word { \S+ }
}

role Span does Like-a-word {
    regex span { <like-a-word>[\s+ <like-a-word>]* } 
}
grammar Grammar::Headers does Span {
    token TOP {^ <header> \v+ $}

    token hashes { '#'**1..6 }

    regex header {^^ <hashes> \h+ <span> [\h* $0]? $$}
}

I would like it to match ## Easier ## as a header, but instead it takes ## as part of span:

TOP
|  header
|  |  hashes
|  |  * MATCH "##"
|  |  span
|  |  |  like-a-word
|  |  |  * MATCH "Easier"
|  |  |  like-a-word
|  |  |  * MATCH "##"
|  |  |  like-a-word
|  |  |  * FAIL
|  |  * MATCH "Easier ##"
|  * MATCH "## Easier ##"
* MATCH "## Easier ##
"
｢## Easier ##
｣
 header => ｢## Easier ##｣
  hashes => ｢##｣
  span => ｢Easier ##｣
   like-a-word => ｢Easier｣
   like-a-word => ｢##｣

The problem is that the [\h* $0]? group does not seem to do anything, and span gobbles up all available words, including the closing hashes. Any ideas?

moritz · Accepted Answer

First, as others have pointed out, <hashes> does not capture into $0, but instead, it captures into $<hashes>, so you have to write:

regex header {^^ <hashes> \h+ <span> [\h* $<hashes>]? $$}
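
As a rough illustration of where a subrule call stores its capture, here is a small sketch. The Demo grammar and its named and positional tokens are made up for this example; only hashes is taken from the question.

```raku
# Hypothetical demo grammar: only `hashes` comes from the question.
grammar Demo {
    token hashes     { '#' ** 1..6 }
    # <hashes> is a named capture, so the backreference is $<hashes>.
    token named      { <hashes> \h+ \w+ \h+ $<hashes> }
    # Wrapping the call in parentheses adds a positional capture, $0.
    token positional { (<hashes>) \h+ \w+ \h+ $0 }
}

my $named = Demo.parse('## Easier ##', :rule<named>);
say ~$named<hashes>;      # ##
say $named[0].defined;    # False: nothing was captured positionally

my $pos = Demo.parse('## Easier ##', :rule<positional>);
say ~$pos[0];             # ##
say ~$pos[0]<hashes>;     # ## (the named capture nested inside $0)
```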

But that still doesn't match the way you want, because span greedily consumes the trailing ##, and the [\h* $<hashes>]? part then happily matches zero occurrences.
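
The effect can be reproduced with a plain regex. This is only a sketch: \S+ [\s+ \S+]* stands in for <span>, and the $<hashes> and $<span> aliases are assumptions added to make the result easy to inspect.

```raku
my $line = '## Easier ##';

# Stand-in for the header rule, with span inlined as \S+ [\s+ \S+]*.
my $m = $line ~~ /
    ^^ $<hashes>=[ '#' ** 1..6 ] \h+
    $<span>=[ \S+ [\s+ \S+]* ]
    [\h* $<hashes>]?
    $$
/;

# The greedy span has already eaten the closing hashes, so the optional
# group matches nothing and the regex still succeeds without backtracking.
say ~$m<span>;   # Easier ##
```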

The proper fix is to not let span match ## as a word:

role Like-a-word {
    regex like-a-word { <!before '#'> \S+ }
}

role Span does Like-a-word {
    regex span { <like-a-word>[\s+ <like-a-word>]* } 
}
grammar Grammar::Headers does Span {
    token TOP {^ <header> \v+ $}

    token hashes { '#'**1..6 }

    regex header {^^ <hashes> \h+ <span> [\h* $<hashes>]? $$}
}

say Grammar::Headers.subparse("## Easier ##\n", :rule<header>);

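It may help to see which words the lookahead version still accepts. The sketch below copies like-a-word into a lexical regex so it can be tried on its own; the sample words are made up.

```raku
# Same body as like-a-word above, declared lexically for a quick test.
my regex like-a-word { <!before '#'> \S+ }

# The lookahead only inspects the first character of the word.
for 'Easier', '##', 'C#', '#tag' -> $word {
    say $word, ' => ', so $word ~~ / ^ <like-a-word> $ /;
}
```

Only words that start with # are rejected, so a word such as C# is still allowed inside a span.
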
If you are loath to modify like-a-word, you can instead forbid span from ending in a # by adding a negative lookbehind in header:

role Like-a-word {
    regex like-a-word { \S+ }
}

role Span does Like-a-word {
    regex span { <like-a-word>[\s+ <like-a-word>]* } 
}
grammar Grammar::Headers does Span {
    token TOP {^ <header> \v+ $}

    token hashes { '#'**1..6 }

    regex header {^^ <hashes> \h+ <span> <!after '#'> [\h* $<hashes>]? $$}
}

say Grammar::Headers.subparse("## Easier ##\n", :rule<header>);

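The lookbehind works by forcing the regex to backtrack into span. Here is a sketch with a plain regex, again assuming \S+ [\s+ \S+]* as a stand-in for <span> and using aliases that are not in the original grammar.

```raku
# Stand-in for the header rule, with span inlined as \S+ [\s+ \S+]*.
my $header = /
    ^^ $<hashes>=[ '#' ** 1..6 ] \h+
    $<span>=[ \S+ [\s+ \S+]* ]
    <!after '#'>
    [\h* $<hashes>]?
    $$
/;

# span first grabs "Easier ##", the lookbehind rejects that, and
# backtracking shrinks span until it no longer ends in '#'.
my $m = '## Easier ##' ~~ $header;
say ~$m<span>;              # Easier

# Side effect: a title whose last word ends in '#' no longer matches.
say so '## C#' ~~ $header;  # False
```

This variant is stricter than changing like-a-word: a title ending in # is rejected outright, which may or may not be what you want.
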
jjmerelo · Answer

Just change

  regex header {^^ <hashes> \h+ <span> [\h* $0]? $$}

to

  regex header {^^ (<hashes>) \h+ <span> [\h* $0]? $$}

The parentheses turn <hashes> into a positional capture, so $0 refers to it. Thanks to Eugene Barsky for pointing this out.

brian d foy · Answer

I played with this a bit because I thought there were two interesting things you might do.

First, you can make hashes take an argument that says how many hash marks it should match. That way you can do special things based on the level if you like, and you can reuse hashes in different parts of the grammar where you require different but exact numbers of hash marks.
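
A minimal sketch of that idea, assuming a made-up Levels grammar with hypothetical h2 and h3 tokens that are not part of the original question:

```raku
# Hypothetical grammar: only the parameterized `hashes` idea is from the answer.
grammar Levels {
    token hashes($n = 1) { '#' ** {$n} }
    token h2 { <hashes(2)> \h+ \N+ }
    token h3 { <hashes(3)> \h+ \N+ }
}

say so Levels.parse('## Title',  :rule<h2>);   # True
say so Levels.parse('### Title', :rule<h3>);   # True
say so Levels.parse('### Title', :rule<h2>);   # False: exactly two hashes required
```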

Next, the ~ stitcher lets you say that something appears between two delimiters, so you can write the opening and closing delimiters next to each other. For example, to match (Foo) you could write '(' ~ ')' 'Foo'. With that, it looks like I came up with the same thing you posted:

use Grammar::Tracer;

role Like-a-word {
    regex like-a-word { \S+ }
}

role Span does Like-a-word {
    regex span { <like-a-word>[\s+ <like-a-word>]* }
}

grammar Grammar::Headers does Span {
    token TOP {^ <header> \v+ $}

    token hashes ( $n = 1 ) { '#' ** {$n} }

    regex header { [(<hashes(2)>) \h*] ~ [\h* $0] <span>  }
}

my $result = Grammar::Headers.parse( "## Easier ##\n" );

say $result;
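
For readers who have not met ~ before, here is a small sketch of the successful case only, outside any grammar. The inner atom is written as a capture group so that it is unambiguously a single atom; what happens when the closing delimiter is missing is not covered here.

```raku
# '(' ~ ')' (\w+) reads as "a word, wrapped in parentheses".
my $tilde = / ^ '(' ~ ')' (\w+) $ /;
# The same successful match written in the usual left-to-right order.
my $plain = / ^ '(' (\w+) ')' $ /;

say so '(Foo)' ~~ $plain;   # True
say so '(Foo)' ~~ $tilde;   # True
say ~$0;                    # Foo
```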
