How do I match a hex array in perl6 grammar

I have a string like "39 3A 3B 9:;" and I want to extract "39, 3A, 3B".

I have tried

my $a = "39 3A 3B  9:;";
grammar Hex {
    token TOP { <hex_array>+ .* }
    token hex_array { <[0..9 A..F]> " " }
};
Hex.parse($a);

But this doesn't seem to work. Even this simpler version doesn't seem to work:

my $a = "39 3A 3B ";
grammar Hex {
    token TOP { <hex_array>+ }
    token hex_array { <[0..9 A..F]> " " }
};
Hex.parse($a);

I also tried Grammar::Tracer; both TOP and hex_array fail:

TOP
|  hex_array
|  * FAIL
* FAIL
asked Jun 03 '19 06:06 by tejas


1 Answer

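<[abcdef...]> in a P6 regex is a "character class" in the match-one-character sense,[1] so <[0..9 A..F]> on its own matches just one hex digit and then expects a space, which fails on the two-digit "39".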

The idiomatic way to get what you want is to use the ** quantifier:

my $a = "39 3A 3B ";
grammar Hex {
  token TOP { <hex_array>+ }
  token hex_array { <[0..9 A..F]>**1..2 " " }
};
Hex.parse($a);
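To actually pull out "39, 3A, 3B", you can read the hex_array captures from the match object. This is a minimal sketch using the grammar above; the $match and @values variable names are mine, and the trim is there only because this version of hex_array still captures the trailing space:

```raku
grammar Hex {
    token TOP { <hex_array>+ }
    token hex_array { <[0..9 A..F]>**1..2 " " }
}

my $match = Hex.parse("39 3A 3B ");

# Each hex_array capture includes the trailing space, so trim it off.
my @values = $match<hex_array>.map(*.Str.trim);
say @values.join(", ");   # 39, 3A, 3B
```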

The rest of this answer is "bonus" material on why and how to use rules.

You are of course perfectly free to match whitespace situations by including whitespace patterns in arbitrary individual tokens, like you did with " " in your hex_array token.

However, it's good practice to use rules instead when appropriate -- which is most of the time.

First, use ws instead of " ", \s* etc.

Let's remove the space in the second token and move it instead to the first one:

  token TOP { [ <hex_array> " " ]+ }
  token hex_array { <[0..9 A..F]>**1..2 }

We've added square bracketing ([...]) that combines the hex_array and a space and then applied the + quantifier to that combined atom. It's a simple change, and the grammar continues to work as before, matching the space as before, except now the space won't be captured by the hex_array token.
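As a quick check of that last point, here is a sketch of the same extraction with the space moved into TOP. The variable names are illustrative; note that no trimming is needed because the space is matched outside the hex_array capture:

```raku
grammar Hex {
    token TOP { [ <hex_array> " " ]+ }
    token hex_array { <[0..9 A..F]>**1..2 }
}

my $match = Hex.parse("39 3A 3B ");

# The space is matched by TOP, so each capture is just the hex digits.
my @values = $match<hex_array>.map(*.Str);
say @values.join(", ");   # 39, 3A, 3B
```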

Next, let's switch to using the built-in ws token:

  token TOP { [ <hex_array> <.ws> ]+ }

The default <ws> is more generally useful, in desirable ways, than \s*.[2] And if the default ws doesn't do what you need you can specify your own ws token.
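For instance, here is a sketch of a grammar that supplies its own ws. The grammar name HexHorizontal and the choice of \h* (horizontal whitespace only) are assumptions made for illustration, not something the question asks for:

```raku
grammar HexHorizontal {
    token TOP { [ <hex_array> <.ws> ]+ }
    token hex_array { <[0..9 A..F]>**1..2 }

    # Hypothetical override: only spaces and tabs may separate values.
    token ws { \h* }
}

say so HexHorizontal.parse("39 3A\t3B");   # True
say so HexHorizontal.parse("39\n3A");      # False, a newline is not \h
```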

We've used <.ws> instead of <ws> because, like \s*, use of <.ws> avoids additional capture of whitespace that would likely just clutter up the parse tree and waste memory.
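To see the difference, compare the named captures produced by the two forms. This is only a sketch; the grammar names Capturing and NonCapturing are made up for the comparison:

```raku
grammar Capturing {
    token TOP { [ <hex_array> <ws> ]+ }
    token hex_array { <[0..9 A..F]>**1..2 }
}

grammar NonCapturing {
    token TOP { [ <hex_array> <.ws> ]+ }
    token hex_array { <[0..9 A..F]>**1..2 }
}

# List the names of the captures stored in each parse tree.
my $with    = Capturing.parse("39 3A 3B ").hash.keys.sort.join(", ");
my $without = NonCapturing.parse("39 3A 3B ").hash.keys.sort.join(", ");

say $with;      # hex_array, ws
say $without;   # hex_array
```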

One often wants something like <.ws> after almost every token in higher level parsing rules that string tokens together. Written out explicitly, that would mean a lot of repetitive and distracting <.ws> and [ ... <.ws> ] boilerplate. To avoid it, there's a built-in shortcut that inserts the boilerplate for you: the rule declarator, which in turn uses :sigspace.

Using rule (which uses :sigspace)

A rule is exactly the same as a token except that it switches on :sigspace at the start of the pattern:

rule  {           <hex_array>+ }
token { :sigspace <hex_array>+ } # exactly the same thing

Without :sigspace (so in tokens and regexes by default), all literal spaces in a pattern (unless you quote them) are ignored. This is generally desirable for readable patterns of individual tokens because they typically specify literal things to match.

But once :sigspace is in effect, spaces after atoms become "significant" -- because they're implicitly converted to <.ws> or [ ... <.ws> ] calls. This is desirable for readable patterns specifying sequences of tokens or subrules because it's a natural way to avoid the clutter of all those extra calls.

The first pattern below will match one or more hex_array tokens with no spaces being matched either between them or at the end. The last two will match one or more hex_arrays, without intervening spaces, and then with or without spaces at the very end:

  token TOP {           <hex_array>+ }
  #          ^ ignored ^            ^ ignored

  token TOP { :sigspace <hex_array>+ }
  #          ^ ignored ^            ^ significant

  rule TOP  {           <hex_array>+ }
  #          ^ ignored ^            ^ significant

NB. Adverbs (like :sigspace) aren't atoms. Spaces immediately before the first atom (in the above, spaces before <hex_array>) are never significant (regardless of whether :sigspace is or isn't in effect). But thereafter, if :sigspace is in effect, all non-quoted spacing in the pattern is "significant" -- that is, it's converted to <.ws> or [ ... <.ws> ].

In the above code, the second token and the rule would match a run of hex_arrays followed by optional spaces (for example "39 "), because the space immediately after the + and before the } means the pattern is rewritten to:

  token TOP { <hex_array>+ <.ws> }

But this rewritten token won't match if your input has multiple hex_array tokens with one or more spaces between them. Instead you would want to write:

  rule TOP { <hex_array> + }
  # ignored ^           ^ ^ both these spaces are significant

which is rewritten to:

  token TOP { [ <hex_array> <.ws> ]+ <.ws> }

This will match your input.
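Here is a small side-by-side sketch of the two spellings against your input. The grammar names Tight and Loose are just labels I've picked for the comparison:

```raku
grammar Tight {
    rule TOP { <hex_array>+ }        # no space before the +
    token hex_array { <[0..9 A..F]>**1..2 }
}

grammar Loose {
    rule TOP { <hex_array> + }       # space before the +
    token hex_array { <[0..9 A..F]>**1..2 }
}

my $input = "39 3A 3B ";
say so Tight.parse($input);   # False, spaces between values are not allowed
say so Loose.parse($input);   # True
```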

Conclusion

So, after all that apparent complexity, which is really just me being exhaustively precise, I'm suggesting you might write your original code as:

my $a = "39 3A 3B ";
grammar Hex {
  rule TOP { <hex_array> + }
  token hex_array { <[0..9 A..F]>**1..2 }
};
Hex.parse($a);

and this would match more flexibly than your original (I'm presuming that would be a good thing though of course it might not be for some use cases) and would perhaps be easier to read for most P6ers.

Finally, to reinforce how to avoid two of the three gotchas of rules, see also What's the best way to be lax on whitespace in a perl6 grammar?. (The third gotcha is whether you need to put a space between an atom and a quantifier, as with the space between the <hex_array> and the + in the above.)

Footnotes

[1] If you want to match multiple characters, then append a suitable quantifier to the character class. This is a sensible way for things to be, and the assumed behavior of a "character class" according to Wikipedia. Unfortunately the P6 doc currently confuses the issue, e.g. by lumping together both genuine character classes and other rules that match multiple characters under the heading Predefined character classes.

[2] The default ws rule is designed to match between words, where a "word" is a contiguous sequence of letters (Unicode category L), digits (Nd), or underscores. In code, it's specified as:

regex ws { <!ww> \s* }

ww is a "within word" test. So <!ww> means not within a "word". <ws> will always succeed where \s* would -- except that, unlike \s*, it won't succeed in the middle of a word. (Like any other atom quantified with a *, a plain \s* will always match because it matches any number of spaces, including none at all.)
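A sketch of that difference, using input with no separators at all. The grammar names WithWs and WithS are illustrative only:

```raku
grammar WithWs {
    token TOP { [ <hex_array> <.ws> ]+ }
    token hex_array { <[0..9 A..F]>**1..2 }
}

grammar WithS {
    token TOP { [ <hex_array> \s* ]+ }
    token hex_array { <[0..9 A..F]>**1..2 }
}

# "393A3B" is one unbroken "word", so <.ws> refuses to match inside it.
say so WithWs.parse("393A3B");    # False
say so WithS.parse("393A3B");     # True, \s* happily matches nothing

# With real separators both forms succeed.
say so WithWs.parse("39 3A 3B");  # True
say so WithS.parse("39 3A 3B");   # True
```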

answered Sep 23 '22 10:09 by raiph