
A Perl 6 Regex to match a Perl 6 delimited comment

Tags: regex, raku

Anyone have a Perl 6 regular expression that will match Perl 6 delimited comments? I would prefer something that's short rather than a full grammar, but I rule out nothing.

As an example of what I am looking for, I want something that can parse the comments in here:

#`{ foo {} bar }
#`« woo woo »
say #`(
This is a (
long )
multiliner()) "You rock!"
#`{{ { And don't forget the tricky repeating delimiters }}

My overall goal is to be able to take a source file, strip the pod and comments, and then do interesting things with the code that is left. Stripping line comments and pod is pretty easy, but delimited comments require additional finesse. I also want this solution to be small and to use only Perl 6 core so I can stick it in my dotfiles repo without external dependencies.

zostay asked Mar 01 '19 04:03



1 Answer

Matching your examples

my %openers-closers = < { } « » ( ) >;        # (many more in reality)
my @openers         = %openers-closers.keys;  # { « ( ...
my ($open, $close);                           # possibly multiple chars

my token comment { '#`' <&open> <&middle> <&close> }

my token open {
  # Store first delimiter char:   Slurp as many as are repeated:
  ( ( @openers )                  $0* )

  # Store the full (possibly multiple character) delimiters:
  { $open = ~$0; $close = %openers-closers{$0[0]} x $0.chars }
}

my token middle {
  :my $nest-level; # for tracking nesting
  [
    # Continue if nested:  or if not at unnested end delimiter:
    [ <?{$nest-level}>     ||    <!&close> ]

    # Match either a nested delimiter:  or a single character: 
    ( $open || $close                   || . )

    # Keep track of nesting:
    { $_ = ~$0.tail; # set topic to latest match in list 
      $nest-level++ when $open; $nest-level-- when $close } 
  ]*
}

my token close { $close }

.say for $your-examples ~~ m:g / <.&comment> /

displays:

「#`{ foo {} bar }」
「#`« woo woo »」
「#`(
This is a (
long )
multiliner())」
「#`{{ { And don't forget the tricky repeating delimiters }}」

Hopefully the code is self-explanatory if you know P6 regexes. Please use the comments if you want clarification of any of it.
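Since the stated goal is stripping comments, the same nesting logic can also be written as a plain character scanner. What follows is a minimal sketch under some assumptions, not a tested replacement for the tokens above: `comment-end`, `strip-delimited-comments` and `%closer-for` are names made up for illustration, only the three delimiter pairs from the question are listed, and the scanner does not know about string literals, so a `#` followed by a backtick inside a quoted string would be treated as a comment start.

```raku
# Sample pairs only, taken from the question's examples.
my %closer-for = '{' => '}', '«' => '»', '(' => ')';

# Hypothetical helper: if a delimited comment starts at $pos, return the
# index just past its closing delimiter; otherwise return Nil.
sub comment-end(Str $src, Int $pos) {
    return Nil unless $src.substr($pos, 2) eq '#`';
    my $first = $src.substr($pos + 2, 1);
    return Nil unless %closer-for{$first}:exists;

    # Count how often the opener repeats, e.g. 2 for '{{'.
    my $n = 1;
    $n++ while $src.substr($pos + 2 + $n, 1) eq $first;
    my $open  = $first x $n;
    my $close = %closer-for{$first} x $n;

    my $i     = $pos + 2 + $n;
    my $depth = 1;                       # we are inside one opener already
    while $i < $src.chars {
        my $here = $src.substr($i, $n);
        if $here eq $close {
            $i += $n;
            return $i if --$depth == 0;  # outermost closer reached
        }
        elsif $here eq $open {
            $depth++;
            $i += $n;
        }
        else {
            $i++;
        }
    }
    return Nil;                          # unterminated comment
}

# Hypothetical helper: copy $src, skipping every delimited comment.
sub strip-delimited-comments(Str $src --> Str) {
    my $out = '';
    my $i   = 0;
    while $i < $src.chars {
        my $end = comment-end($src, $i);
        if $end.defined { $i = $end }
        else            { $out ~= $src.substr($i, 1); $i++ }
    }
    $out
}
```

Calling `strip-delimited-comments($your-examples)` should leave only the code outside the comments. An unterminated comment is left in place rather than swallowing the rest of the file, which seemed the safer choice for a stripping tool.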

Looking at related Rakudo source code

I wrote the above without referring to Rakudo's source code. (I wanted to see what I came up with without doing so.)

But I've now looked at the source code, which imo is more or less mandatory for anyone who is trying to do what you're doing and is serious about understanding how well it might work in the general case.

As a starting point, I was particularly interested in seeing if I could figure out why feeding this code to Rakudo (2018.12):

#`{{ {{ And don't forget the tricky repeating delimiters  } }}

yields the rather LTA (Less Than Awesome) compiler error:

Starter {{ is immediately followed by a combining codepoint...

This doesn't look directly relevant to your question but I encountered it when trying to understand the nested delimiter rules.

So when I got to this part of my answer I started by searching the Rakudo repo for "immediately followed". That led to a fail-terminator method in the P6 grammar. (Perhaps not of interest to you but it is to me.)

Here's what else I found in the standard grammar that imo is directly related to what you're trying to do, or at least to understanding precisely what rules the code applies when matching comments:

  • The comment:sym<#`(...)> token that parses these comments. This leads to:

  • The list of openers. This list should replace the measly 3 opener/closer pairs in my code that just match your examples.

  • The quibble token. This seems to be a generic "parse 'quoted' (delimited) thing". It leads to:

  • The babble token. This establishes a "start" and "stop" with this code:

    $<B>=[<?before .>]
    {
        # Work out the delimiters.
        my $c := $/;
        my @delims := $c.peek_delimiters($c.target, $c.pos);
        my $start := @delims[0];
        my $stop  := @delims[1];
    

The peek_delimiters method is not defined in the P6 grammar file.

A search in the Rakudo repo shows it's not anywhere in Rakudo or P6.

A search in NQP yields a routine in nqp's grammar (from which the Perl 6 grammar inherits, which is why the peek_delimiters call works and why I looked in NQP when I didn't find it in Rakudo/P6).
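To make the babble excerpt more concrete, here is a rough sketch of the kind of work a delimiter peek has to do, assuming it behaves the way the call site suggests: look at the character at the given position and return a start/stop pair. This is not a transcription of the NQP routine. `peek-delims` and the short `$brackets` string are stand-ins invented for illustration, the real bracket list is much longer, and treating a non-bracket character as its own closer is an assumption.

```raku
# Sample subset only: openers at even indexes, their closers right after.
my $brackets = '<>[](){}«»';

# Hypothetical stand-in for a delimiter peek: return the (start, stop)
# delimiter strings found at $pos in $target.
sub peek-delims(Str $target, Int $pos) {
    my $start = $target.substr($pos, 1);
    die "no delimiter at position $pos" if $start eq '';

    my $stop = $start;                   # assumption: non-brackets close themselves
    my $idx  = $brackets.index($start);
    if $idx.defined && $idx %% 2 {       # even index means an opening bracket
        # Count repeats of the opener, so that '{{' pairs with '}}'.
        my $len = 1;
        $len++ while $target.substr($pos + $len, 1) eq $start;
        $stop  = $brackets.substr($idx + 1, 1) x $len;
        $start = $start x $len;
    }
    return $start, $stop;
}

my ($start, $stop) = peek-delims('{{ { tricky }}', 0);
```

In this sketch a closing bracket in opener position is simply treated like any other character; a real grammar would presumably report that as an error instead.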

I'll stop at this point to draw a conclusion.

Conclusion

You've got a regex. It might work out as you intend. I don't know.

If you end up investigating the above Rakudo/NQP code and understand it well enough to write a walkthrough of what quibble, babble, nibble, et al. do, or discover a good existing write-up (I haven't searched for one yet), please add a comment to this answer linking to it. I'll do likewise. TIA!

raiph answered Nov 09 '22 07:11