
A Perl 6 Regex to match a Perl 6 delimited comment

Tags:

regex

raku

Anyone have a Perl 6 regular expression that will match Perl 6 delimited comments? I would prefer something that's short rather than a full grammar, but I rule out nothing.

As an example of what I am looking for, I want something that can parse the comments in here:

#`{ foo {} bar }
#`« woo woo »
say #`(
This is a (
long )
multiliner()) "You rock!"
#`{{ { And don't forget the tricky repeating delimiters }}

My overall goal is to take a source file, strip the pod and comments, and then do interesting things with the code that is left. Stripping line comments and pod is pretty easy, but delimited comments require additional finesse. I also want the solution to be small and to use only Perl 6 core, so I can stick it in my dotfiles repo without external dependencies.

asked Mar 01 '19 04:03

zostay

1 Answer

Matching your examples

my %openers-closers = < { } « » ( ) >;        # (many more in reality)
my @openers         = %openers-closers.keys;  # { « ( ...
my ($open, $close);                           # possibly multiple chars

my token comment { '#`' <&open> <&middle> <&close> }

my token open {
  # Store first delimiter char:   Slurp as many as are repeated:
  ( ( @openers )                  $0* )

  # Store the full (possibly multiple character) delimiters:
  { $open = ~$0; $close = %openers-closers{$0[0]} x $0.chars }
}

my token middle {
  :my $nest-level; # for tracking nesting
  [
    # Continue if nested:  or if not at unnested end delimiter:
    [ <?{$nest-level}>     ||    <!&close> ]

    # Match either a nested delimiter:  or a single character: 
    ( $open || $close                   || . )

    # Keep track of nesting:
    { $_ = ~$0.tail; # set topic to latest match in list 
      $nest-level++ when $open; $nest-level-- when $close } 
  ]*
}

my token close { $close }

.say for $your-examples ~~ m:g / <.&comment> /

displays:

｢{ foo {} bar }｣
｢« woo woo »｣
｢(
This is a (
long )
multiliner())｣
｢{{ { And don't forget the tricky repeating delimiters }}｣

Hopefully the code is self-explanatory if you know P6 regexes. Please use the comments if you want clarification of any of it.
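
If a hand-rolled scanner is acceptable instead of a regex, the same nesting rule (count repeats of the full opener, ignore shorter runs) can be sketched with a plain counter. This is only a minimal sketch. It is not the regex approach above and not how Rakudo does it. The names strip-delimited-comments and %closer-of are hypothetical, and the bracket table is an assumed subset of the real list. The scanner also knows nothing about string literals or pod, so a #`( inside a quoted string would be stripped too.

```raku
# Assumed subset of bracket pairs; the real list of openers is much longer.
my %closer-of = '{' => '}', '«' => '»', '(' => ')', '[' => ']', '<' => '>';

# Hypothetical helper: return $src with every delimited comment removed.
sub strip-delimited-comments(Str $src --> Str) {
    my $out = '';
    my $pos = 0;
    while $pos < $src.chars {
        my $head = $src.substr($pos, 3);                      # e.g. "#`{"
        my $ch   = $head.chars == 3 ?? $head.substr(2, 1) !! '';
        if $head.starts-with('#`') and %closer-of{$ch}:exists {
            # The opener may be repeated, e.g. "{{".
            my $len = 1;
            $len++ while $src.substr($pos + 2 + $len, 1) eq $ch;
            my $open  = $ch x $len;
            my $close = %closer-of{$ch} x $len;

            # Walk forward, counting nested full-length openers and closers.
            my $depth = 1;
            my $i     = $pos + 2 + $len;
            while $depth > 0 and $i < $src.chars {
                if    $src.substr($i, $len) eq $open  { $depth++; $i += $len }
                elsif $src.substr($i, $len) eq $close { $depth--; $i += $len }
                else                                  { $i++ }
            }
            $pos = $i;   # skip the comment (or the rest of the input if unterminated)
        }
        else {
            $out ~= $src.substr($pos, 1);
            $pos++;
        }
    }
    return $out;
}

say strip-delimited-comments('say #`( a ( b ) c) 42;');   # say  42;
```

An unterminated comment swallows the rest of the input here rather than raising an error, which may or may not be what you want for a dotfiles-level tool.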

Looking at related Rakudo source code

I wrote the above without referring to Rakudo's source code. (I wanted to see what I came up with without doing so.)

But I've now looked at the source code, which imo is more or less mandatory for anyone who is trying to do what you're doing and is serious about understanding how well it might work in the general case.

As a starting point, I was particularly interested in seeing if I could figure out why feeding this code to Rakudo (2018.12):

#`{{ {{ And don't forget the tricky repeating delimiters  } }}

yields the rather LTA (Less Than Awesome) compiler error:

Starter {{ is immediately followed by a combining codepoint...

This doesn't look directly relevant to your question but I encountered it when trying to understand the nested delimiter rules.

So when I got to this part of my answer I started by searching the Rakudo repo for "immediately followed". That led to a fail-terminator method in the P6 grammar. (Perhaps not of interest to you but it is to me.)

Here's what else I found in the standard grammar that imo is directly related to what you're trying to do, or at least to understanding precisely what rules the code applies when matching comments:

- The comment:sym<#`(...)> token that parses these comments. This leads to:
- The list of openers. This list should replace the measly 3 opener/closer pairs in my code that just match your examples.
- The quibble token. This seems to be a generic "parse 'quoted' (delimited) thing". It leads to:
- The babble token. This establishes a "start" and "stop" with this code:

$<B>=[<?before .>]
{
    # Work out the delimiters.
    my $c := $/;
    my @delims := $c.peek_delimiters($c.target, $c.pos);
    my $start := @delims[0];
    my $stop  := @delims[1];

The peek_delimiters method is not defined in the P6 grammar file.

A search of the Rakudo repo shows it isn't defined anywhere in Rakudo or P6.

A search of NQP finds it as a method in NQP's grammar (HLL::Grammar), from which the Perl 6 grammar inherits. That is why the peek_delimiters call works, and why I looked in NQP when I didn't find it in Rakudo/P6.
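
For what it's worth, here is a rough Raku sketch of the kind of work a routine like that has to do: look at the character at a position, check that it is an opening bracket, count how often it repeats, and hand back matching "start" and "stop" strings. This is my own illustration, not a translation of the NQP code. The name peek-delims and the five-pair %closer-of table are made up for the example.

```raku
# Assumed, tiny subset of bracket pairs (hypothetical table for illustration).
my %closer-of = '(' => ')', '[' => ']', '{' => '}', '<' => '>', '«' => '»';

# Hypothetical helper: return (start, stop) for the opener found at $pos.
sub peek-delims(Str $target, Int $pos) {
    my $first = $pos < $target.chars ?? $target.substr($pos, 1) !! '';
    die "No opening bracket at position $pos" unless %closer-of{$first}:exists;

    # Count how many times the opening character repeats, e.g. "{{" gives 2.
    my $len = 1;
    $len++ while $target.substr($pos + $len, 1) eq $first;

    # The closer is the paired bracket, repeated the same number of times.
    return ($first x $len, %closer-of{$first} x $len);
}

my ($start, $stop) = peek-delims('#`{{ nested }}', 2);
say "$start $stop";   # {{ }}
```

The regex in the first half of this answer does the same thing inside its open token, where $open and $close play the role of $start and $stop.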

I'll stop at this point to draw a conclusion.

Conclusion

You've got a regex. It might work out as you intend. I don't know.

If you end up investigating the above Rakudo/NQP code and understand it well enough to write a walkthrough of what quibble, babble, nibble, et al do, or if you discover a good existing write-up (I haven't searched for one yet), please add a comment to this answer linking to it. I'll do likewise. TIA!

answered Nov 09 '22 07:11

raiph
