
Negated Named Regex, or Character Class Interpolation in Raku

Tags: regex, raku
I'm trying to parse a quoted string. Something like this:

say '"in quotes"' ~~ / '"' <-[ " ]> * '"'/;

(From https://docs.raku.org/language/regexes, "Enumerated character classes and ranges".) But I want more than one type of quote. Something like this made-up syntax, which doesn't work:

  token attribute_value { <quote> ($<-quote>) $<quote> };
  token quote           { <["']> };

I found this discussion which is another approach, but it didn't seem to go anywhere: https://github.com/Raku/problem-solving/issues/97. Is there any way of doing this kind of thing? Thanks!

Update 1

I was not able to get @user0721090601's "multi token" solution to work. My first attempt yielded:

$ ./multi-token.raku 
No such method 'quoted_string' for invocant of type 'QuotedString'
  in block <unit> at ./multi-token.raku line 16

After doing some research I added proto token quoted_string {*}:

#!/usr/bin/env raku

use Grammar::Tracer;

grammar QuotedString {
  proto token quoted_string {*}
  multi token quoted_string:sym<'> { <sym> ~ <sym> <-[']> }
  multi token quoted_string:sym<"> { <sym> ~ <sym> <-["]> }
  token quote         { <["']> }
}

my $string = '"foo"';

my $quoted-string = QuotedString.parse($string, :rule<quoted_string>);
say $quoted-string;

$ ./multi-token.raku 
quoted_string
* FAIL
(Any)

I'm still learning Raku, so I could be doing something wrong.

Update 2

D'oh! Thanks to @raiph for pointing this out. I forgot to put a quantifier on <-[']> and <-["]>. That's what I get for copy/pasting without thinking! Works fine when you do it right:

#!/usr/bin/env raku

use Grammar::Tracer;

grammar QuotedString {
  proto token quoted_string (|) {*}
  multi token quoted_string:sym<'> { <sym> ~ <sym> <-[']>+ }
  multi token quoted_string:sym<"> { <sym> ~ <sym> <-["]>+ }
  token quote         { <["']> }
}

my $string = '"foo"';

my $quoted-string = QuotedString.parse($string, :rule<quoted_string>);
say $quoted-string;

Update 3

Just to put a bow on this...

#!/usr/bin/env raku

grammar NegativeLookahead {
  token quoted_string { <quote> $<string>=([<!quote> .]+) $<quote> }
  token quote         { <["']> }
}

grammar MultiToken {
  proto token quoted_string (|) {*}
  multi token quoted_string:sym<'> { <sym> ~ <sym> $<string>=(<-[']>+) }
  multi token quoted_string:sym<"> { <sym> ~ <sym> $<string>=(<-["]>+) }
}

use Bench;

my $string = "'foo'";

my $bench = Bench.new;
$bench.cmpthese(10000, {
  negative-lookahead =>
    sub { NegativeLookahead.parse($string, :rule<quoted_string>); },
  multi-token        =>
    sub { MultiToken.parse($string, :rule<quoted_string>); },
});

$ ./bench.raku
Benchmark: 
Timing 10000 iterations of multi-token, negative-lookahead...
multi-token: 0.779 wallclock secs (0.759 usr 0.033 sys 0.792 cpu) @ 12838.058/s (n=10000)
negative-lookahead: 0.912 wallclock secs (0.861 usr 0.048 sys 0.909 cpu) @ 10967.522/s (n=10000)
O--------------------O---------O-------------O--------------------O
|                    | Rate    | multi-token | negative-lookahead |
O====================O=========O=============O====================O
| multi-token        | 12838/s | --          | -20%               |
| negative-lookahead | 10968/s | 25%         | --                 |
O--------------------O---------O-------------O--------------------O

I'll be going with the "multi token" solution. Thanks everyone!

asked Dec 05 '20 23:12 by JustThisGuy


2 Answers

There are a few different approaches that you can take — which one is best will probably depend on the rest of the structure you're employing.

But first an observation on your current solution and why opening it up to other quote characters won't work this way. Consider the string 'value". Should that parse? The structure you laid out actually would match it! That's because each <quote> token will match either a single or double quote.
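To see the mismatch concretely, here is a minimal sketch. The grammar name Loose and its TOP token are made up for illustration; only the quote token comes from the question.

```raku
# Hypothetical grammar mirroring the structure in the question:
# an opening <quote>, some non-quote characters, a closing <quote>.
grammar Loose {
    token TOP   { <quote> <-["']>* <quote> }
    token quote { <["']> }
}

# Each <quote> is matched independently, so nothing ties the closing
# quote to the opening one.
my $mismatched = so Loose.parse(Q['value"]);   # opens with ', closes with "
my $matched    = so Loose.parse(Q["value"]);   # opens and closes with "

say "mismatched quotes parse: $mismatched";
say "matched quotes parse:    $matched";
```

Both calls succeed, which is the problem the rest of this answer works around.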

Dealing with the inner

The simplest solution is to make your inner part a non-greedy wildcard:

<quote> (.*?) <quote>

This will stop the match as soon as you reach a quote again. Also note the alternative syntax using a tilde, which lets the two terminal bits sit closer together:

<quote> ~ <quote> (.*?)

Your initial attempt wanted to use a sort of non-match. This does exist in the form of an assertion, <!quote>, which will fail if a <quote> is found (and <quote> needn't be just a character; it can be anything arbitrarily complex). It doesn't consume anything, though, so you need to provide that separately. For instance:

[<!quote> .]*

This checks that the next character is NOT a quote, and then consumes that character.
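As a rough, self-contained illustration of the lookahead form (the lexical quote token and the $<inner> capture name are assumptions made for this sketch):

```raku
# A lexically scoped token can be called as <quote> inside a regex
# in the same scope.
my token quote { <["']> }

# [<!quote> .]* consumes one character at a time, but only while the
# next character is not a quote.
my $m = Q["in quotes" trailing] ~~ / <quote> $<inner>=[ [<!quote> .]* ] <quote> /;

say ~$m<inner>;
```

Note that this still accepts mismatched quotes; it only deals with the inner part.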

Lastly, you could take either of the two approaches and use a <content> token that handles the inside. This is actually a great approach if you intend to do more complex things later (e.g. escape characters).

Avoiding a mismatch

As I noted, your solution would parse mismatched quotes. So we need to have a way to ensure that the quote we are (not) matching is the same as the start one. One way to do this is using a multi token:

proto token attribute_value (|) { * }
multi token attribute_value:sym<'> { <sym> ~ <sym> <-[']>* }
multi token attribute_value:sym<"> { <sym> ~ <sym> <-["]>* }

(Using the actual token <sym> is not required; you could write it as { \' <-[']>* \' } if you wanted.)

Another way you could do this is by passing a parameter (either literally, or via dynamic variables). For example, you could write attribute_value as

token attribute_value {
    $<start-quote>=<quote>      # your actual start quote
    :my $*end-quote;            # define the variable in the regex scope
    { $*end-quote = ... }       # determine the requisite end quote (e.g. ” for “)
    <attribute_value_contents>  # handle actual content
    $*end-quote                 # fancy end quote
}

token attribute_value_contents {
    # We have access to $*end-quote here, so we can use
    # any ONE of the techniques we've described before
    # (they are alternatives, not a sequence)
    # (a) using a look ahead
    [<!before $*end-quote> .]*
    # (b) being lazy (the easier); the closing quote is matched by
    #     the caller, so stop lazily just before it
    .*? <?before $*end-quote>
    # (c) using another token (described below)
    <attr_value_content_char>+
}

I mention the last one because you can even further delegate if you ultimately decide to allow for escape characters. For example, you could then do

proto token attr_value_content_char (|) { * }
multi token attr_value_content_char:sym<escaped> { \\ $*end-quote }
multi token attr_value_content_char:sym<literal> { . <?{ $/ ne $*end-quote }> }

But if that's overkill for what you're doing, ah well :-)
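Putting the pieces together, a minimal runnable sketch might look like the following. The grammar name AttrValue, the shortened token names, and the rule that a backslash escapes any single character are assumptions for illustration, not part of the code above.

```raku
# Sketch only: names and the escape rule are assumptions.
grammar AttrValue {
    token TOP {
        :my $*end-quote;                  # visible to the tokens called below
        $<start>=[ <["']> ]               # opening quote
        { $*end-quote = ~$<start> }       # closing quote must be the same character
        <content>
        $*end-quote
    }
    token content { <char>* }

    proto token char {*}
    # A backslash followed by any character, including the end quote.
    token char:sym<escaped> { \\ . }
    # Any character other than a backslash or the current end quote.
    token char:sym<literal> { <-[\\]> <?{ ~$/ ne $*end-quote }> }
}

my $ok  = AttrValue.parse(Q['it\'s']);    # escaped quote inside single quotes
my $bad = AttrValue.parse(Q['oops"]);     # mismatched quotes

say $ok ?? "content: $ok<content>" !! 'no match';
say $bad ?? 'mismatched input parsed' !! 'mismatched input rejected';
```

Because the end quote lives in a dynamic variable, supporting more quote pairs would only mean changing how $*end-quote is derived from the opening quote.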

Anyway, there are probably other ways that didn't come to mind and that others can think of, but this should hopefully put you on the right path. (Also, some of this code is untested, so there may be slight errors; apologies for that.)

answered Nov 04 '22 05:11 by user0721090601


Assuming that you just want to match the same quote character again:

token attribute-value { <string> }

token string {
  # match <quote> and expect to end with "$<quote>"
  <quote> ~ "$<quote>"

  [
    # update match structure in $/ otherwise "$<quote>" won't work
    {}

    <!before "$<quote>"> # next character isn't the same as $<quote>

    .    # any character

  ]*     # any number of times
}

token quote { <["']> }

For anything more complex use something like the $*end-quote dynamic variable from the earlier answer.

answered Nov 04 '22 04:11 by Brad Gilbert