How to negate/subtract regexes (not only character classes) in Perl 6?

Tags:

raku

It's possible to make a conjunction, so that the string matches 2 or more regex patterns.

> "banana" ~~ m:g/ . a && b . /
(「ba」)

Also, it's possible to negate a character class: if I want to match only consonants, I can take all the letters and subtract the character class of vowels:

> "camelia" ~~ m:g/ <.alpha> && <-[aeiou]> /
(「c」 「m」 「l」)

But what if I need to negate/subtract not a character class, but a regex of any length? Something like this:

> "banana" ~~ m:g/ . **3 && NOT ban / # doesn't work
(「ana」)
Eugene Barsky asked Nov 20 '17 16:11


2 Answers

TL;DR Moritz's answer covers some important issues. This answer focuses on matching sub-strings per Eugene's comment ("I want to find substring(s) that match regex R, but don't match regex A.").


Write an assertion that says you are NOT sitting immediately before the regex you don't want to match and then follow that with the regex you do want to match:

say "banana" ~~ m:g/ <!before ban> . ** 3 / # (「ana」)

The before assertion is a "zero-width" assertion: it only inspects the input at the current position and never consumes any of it. Because we've written !before rather than just before, the assertion succeeds precisely when ban does NOT match here, and in that case the matching position is not moved.

(Of course, if such an assertion fails and there's no alternative pattern that matches at the current match position, the match engine then steps forward one character position.)
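To make the stepping behaviour concrete, here is a minimal sketch that runs the same pattern with and without the assertion. The input "bandana" and the variable names are my own choices for illustration, not something from the question:

```raku
# Hypothetical input; "bandana" is chosen so that two chunks survive the filter.
my $text = "bandana";

# Every non-overlapping 3-character chunk, with no restriction.
my @all = ($text ~~ m:g/ . ** 3 /).map(*.Str);

# The same chunks, but refusing to start a match where "ban" begins.
# At position 0 the assertion fails, so the engine retries from position 1.
my @filtered = ($text ~~ m:g/ <!before ban> . ** 3 /).map(*.Str);

say @all;       # [ban dan]
say @filtered;  # [and ana]
```

Note that the filtered run does not simply drop "ban" from the first list: because the engine steps forward by one character after the failed assertion, the remaining chunks are cut at different offsets.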


It's possible that you want the patterns in the opposite order, with the positive match first and the negative second, as you showed in your question. (Perhaps the positive match is faster than the negative, so reversing their order will speed up the match.)

One way that will work for fairly simple patterns is using a negative after assertion:

say "banana" ~~ m:g/ . ** 3 <!after ban> / # (「ana」)

However, if the negative pattern is sufficiently complex you may need to use this formulation:

say "banana" ~~ m:g/ . ** 3 && <!before ban> .*? / # (「ana」)

This inserts a && regex conjunction operator. Presuming the LHS pattern succeeds, && resets the matching position and tries the RHS from the same starting point, which is why the RHS now starts with <!before ban> rather than <!after ban>. It also requires the RHS to match the same length of input as the LHS, which is why the <!before ban> is followed by the .*? "padding".
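As a sketch of where the conjunction form earns its keep, the negative side below is an alternation rather than a single literal. The input "bananas" and the extra excluded word nan are assumptions added purely for illustration:

```raku
# Hypothetical input and exclusion list, not taken from the question.
my $text = "bananas";

# LHS: any 3 characters.
# RHS: restarted at the same position, it must not be looking at "ban" or
# "nan"; the frugal .*? then pads the RHS out to the length the LHS matched.
my @chunks = ($text ~~ m:g/ . ** 3 && <!before ban | nan> .*? /).map(*.Str);

say @chunks;
```

The positive pattern stays on the left exactly as in the question, and everything about what to exclude lives on the right of the &&.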

raiph answered Oct 24 '22 17:10


What does it even mean to "negate" a regex?

In the computer science definition, a regex always has to match a whole string, and in that setting negation is easy to define. But by default, regexes in Perl 6 search, so they don't have to match the whole string. This means you have to be careful to define what you mean by "negate".

If by negation of a regex A you mean a regex that matches whenever A does not match a whole string, and vice versa, you can indeed work with <!before ...>, but you need to be careful with anchoring: / ^ <!before A $ > .* / is this exact negation.
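A minimal sketch of this first definition, assuming A is just the literal ban (the variable name is hypothetical):

```raku
# A is assumed to be the literal "ban"; substitute any regex for it.
my $not-exactly-ban = / ^ <!before ban $ > .* /;

say so "ban"    ~~ $not-exactly-ban;   # False: A matches the whole string
say so "banana" ~~ $not-exactly-ban;   # True:  A matches only a prefix
say so ""       ~~ $not-exactly-ban;   # True:  A does not match the empty string
```

The $ inside the assertion is what makes this a whole-string test; without it, "banana" would be rejected as well.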

If by negation of a regex A you mean "only match if A matches nowhere in the string", you have to use something like / ^ [<!before A> .]* $ /.
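And a sketch of this second definition, again assuming A is the literal ban and using a hypothetical variable name:

```raku
# A is assumed to be the literal "ban"; substitute any regex for it.
# Each character is consumed only if A does not start at that position.
my $ban-nowhere = / ^ [ <!before ban> . ]* $ /;

say so "canal"  ~~ $ban-nowhere;   # True:  "ban" occurs nowhere
say so "banana" ~~ $ban-nowhere;   # False: "ban" at the start
say so "cabana" ~~ $ban-nowhere;   # False: "ban" in the middle
```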

If you have another definition of negation in mind, please share it.

moritz answered Oct 24 '22 17:10