Using .subst with a partial regex match

Question

my $book1 = "Don Quixote- Miguel de Cervantes";
my $book2 = "Les Misérables -Victor Hugo";
my $book3 = "War and Peace - Leo Tolstoy";

I want to use .subst to change "- " to " - " in $book1 and " -" to " - " in $book2. The problem is that I can't find the right regex to use with .subst. I could use something other than a regex, but I would like to use .subst. I can use a different regex for each string, but both should leave the " - " in $book3 untouched.

Sorry for the probably basic question. I've been trying different things but I always destroy part of the text.

chenyf · Accepted Answer

You can use the trans method:

my $book1 = "Don Quixote- Miguel de Cervantes";
my $book2 = "Les Misérables -Victor Hugo";
my $book3 = "War and Peace - Leo Tolstoy";

for ($book1, $book2, $book3) -> $b {
    say $b.trans([/<wb> '- '/, /' -' <wb>/] => [' - ']);
}

<wb> matches at a word boundary.
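
Since the question asks for .subst specifically, here is a minimal sketch of the same word-boundary idea using two chained .subst calls. The helper name fix-dash is made up for illustration and is not part of the answer above.

```raku
my @books =
    "Don Quixote- Miguel de Cervantes",
    "Les Misérables -Victor Hugo",
    "War and Peace - Leo Tolstoy";

# Hypothetical helper: apply each of the two word-boundary patterns
# in turn. A title that matches neither pattern is returned unchanged.
sub fix-dash(Str $title) {
    $title.subst(/ <wb> '- ' /, ' - ').subst(/ ' -' <wb> /, ' - ')
}

my @fixed-wb = @books.map(&fix-dash);
.say for @fixed-wb;
```

Each .subst call replaces only the first match, which is enough for titles with a single separator.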

raiph · Answer

TL;DR Another option to consider is using the <( and )> capture markers to pick out just the bit you want to replace.

A "literal" interpretation of your Q

Matching strictly per your examples:

/   \C[space]   <(   '- '   |   ' -'   )>   \C[space]   /

The syntax \c[...] specifies one or more characters by using their Unicode names inside the square brackets (in this case the classic ASCII space character).¹

In this pattern I've used \C[...] (uppercase C, not lowercase c). There is a range of Raku "backslash" atoms and they all have lowercase and uppercase variants, where the uppercase variant matches any character except the one(s) matched by the lowercase variant. So \C[space] matches any character other than the ASCII space character. See \c / \C for more info.
The <( capture marker marks the start point of the regex's capture. Likewise )> marks the endpoint.

Without them, when the pattern matches, the whole match would be captured, which would include whatever non-space character matches the \C[space] atom. We don't want that, so we use these markers to restrict what we capture.
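
To see what the markers change, here is a small sketch that uses a simplified, single-branch version of the pattern (only the '- ' case) and compares the match with and without the markers. The variable names are illustrative.

```raku
my $book1 = "Don Quixote- Miguel de Cervantes";

# Without markers the neighbouring non-space characters are part of
# the match, so .subst would overwrite them too.
my $whole = $book1 ~~ / \C[space] '- ' \C[space] /;
say $whole.Str.raku;   # "e- M"

# With markers the same characters must still be present, but only
# the text between <( and )> ends up in the match.
my $part = $book1 ~~ / \C[space] <( '- ' )> \C[space] /;
say $part.Str.raku;    # "- "
say $part.from;        # 11
```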

Btw, each marker is independent. The above pattern matches \C[space] '- ' or ' -' \C[space]. If the pattern to the left of the | matches, only the <( has an impact: it omits whatever matched \C[space] and captures until the end of the match, which for this branch is at the |. If the pattern to the right matches, capturing starts at the beginning of that branch, immediately after the |, and ends at the )>.
The | is Raku's parallel (aka "longest token match" -- LTM) pattern alternation operator, an alternative to the traditional sequential pattern alternation operator (which in Raku is written ||). In this case the set of substrings that the two operators will and won't match is the same, so it makes no difference which is used. But | is shorter than ||; when the match set is the same it's typically faster; and when the match sets are different it's often | that's desirable. So I use it by default unless I know I need the traditional sequential alternation logic (try pattern on left of || first; if that fails, try the pattern on the right of the ||).
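
Putting the literal pattern to work with .subst might look like the sketch below. The variable names are mine; the pattern is the one shown above.

```raku
my @books =
    "Don Quixote- Miguel de Cervantes",
    "Les Misérables -Victor Hugo",
    "War and Peace - Leo Tolstoy";

# Only the text between <( and )> is replaced. The neighbouring
# non-space character is required for a match but is left alone.
my $literal = / \C[space] <( '- ' | ' -' )> \C[space] /;

my @fixed = @books.map: *.subst($literal, ' - ');
.say for @fixed;
# Don Quixote - Miguel de Cervantes
# Les Misérables - Victor Hugo
# War and Peace - Leo Tolstoy
```

The third title has a space on both sides of the hyphen, so neither branch matches and .subst returns it unchanged.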

A "per its spirit?" interpretation of your Q

Matching more flexibly regarding whitespace:

/   \S   <(   '-' \s+   |   \s+ '-'   )>   \S   /

The \S atoms match any character that is not categorized by Unicode as being a whitespace character. (I use Raku, or tools such as this character property lookup web page, to explore what Unicode makes of a character.)
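
A quick sketch of the flexible pattern with .subst, using made-up variants of the titles (extra spaces, a tab) to show the looser whitespace handling:

```raku
my @variants =
    "Don Quixote-   Miguel de Cervantes",   # several spaces after the hyphen
    "Les Misérables\t-Victor Hugo",         # a tab before the hyphen
    "War and Peace - Leo Tolstoy";          # already well formed

# \s+ swallows any run of whitespace next to the hyphen; the \S on
# the other side of the hyphen keeps " - " from matching.
my $flexible = / \S <( '-' \s+ | \s+ '-' )> \S /;

my @normalized = @variants.map: *.subst($flexible, ' - ');
.say for @normalized;
# Don Quixote - Miguel de Cervantes
# Les Misérables - Victor Hugo
# War and Peace - Leo Tolstoy
```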

Comparing \C[space], \S, and <wb>:
- \C[space] matches any character, including whitespace characters, with the sole exception of an ASCII space. My guess is it'll be the fastest of the three.
- \S matches any non-whitespace. My guess is it'll be faster than <wb>.
- <wb> matches between characters. Also it'll match before the first character in a string, and after the last one. So @chenyf's pattern would match and change '- foo...' to ' - foo...' and '...bar -' to '...bar - ' whereas the patterns with \C[space] or \S would not match at the start/end of those strings.

The \s+ atoms match one or more whitespace characters.
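
The difference between the atoms compared above is easiest to see on single characters. This is only a sketch of the matching behaviour, not a benchmark of the speed guesses.

```raku
# \C[space] rejects only the ASCII space; a tab still matches.
say so "\t" ~~ / \C[space] /;   # True
say so " "  ~~ / \C[space] /;   # False

# \S rejects every whitespace character, tabs included.
say so "\t" ~~ / \S /;          # False
say so "x"  ~~ / \S /;          # True

# <wb> is zero-width: it asserts a position and consumes nothing.
my $m = "ab" ~~ / 'b' <wb> /;
say $m.Str;                     # b
```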

Footnotes

¹ The naming is case insensitive. Multiple characters are separated by commas. \c[...] also works in a double quoted string (but not \C[...]).
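
A short sketch of the footnote's points about \c[...] in double quoted strings:

```raku
# Names are case insensitive.
say "\c[space]" eq "\c[SPACE]";   # True
say "a\c[space]b";                # a b

# Several characters, separated by commas.
my $ab = "\c[LATIN CAPITAL LETTER A, LATIN CAPITAL LETTER B]";
say $ab;                          # AB
```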

Tags:

raku

Mariano R.
