How can I substitute in strings in Perl 6 by codepoint rather than by grapheme?

Question

I need to remove diacritical marks from a string using Perl 6. I tried doing this:

my $hum = 'חוּם';
$hum.subst(/<-[\c[HEBREW LETTER ALEF] .. \c[HEBREW LETTER TAV]]>/, '', :g);

I am trying to remove all the characters that are not in the range between HEBREW LETTER ALEF (א) and HEBREW LETTER TAV (ת). I expected the code above to return "חום", but it returns "חם".

I guess that what happens is that by default Perl 6 works by graphemes, considers וּ to be one grapheme, and removes all of it. It's often sensible to work by graphemes, but in my case I need it to work by codepoints.
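A quick way to check that guess is to compare the grapheme count with the codepoint count. This is only a minimal sketch that reuses the string from above; it is not part of the original attempt:

```raku
my $hum = 'חוּם';

say $hum.chars;   # 3 (graphemes)
say $hum.codes;   # 4 (codepoints)

# List each codepoint with its Unicode name:
#   U+05D7 HEBREW LETTER HET
#   U+05D5 HEBREW LETTER VAV
#   U+05BC HEBREW POINT DAGESH OR MAPIQ
#   U+05DD HEBREW LETTER FINAL MEM
say sprintf('U+%04X %s', $_, uniname($_)) for $hum.ords;
```

The vav and the point after it count as one character for .chars, which is consistent with the regex removing both at once.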

I tried to find an adverb that would make it work by codepoint, but couldn't find it. Perhaps there is also a way in Perl 6 to use Unicode properties to exclude diacritics, or to include only letters, but I couldn't find that either.

Thanks!

Christoph · Accepted Answer

My regex-fu is weak, so I'd go with a less magical solution.

First, you can remove all marks via samemark:

'חוּם'.samemark('a')

Second, you can decompose the graphemes via .NFD and operate on individual codepoints (e.g. keeping only those with the property Grapheme_Base), then recompose the string:

Uni.new('חוּם'.NFD.grep(*.uniprop('Grapheme_Base'))).Str
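The same decompose, filter, recompose idea can be spelled out step by step. Note that this sketch swaps in a different test: instead of Grapheme_Base it drops every codepoint whose General_Category starts with M (the mark categories Mn, Mc and Me). That choice, and the variable names, are assumptions for illustration only:

```raku
my $hum = 'חוּם';

# 1. Decompose into codepoints (NFD returns a Uni; .list gives the integers).
my @codepoints = $hum.NFD.list;

# 2. Keep codepoints that are not marks. uniprop() with a single Int
#    argument returns the General_Category, e.g. 'Lo' or 'Mn'.
my @kept = @codepoints.grep({ !uniprop($_).starts-with('M') });

# 3. Rebuild a string from what is left.
my $stripped = Uni.new(@kept).Str;

say $stripped;   # חום
```

Splitting the pipeline up like this makes it easy to print @codepoints and @kept when the filter does something unexpected.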

In case of mixed strings, stripping marks from Hebrew characters only could look like this:

$str.subst(:g, /<:Script<Hebrew>>+/, *.Str.samemark('a'));
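For a concrete run of that line, here is a small sketch with a made-up mixed string (the sample text in $str is hypothetical). The Latin part, including its accented letter, should be left alone, while the Hebrew run loses its marks:

```raku
my $str = 'café = חוּם';

# Only runs of Hebrew-script characters are passed to samemark.
my $clean = $str.subst(:g, /<:Script<Hebrew>>+/, *.Str.samemark('a'));

say $clean;
say $clean.codes;   # one codepoint fewer than $str.codes
```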

Håkon Hægland · Answer

Here is a simple approach:

my $hum = 'חוּם';
my $min = "\c[HEBREW LETTER ALEF]".ord;
my $max = "\c[HEBREW LETTER TAV]".ord;
my @ords;
for $hum.ords {
    @ords.push($_) if $min ≤ $_ ≤ $max; 
}
say join('', @ords.map: { .chr });

Output:

חום
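The loop can also be written as a single chain, if that reads better to you. This is just a compact variation on the code above, with $range as an illustrative name:

```raku
my $hum   = 'חוּם';
my $range = "\c[HEBREW LETTER ALEF]".ord .. "\c[HEBREW LETTER TAV]".ord;

# grep smartmatches each codepoint against the Range, so only
# codepoints between ALEF and TAV survive.
my $letters = $hum.ords.grep($range).map(*.chr).join;

say $letters;   # חום
```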

Tags: regex, unicode, raku

Amir E. Aharoni
