I need to remove diacritical marks from a string using Perl 6. I tried doing this:
my $hum = 'חוּם';
$hum.subst(/<-[\c[HEBREW LETTER ALEF] .. \c[HEBREW LETTER TAV]]>/, '', :g);
I am trying to remove all the characters that are not in the range between HEBREW LETTER ALEF (א) and HEBREW LETTER TAV (ת). I expected the code above to return "חום", but it returns "חם".
I guess that what happens is that by default Perl 6 works by graphemes, considers וּ to be one grapheme, and removes all of it. It's often sensible to work by graphemes, but in my case I need it to work by codepoints.
I tried to find an adverb that would make it work by codepoint, but couldn't find it. Perhaps there is also a way in Perl 6 to use Unicode properties to exclude diacritics, or to include only letters, but I couldn't find that either.
Thanks!
My regex-fu is weak, so I'd go with a less magical solution.
First, you can remove all marks via samemark:
'חוּם'.samemark('a')
Second, you can decompose the graphemes via .NFD and operate on individual codepoints (e.g. only keeping values with the property Grapheme_Base), and then recompose the string:
Uni.new('חוּם'.NFD.grep(*.uniprop('Grapheme_Base'))).Str
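To see why the decomposition step matters, here is a small sketch (the variable name and comments are mine, not part of the answer above). It compares the grapheme count with the codepoint count and lists the decomposed codepoints with their General_Category. For this string the grapheme count should come out smaller than the codepoint count, since the vav and its dagesh are treated as one grapheme:

```raku
my $hum = 'חוּם';

# .chars counts graphemes, .codes counts codepoints.
say $hum.chars;
say $hum.codes;

# .NFD yields the decomposed codepoints as integers;
# print each one in hex together with its General_Category.
for $hum.NFD.list -> $cp {
    say $cp.fmt('U+%04X'), ' ', $cp.uniprop;
}
```

The combining mark shows up as its own entry in the NFD list, which is what lets grep drop it without touching the base letter.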
In case of mixed strings, stripping marks from Hebrew characters only could look like this:
$str.subst(:g, /<:Script<Hebrew>>+/, *.Str.samemark('a'));
Here is a simple approach:
my $hum = 'חוּם';
my $min = "\c[HEBREW LETTER ALEF]".ord;
my $max = "\c[HEBREW LETTER TAV]".ord;
my @ords;
for $hum.ords {
@ords.push($_) if $min ≤ $_ ≤ $max;
}
say join('', @ords.map: { .chr });
Output:
חום
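The same filter can probably be written more compactly. This is only a sketch of the loop above, wrapped in a sub whose name (keep-hebrew-letters) is made up for illustration; it keeps exactly the codepoints from ALEF to TAV and drops everything else:

```raku
# Hypothetical helper: keep only codepoints in the ALEF..TAV range.
sub keep-hebrew-letters(Str $text --> Str) {
    my $range = "\c[HEBREW LETTER ALEF]".ord .. "\c[HEBREW LETTER TAV]".ord;
    # .ords yields codepoints rather than graphemes, so a base letter
    # and its combining marks are tested separately.
    $text.ords.grep({ $_ ~~ $range }).map(*.chr).join;
}

say keep-hebrew-letters('חוּם');
```

Like the loop, this also removes spaces, punctuation and any non-Hebrew characters, so it suits strings that contain only Hebrew text.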