How can I substitute in strings in Perl 6 by codepoint rather than by grapheme?

I need to remove diacritical marks from a string using Perl 6. I tried doing this:

my $hum = 'חוּם';
$hum.subst(/<-[\c[HEBREW LETTER ALEF] .. \c[HEBREW LETTER TAV]]>/, '', :g);

I am trying to remove all the characters that are not in the range between HEBREW LETTER ALEF (א) and HEBREW LETTER TAV (ת). I expected the code above to return "חום", but it returns "חם".

I guess that what happens is that by default Perl 6 works by graphemes, considers וּ to be one grapheme, and removes all of it. It's often sensible to work by graphemes, but in my case I need it to work by codepoints.

I tried to find an adverb that would make it work by codepoint, but couldn't find it. Perhaps there is also a way in Perl 6 to use Unicode properties to exclude diacritics, or to include only letters, but I couldn't find that either.

Thanks!

Amir E. Aharoni asked Sep 10 '18 13:09


2 Answers

My regex-fu is weak, so I'd go with a less magical solution.

First, you can remove all marks via samemark:

'חוּם'.samemark('a')
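
As a rough illustration of why this works: samemark copies the marks of the pattern onto the string character by character, and the last pattern character is reused for the rest of the string. A pattern of a plain 'a' therefore carries "no marks" across the whole string. The Latin sample string below is only an assumption chosen to make the effect easy to see:

```raku
# Sample input chosen for illustration only
my $accented = 'åäö';

# Pattern as long as the string: marks are copied position by position
say $accented.samemark('aäo');

# One unmarked pattern character: every mark in the string is dropped
say $accented.samemark('a');
```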

Second, you can decompose the graphemes via .NFD and operate on individual codepoints, e.g. keeping only values with the property Grapheme_Base, and then recompose the string:

Uni.new('חוּם'.NFD.grep(*.uniprop('Grapheme_Base'))).Str
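
The same one-liner can be sketched step by step to show where graphemes and codepoints diverge. The variable names are made up for illustration, and the word is spelled out with explicit codepoints (assuming the dagesh is U+05BC) so that the example does not depend on how the text was pasted:

```raku
# חוּם spelled by codepoint: HET, VAV, DAGESH (combining mark), FINAL MEM
my $hum = "\x[05D7]\x[05D5]\x[05BC]\x[05DD]";

say $hum.chars;   # graphemes: VAV + DAGESH count as one
say $hum.codes;   # codepoints: the DAGESH counts on its own

# Step 1: decompose into a list of integer codepoints
my @codepoints = $hum.NFD.list;

# Step 2: keep codepoints that can start a grapheme; combining marks are dropped
my @bases = @codepoints.grep(*.uniprop('Grapheme_Base'));

# Step 3: recompose a string from what is left
my $stripped = Uni.new(@bases).Str;
say $stripped;
```

Here .chars should report 3 and .codes 4, which is the mismatch behind the result in the question.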

In case of mixed strings, stripping marks from Hebrew characters only could look like this:

$str.subst(:g, /<:Script<Hebrew>>+/, *.Str.samemark('a'));

Christoph answered Oct 20 '22 13:10


Here is a simple approach:

my $hum = 'חוּם';
my $min = "\c[HEBREW LETTER ALEF]".ord;
my $max = "\c[HEBREW LETTER TAV]".ord;
my @ords;
for $hum.ords {
    @ords.push($_) if $min ≤ $_ ≤ $max; 
}
say join('', @ords.map: { .chr });

Output:

חום
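
For what it's worth, the same filter can probably be written as a single chain. This is only a sketch of the loop above with grep in place of the explicit push; $letters-only is a name made up here, and the input is spelled out by codepoint (assuming the dagesh is U+05BC):

```raku
# חוּם spelled by codepoint: HET, VAV, DAGESH, FINAL MEM
my $hum = "\x[05D7]\x[05D5]\x[05BC]\x[05DD]";
my $min = "\c[HEBREW LETTER ALEF]".ord;
my $max = "\c[HEBREW LETTER TAV]".ord;

# Keep only codepoints inside ALEF..TAV, then turn them back into a string
my $letters-only = $hum.ords.grep({ $min <= $_ <= $max }).map(*.chr).join;
say $letters-only;
```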

Håkon Hægland answered Oct 20 '22 11:10