Two related questions.
Perl 6 is smart enough to treat a grapheme as one character, whether it is a single Unicode codepoint (like ä, U+00E4) or a base character followed by one or more combining marks (like p̄ and ḏ̣). This little piece of code
my @symb;
@symb.push("ä");
@symb.push("p" ~ 0x304.chr); # "p̄"
@symb.push("ḏ" ~ 0x323.chr); # "ḏ̣"
say "$_ has {$_.chars} character" for @symb;
gives the following output:
ä has 1 character
p̄ has 1 character
ḏ̣ has 1 character
But sometimes I would like to be able to do the following.
1) Remove diacritics from ä. So I need some method like
"ä".mymethod → "a"
2) Split "combined" symbols into parts, i.e. split p̄ into p and Combining Macron U+0304. E.g. something like the following in bash:
$ echo p̄ | grep . -o | wc -l
2
Perl 6 has great Unicode processing support in the Str class. To do what you are asking in (1), you can use the samemark method/routine.
Per the documentation:
multi sub samemark(Str:D $string, Str:D $pattern --> Str:D)
method samemark(Str:D: Str:D $pattern --> Str:D)
Returns a copy of $string with the mark/accent information for each character changed such that it matches the mark/accent of the corresponding character in $pattern. If $string is longer than $pattern, the remaining characters in $string receive the same mark/accent as the last character in $pattern. If $pattern is empty no changes will be made.
Examples:
say 'åäö'.samemark('aäo'); # OUTPUT: «aäo»
say 'åäö'.samemark('a'); # OUTPUT: «aao»
say samemark('Pêrl', 'a'); # OUTPUT: «Perl»
say samemark('aöä', ''); # OUTPUT: «aöä»
This can be used both to remove marks/diacritics from letters, as well as to add them.
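As a minimal sketch of the "remove" direction: passing a one-character pattern that carries no marks strips the marks from every character, following the rule quoted above that remaining characters take the mark of the last pattern character. The helper name strip-marks is my own and is not part of Perl 6.

```raku
# Hypothetical helper (the name is an assumption, not a built-in):
# a mark-free one-character pattern removes the marks from the whole string.
sub strip-marks(Str:D $s --> Str:D) {
    $s.samemark('a')
}

say strip-marks('ä');     # OUTPUT: «a»
say strip-marks('Pêrl');  # OUTPUT: «Perl»
```

This gives the "ä".mymethod behaviour asked for in (1) without touching codepoints directly.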
For (2), there are a few ways to do this (TIMTOWTDI). If you want all the codepoints in a string, you can use the ords method, which returns them as a list-like sequence (a Seq in current Rakudo):
say "p̄".ords; # OUTPUT: «(112 772)»
You can use the uniname method/routine to get the Unicode name for a codepoint:
.uniname.say for "p̄".ords; # OUTPUT: «LATIN SMALL LETTER P␤COMBINING MACRON␤»
or just use the uninames method/routine:
.say for "p̄".uninames; # OUTPUT: «LATIN SMALL LETTER P␤COMBINING MACRON␤»
If you just want the number of codepoints in the string, you can use codes:
say "p̄".codes; # OUTPUT: «2»
This is different from chars, which counts the number of characters (graphemes) in the string:
say "p̄".chars; # OUTPUT: «1»
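One caveat, shown as a small sketch: ords and codes work on the string as Perl 6 stores it, which is in composed form, so a precomposed letter such as ä (U+00E4) still counts as a single codepoint. To see the base letter and the combining mark separately you need the decomposed form from .NFD. The variable names below are only for illustration.

```raku
my $a-umlaut = "\x[E4]";     # ä, which has a precomposed codepoint
my $p-macron = "p\x[304]";   # p̄, which has no precomposed codepoint

for $a-umlaut, $p-macron -> $s {
    # chars counts graphemes, codes counts codepoints of the composed string,
    # .NFD.elems counts codepoints after canonical decomposition
    say $s, ' chars=', $s.chars, ' codes=', $s.codes, ' nfd=', $s.NFD.elems;
}
# OUTPUT: «ä chars=1 codes=1 nfd=2␤p̄ chars=1 codes=2 nfd=2␤»
```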
Also see @hobbs' answer using NFD.
This is the best I was able to come up with from the docs. There might be a simpler way, but I'm not sure.
my $in = "Él está un pingüino";
my $stripped = Uni.new($in.NFD.grep: { !uniprop($_, 'Grapheme_Extend') }).Str;
say $stripped; # El esta un pinguino
The .NFD method converts the string to normalization form D (decomposed), which separates graphemes out into base codepoints and combining codepoints whenever possible. The grep then returns a list of only those codepoints that don't have the "Grapheme_Extend" property, i.e. it removes the combining codepoints. The Uni.new(...).Str then assembles those codepoints back into a string.
You can also put these pieces together to answer your second question; e.g.:
$in.NFD.map: { Uni.new($_).Str }
will return a list of 1-character strings, each with a single decomposed codepoint, or
$in.NFD.map(&uniname).join("\n")
will make a nice little Unicode debugger.
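A slightly fuller sketch of that debugger idea, printing the codepoint number next to each name. The sub name dump-codepoints and the U+XXXX output format are my own choices for illustration, not anything built in.

```raku
# Hypothetical helper: one "U+XXXX NAME" line per decomposed codepoint.
sub dump-codepoints(Str:D $s) {
    $s.NFD.list.map: -> $cp { sprintf 'U+%04X %s', $cp, uniname($cp) }
}

.say for dump-codepoints("p\x[304]");
# OUTPUT: «U+0070 LATIN SMALL LETTER P␤U+0304 COMBINING MACRON␤»
```

Because it goes through .NFD, a precomposed character such as ä is also shown as its base letter plus the combining mark.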
I can't say this is better or faster, but I strip diacritics in this way:
my $s = "åäö";
say $s.comb.map({.NFD[0].chr}).join; # output: "aao"