
How to enumerate all Unicode canonically equivalent sequences in Perl?

Tags: unicode, perl

Does there exist a standard Perl module or function that, given a Unicode Combining Character Sequence (or, more generally, an arbitrary Unicode text string), will generate a list of all canonically equivalent strings?

For example, if given the character U+1EAD, I'd like to get back a list of all these canonically equivalent sequences:

0061 0302 0323
0061 0323 0302
00E2 0323
1EA1 0302
1EAD

(I don't particularly care whether the interface is in terms of arrays of USVs or utf strings.)

asked Jun 21 '11 00:06 by Bob Hallissy


1 Answer

Is this an XY problem? If you want to compare or match two Unicode strings and you're worried that different ways of encoding the accented characters would create false negatives, then the best approach is to normalize both strings with the same normalization function from Unicode::Normalize (NFD or NFC, for example) before doing the comparison or match.
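A minimal sketch of that comparison, using the five sequences from the question. NFD and NFC are real exports of Unicode::Normalize; the helpers canon_eq and hex_cps are names made up for this illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Hypothetical helper: true if two strings are canonically equivalent,
# i.e. they have the same canonical decomposition.
sub canon_eq {
    my ($x, $y) = @_;
    return NFD($x) eq NFD($y);
}

# Hypothetical helper: render a string as space-separated hex code points.
sub hex_cps {
    my ($str) = @_;
    return join ' ', map { sprintf '%04X', ord } split //, $str;
}

# The five sequences listed in the question.
my @forms = (
    "\x{0061}\x{0302}\x{0323}",
    "\x{0061}\x{0323}\x{0302}",
    "\x{00E2}\x{0323}",
    "\x{1EA1}\x{0302}",
    "\x{1EAD}",
);

# Count the forms that do NOT compare equal to the first one.
my $mismatches = grep { !canon_eq($_, $forms[0]) } @forms;
print $mismatches ? "not all equivalent\n" : "all canonically equivalent\n";

printf "NFC: %s\n", hex_cps(NFC($forms[0]));
printf "NFD: %s\n", hex_cps(NFD($forms[0]));
```

Normalizing both sides to the same form is usually enough for comparison, so the full list of equivalents is never needed.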

Otherwise it gets a little messy.

You could get the complete character name using charnames::viacode(0x1EAD) (for U+1EAD that is LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW) and find the component parts by splitting the name on WITH|AND. You could then generate every combination of the base character with some of the modifiers precomposed and the rest as combining characters, checking that each precomposed character actually exists. At this point you run into the problem of mapping a modifier as it appears in the full name (e.g. CIRCUMFLEX) to the real name of the combining character (COMBINING CIRCUMFLEX ACCENT). There are probably rules for this, but I don't know them.
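The name-splitting step might look roughly like this. charnames::viacode and charnames::vianame are real functions; guess_combining_mark is a hypothetical helper, and the two name patterns it tries ("COMBINING X" and "COMBINING X ACCENT") are guesses that happen to work for this character, not a documented rule.

```perl
use strict;
use warnings;
use charnames ':full';

# Hypothetical helper: guess the combining mark for a modifier word taken
# from a character name. The two patterns tried here are assumptions.
sub guess_combining_mark {
    my ($modifier) = @_;
    for my $name ("COMBINING $modifier", "COMBINING $modifier ACCENT") {
        my $cp = charnames::vianame($name);
        return $cp if defined $cp;
    }
    return;    # no match under either pattern
}

# LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW
my $name = charnames::viacode(0x1EAD);

# Split into the base character name and the modifier words.
my ($base_name, @modifiers) = split /\s+(?:WITH|AND)\s+/, $name;

my $base  = charnames::vianame($base_name);
my @marks = map { guess_combining_mark($_) } @modifiers;

printf "base:  U+%04X (%s)\n", $base, $base_name;
printf "marks: %s\n", join ' ', map { sprintf 'U+%04X', $_ } @marks;
```

Names that do not follow either pattern are silently dropped by this sketch, which is exactly the weakness described above.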

This would be my naive attempt. There may be better ways of doing this, but so far no one has volunteered the information.
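One alternative that avoids character names entirely is a brute-force search over the decomposition data. This is a different technique from the name-based idea above: starting from the input, it repeatedly tries swapping adjacent characters, composing adjacent pairs with getComposite, and decomposing single characters with getCanon (both real exports of Unicode::Normalize), keeping only candidates whose NFD matches the input's. canonical_equivalents is a hypothetical name, and I have not proven that the search is complete for every input, so treat it as a sketch.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD getCanon getComposite);

# Hypothetical helper (not part of any module): returns the strings reachable
# from $str by adjacent swaps, pairwise composition, or single-character
# decomposition, keeping only those with the same NFD as $str.
sub canonical_equivalents {
    my ($str) = @_;
    my $target = NFD($str);
    my %seen   = ($str => 1);
    my @queue  = ($str);

    while (defined(my $cur = shift @queue)) {
        my @cp = map { ord } split //, $cur;
        my @next;

        for my $i (0 .. $#cp) {
            # Replace one character by its full canonical decomposition.
            my $dec = getCanon($cp[$i]);
            if (defined $dec) {
                my @new = map { chr } @cp;
                $new[$i] = $dec;
                push @next, join '', @new;
            }
            next if $i == $#cp;

            # Swap two adjacent characters.
            my @swap = @cp;
            @swap[$i, $i + 1] = @swap[$i + 1, $i];
            push @next, join '', map { chr } @swap;

            # Compose two adjacent characters, if a composite exists.
            my $comp = getComposite($cp[$i], $cp[$i + 1]);
            if (defined $comp) {
                my @new = @cp;
                splice @new, $i, 2, $comp;
                push @next, join '', map { chr } @new;
            }
        }

        for my $cand (@next) {
            next if $seen{$cand}++;                  # already examined
            next unless NFD($cand) eq $target;       # not equivalent
            push @queue, $cand;
        }
    }

    # %seen also holds rejected candidates, so filter once more.
    return grep { NFD($_) eq $target } sort keys %seen;
}

for my $s (canonical_equivalents("\x{1EAD}")) {
    print join(' ', map { sprintf '%04X', ord } split //, $s), "\n";
}
```

For U+1EAD this should print the five sequences listed in the question. Every kept string has the same NFD as the input, so its length is bounded and the search terminates, but the number of results can grow quickly for strings with many combining marks.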

answered Nov 04 '22 20:11 by mirod