How to enumerate all Unicode canonically equivalent sequences in Perl?

Question

Does there exist a standard Perl module or function that, given a Unicode Combining Character Sequence (or, more generally, an arbitrary Unicode text string), will generate a list of all canonically equivalent strings?

For example, if given the character U+1EAD, I'd like to get back a list of all these canonically equivalent sequences:

0061 0302 0323
0061 0323 0302
00E2 0323
1EA1 0302
1EAD

(I don't particularly care whether the interface is in terms of arrays of USVs or utf strings.)

mirod · Accepted Answer

Is this an XY problem? If you want to compare or match two Unicode strings and you're worried that different ways of encoding accented characters would create false negatives, the best approach is to normalize both strings to the same form (for example NFD or NFC) with one of the normalization functions from Unicode::Normalize before doing the comparison or match.
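As a minimal sketch of that normalization approach, assuming the goal is only to test two strings for canonical equivalence (the helper name canon_eq is made up for this example; NFD is exported by Unicode::Normalize):

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Two spellings of the same character from the question.
my $precomposed = "\x{1EAD}";
my $decomposed  = "\x{0061}\x{0302}\x{0323}";

# Hypothetical helper: true when both strings have the same
# canonical decomposition (NFD), i.e. are canonically equivalent.
sub canon_eq {
    my ($left, $right) = @_;
    return NFD($left) eq NFD($right);
}

print canon_eq($precomposed, $decomposed) ? "equivalent\n" : "different\n";
```

Comparing the NFC forms would work equally well; what matters is that both sides go through the same normalization form.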

Otherwise it gets a little messy.

You could get the complete character name using charnames::viacode(0x1EAD) (for U+1EAD it is LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW) and get the component names by splitting the name on WITH|AND. Then you could generate all combinations (checking that they exist!) of the base character precomposed with some of the modifiers, followed by the remaining modifiers as combining characters. At this point you will run into the problem of matching the modifier names used in the full name (e.g. CIRCUMFLEX) with the real names of the combining characters (COMBINING CIRCUMFLEX ACCENT). There are probably rules for this, but I don't know them.
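A rough sketch of the name-splitting step, for illustration only. The helper combining_for and its fallback of appending ACCENT are guesses made for this example, not a documented naming rule, and splitting on WITH|AND will misfire on names that use those words differently:

```perl
use strict;
use warnings;
use charnames ();

# Full name of U+1EAD: LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW
my $name = charnames::viacode(0x1EAD);

# Split into the base letter name and the modifier names.
my ($base, @modifiers) = split /\s+(?:WITH|AND)\s+/, $name;

# Hypothetical helper: guess the combining character for a modifier name.
# The "ACCENT" suffix fallback is an assumption, not a documented rule.
sub combining_for {
    my ($modifier) = @_;
    for my $guess ("COMBINING $modifier", "COMBINING $modifier ACCENT") {
        my $code = charnames::vianame($guess);   # undef if the name is unknown
        return $code if defined $code;
    }
    return;
}

printf "base: U+%04X\n", charnames::vianame($base);
for my $modifier (@modifiers) {
    my $code = combining_for($modifier);
    printf "%s: %s\n", $modifier,
        defined $code ? sprintf('U+%04X', $code) : 'no match';
}
```

The guessed names still have to be checked against real characters, which is why every lookup tests for undef.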

This would be my naive attempt. There may be better ways of doing this, but so far no one has volunteered the information...
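For comparison, here is a different brute-force sketch that avoids character names altogether. This is not the name-based approach described above: it starts from the NFD form, repeatedly swaps adjacent code points or composes adjacent pairs with getComposite from Unicode::Normalize, and keeps only the candidates whose NFD matches the original. The function name canonical_variants is made up for this example, and the search is not claimed to be exhaustive (singleton decompositions, for instance, are not reached).

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD getComposite);

# Breadth-first search over reorderings and pairwise compositions,
# keeping only strings whose NFD equals the NFD of the input.
sub canonical_variants {
    my ($input) = @_;
    my $target = NFD($input);
    my %seen   = ($target => 1);
    my @queue  = ($target);

    while (defined(my $str = shift @queue)) {
        my @cp = map { ord($_) } split //, $str;
        my @candidates;
        for my $i (0 .. $#cp - 1) {
            # Candidate: swap two adjacent code points.
            my @swapped = @cp;
            @swapped[$i, $i + 1] = @swapped[$i + 1, $i];
            push @candidates, \@swapped;

            # Candidate: replace an adjacent pair with its composite.
            my $composite = getComposite($cp[$i], $cp[$i + 1]);
            if (defined $composite) {
                my @composed = @cp;
                splice @composed, $i, 2, $composite;
                push @candidates, \@composed;
            }
        }
        for my $candidate (@candidates) {
            my $s = join '', map { chr($_) } @$candidate;
            # Discard repeats and anything not canonically equivalent.
            next if $seen{$s} || NFD($s) ne $target;
            $seen{$s} = 1;
            push @queue, $s;
        }
    }
    return sort keys %seen;
}

# Print each variant as space-separated hex code points.
for my $variant (canonical_variants("\x{1EAD}")) {
    print join(' ', map { sprintf '%04X', ord($_) } split //, $variant), "\n";
}
```

For U+1EAD this should yield the five sequences listed in the question. The number of candidates grows quickly with the number of combining marks, so this is only practical for short sequences.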

Tags: unicode, perl

Bob Hallissy
