Do LATIN CAPITAL LETTER I (U+0049) and ROMAN NUMERAL ONE (U+2160) have unicode compatibility equivalence?

Tags:

unicode

Unicode defines two kinds of equivalence: canonical equivalence and compatibility equivalence. The example in Unicode Technical Annex #15 for compatibility equivalence is SUPERSCRIPT ONE (U+00B9) and DIGIT ONE (U+0031). It doesn't discuss characters that are visually indistinguishable.

I am curious if characters that are visually indistinguishable have compatibility equivalence under the standard.

Thanks..

vy32 asked Jan 12 '12 19:01


3 Answers

ᴇᴅɪᴛ: Added exactly what the original question is looking for at the bottom. This is really cool.


The answer to your question about ʀᴏᴍᴀɴ ɴᴜᴍᴇʀᴀʟ ᴏɴᴇ and ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ɪ is YES. Here’s a quick way to check:

$ perl -Mcharnames=:full -MUnicode::Normalize -le 'print
   NFKD "\N{ROMAN NUMERAL ONE}"  eq  NFKD "\N{LATIN CAPITAL LETTER I}"'
1
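For anyone without Perl at hand, here is a rough Python counterpart of the same check. It is only a sketch using the standard unicodedata module; the helper name nfkd_equal is mine, not part of any library. It also shows that the two characters are not canonically equivalent, only compatibility equivalent.

```python
import unicodedata

def nfkd_equal(a: str, b: str) -> bool:
    """True if the two strings have the same NFKD (compatibility) form."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)

roman_one = "\N{ROMAN NUMERAL ONE}"        # U+2160
latin_i = "\N{LATIN CAPITAL LETTER I}"     # U+0049

# Compatibility equivalence: the NFKD forms match.
print(nfkd_equal(roman_one, latin_i))      # True

# Canonical equivalence: the NFD forms do not match.
nfd_same = unicodedata.normalize("NFD", roman_one) == unicodedata.normalize("NFD", latin_i)
print(nfd_same)                            # False
```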

However, the answer to your question as to whether characters that are visually indistinguishable have compatibility equivalence is most definitely NO!

For example, ᴄʜᴇʀᴏᴋᴇᴇ ʟᴇᴛᴛᴇʀ ɢᴏ (Ꭺ) looks like ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ᴀ (A), but is certainly not NFKD equivalent. Similarly with ɢʀᴇᴇᴋ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ᴀʟᴘʜᴀ (Α) and ᴄʏʀɪʟʟɪᴄ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ᴀ (А) not being NFKD equivalent. There are effectively uncountably many (well, I can’t count them :) such issues. The only code points that are NFKD-equiv to ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ᴀ, for example, are:

U+00041 ‭ A  GC=Lu SC=Latin        LATIN CAPITAL LETTER A
U+01D2C ‭ ᴬ  GC=Lm SC=Latin        MODIFIER LETTER CAPITAL A
U+024B6 ‭ Ⓐ  GC=So SC=Common       CIRCLED LATIN CAPITAL LETTER A
U+0FF21 ‭ A GC=Lu SC=Latin        FULLWIDTH LATIN CAPITAL LETTER A
U+1D400 ‭ 𝐀  GC=Lu SC=Common       MATHEMATICAL BOLD CAPITAL A
U+1D434 ‭ 𝐴  GC=Lu SC=Common       MATHEMATICAL ITALIC CAPITAL A
U+1D468 ‭ 𝑨  GC=Lu SC=Common       MATHEMATICAL BOLD ITALIC CAPITAL A
U+1D49C ‭ 𝒜  GC=Lu SC=Common       MATHEMATICAL SCRIPT CAPITAL A
U+1D4D0 ‭ 𝓐  GC=Lu SC=Common       MATHEMATICAL BOLD SCRIPT CAPITAL A
U+1D504 ‭ 𝔄  GC=Lu SC=Common       MATHEMATICAL FRAKTUR CAPITAL A
U+1D538 ‭ 𝔸  GC=Lu SC=Common       MATHEMATICAL DOUBLE-STRUCK CAPITAL A
U+1D56C ‭ 𝕬  GC=Lu SC=Common       MATHEMATICAL BOLD FRAKTUR CAPITAL A
U+1D5A0 ‭ 𝖠  GC=Lu SC=Common       MATHEMATICAL SANS-SERIF CAPITAL A
U+1D5D4 ‭ 𝗔  GC=Lu SC=Common       MATHEMATICAL SANS-SERIF BOLD CAPITAL A
U+1D608 ‭ 𝘈  GC=Lu SC=Common       MATHEMATICAL SANS-SERIF ITALIC CAPITAL A
U+1D63C ‭ 𝘼  GC=Lu SC=Common       MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL A
U+1D670 ‭ 𝙰  GC=Lu SC=Common       MATHEMATICAL MONOSPACE CAPITAL A
U+1F130 ‭ 🄰  GC=So SC=Common       SQUARED LATIN CAPITAL LETTER A
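A list like the one above can be reproduced by brute force. The following Python sketch (the function name nfkd_equivalents is made up for illustration) walks every code point and keeps those whose NFKD form equals that of the target. The exact result depends on the Unicode version bundled with your Python, so it may differ slightly from the list shown here.

```python
import unicodedata

def nfkd_equivalents(target: str) -> list[int]:
    """Return every code point whose NFKD form equals that of `target`."""
    wanted = unicodedata.normalize("NFKD", target)
    found = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not characters; skip them
        if unicodedata.normalize("NFKD", chr(cp)) == wanted:
            found.append(cp)
    return found

for cp in nfkd_equivalents("A"):
    # Fall back to a placeholder if the code point has no name.
    print(f"U+{cp:05X}  {unicodedata.name(chr(cp), '<unnamed>')}")
```

Calling nfkd_equivalents("I") the same way should give the second list, including ROMAN NUMERAL ONE.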

Similarly, here are the codepoints that are NFKD equiv to the ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ɪ you were looking at:

U+00049 ‭ I  GC=Lu SC=Latin        LATIN CAPITAL LETTER I
U+01D35 ‭ ᴵ  GC=Lm SC=Latin        MODIFIER LETTER CAPITAL I
U+02110 ‭ ℐ  GC=Lu SC=Common       SCRIPT CAPITAL I
U+02111 ‭ ℑ  GC=Lu SC=Common       BLACK-LETTER CAPITAL I
U+02160 ‭ Ⅰ  GC=Nl SC=Latin        ROMAN NUMERAL ONE
U+024BE ‭ Ⓘ  GC=So SC=Common       CIRCLED LATIN CAPITAL LETTER I
U+0FF29 ‭ I GC=Lu SC=Latin        FULLWIDTH LATIN CAPITAL LETTER I
U+1D408 ‭ 𝐈  GC=Lu SC=Common       MATHEMATICAL BOLD CAPITAL I
U+1D43C ‭ 𝐼  GC=Lu SC=Common       MATHEMATICAL ITALIC CAPITAL I
U+1D470 ‭ 𝑰  GC=Lu SC=Common       MATHEMATICAL BOLD ITALIC CAPITAL I
U+1D4D8 ‭ 𝓘  GC=Lu SC=Common       MATHEMATICAL BOLD SCRIPT CAPITAL I
U+1D540 ‭ 𝕀  GC=Lu SC=Common       MATHEMATICAL DOUBLE-STRUCK CAPITAL I
U+1D574 ‭ 𝕴  GC=Lu SC=Common       MATHEMATICAL BOLD FRAKTUR CAPITAL I
U+1D5A8 ‭ 𝖨  GC=Lu SC=Common       MATHEMATICAL SANS-SERIF CAPITAL I
U+1D5DC ‭ 𝗜  GC=Lu SC=Common       MATHEMATICAL SANS-SERIF BOLD CAPITAL I
U+1D610 ‭ 𝘐  GC=Lu SC=Common       MATHEMATICAL SANS-SERIF ITALIC CAPITAL I
U+1D644 ‭ 𝙄  GC=Lu SC=Common       MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL I
U+1D678 ‭ 𝙸  GC=Lu SC=Common       MATHEMATICAL MONOSPACE CAPITAL I
U+1F138 ‭ 🄸  GC=So SC=Common       SQUARED LATIN CAPITAL LETTER I

Notice there’s no ɢʀᴇᴇᴋ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ɪᴏᴛᴀ there, just as one example.

You can’t use NFKD to find lookalikes, and some things that are NFKD equiv don’t look much alike. So you can’t do it that way in the general case. It’s not a problem you can even begin to look at without looking at actual fonts.

I believe ICU has an extended, non-standard property for this, like \p{X-Confusable=A}. I downloaded their datafiles for this, but haven’t played with it much yet.


Update

It turns out that UTS #39, Unicode Security Mechanisms, has exactly what you are looking for. If you fetch its raw, plaintext datafiles, you will be able to determine which code points are potentially confusable with one another.

For example, in the text earlier in this message, I enumerated the code points that were NFKD equivalent to ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ɪ, and pointed out that many potential confusables were missing from that set. That’s because the NFKD mapping is not designed to detect confusables. However, the datafiles from UTS#39 very much are designed for just that very purpose.

To redo my ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ɪ enumeration, updating it to handle all code points that UTS#39 deems mutually confusable with it, we have these, formatted using unichars and sorted in order of the Unicode Collation Algorithm using ucsort:

U+0007C ‭ |  GC=Sm SC=Common       VERTICAL LINE
U+02223 ‭ ∣  GC=Sm SC=Common       DIVIDES
U+0FFE8 ‭ │  GC=So SC=Common       HALFWIDTH FORMS LIGHT VERTICAL
U+00031 ‭ 1  GC=Nd SC=Common       DIGIT ONE
U+1D7CF ‭ 𝟏  GC=Nd SC=Common       MATHEMATICAL BOLD DIGIT ONE
U+1D7D9 ‭ 𝟙  GC=Nd SC=Common       MATHEMATICAL DOUBLE-STRUCK DIGIT ONE
U+1D7E3 ‭ 𝟣  GC=Nd SC=Common       MATHEMATICAL SANS-SERIF DIGIT ONE
U+1D7ED ‭ 𝟭  GC=Nd SC=Common       MATHEMATICAL SANS-SERIF BOLD DIGIT ONE
U+1D7F7 ‭ 𝟷  GC=Nd SC=Common       MATHEMATICAL MONOSPACE DIGIT ONE
U+00049 ‭ I  GC=Lu SC=Latin        LATIN CAPITAL LETTER I
U+0FF29 ‭ I GC=Lu SC=Latin        FULLWIDTH LATIN CAPITAL LETTER I
U+02160 ‭ Ⅰ  GC=Nl SC=Latin        ROMAN NUMERAL ONE
U+02110 ‭ ℐ  GC=Lu SC=Common       SCRIPT CAPITAL I
U+02111 ‭ ℑ  GC=Lu SC=Common       BLACK-LETTER CAPITAL I
U+1D408 ‭ 𝐈  GC=Lu SC=Common       MATHEMATICAL BOLD CAPITAL I
U+1D43C ‭ 𝐼  GC=Lu SC=Common       MATHEMATICAL ITALIC CAPITAL I
U+1D470 ‭ 𝑰  GC=Lu SC=Common       MATHEMATICAL BOLD ITALIC CAPITAL I
U+1D4D8 ‭ 𝓘  GC=Lu SC=Common       MATHEMATICAL BOLD SCRIPT CAPITAL I
U+1D540 ‭ 𝕀  GC=Lu SC=Common       MATHEMATICAL DOUBLE-STRUCK CAPITAL I
U+1D574 ‭ 𝕴  GC=Lu SC=Common       MATHEMATICAL BOLD FRAKTUR CAPITAL I
U+1D5A8 ‭ 𝖨  GC=Lu SC=Common       MATHEMATICAL SANS-SERIF CAPITAL I
U+1D5DC ‭ 𝗜  GC=Lu SC=Common       MATHEMATICAL SANS-SERIF BOLD CAPITAL I
U+1D610 ‭ 𝘐  GC=Lu SC=Common       MATHEMATICAL SANS-SERIF ITALIC CAPITAL I
U+1D644 ‭ 𝙄  GC=Lu SC=Common       MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL I
U+1D678 ‭ 𝙸  GC=Lu SC=Common       MATHEMATICAL MONOSPACE CAPITAL I
U+00196 ‭ Ɩ  GC=Lu SC=Latin        LATIN CAPITAL LETTER IOTA
U+0006C ‭ l  GC=Ll SC=Latin        LATIN SMALL LETTER L
U+0FF4C ‭ l GC=Ll SC=Latin        FULLWIDTH LATIN SMALL LETTER L
U+0217C ‭ ⅼ  GC=Nl SC=Latin        SMALL ROMAN NUMERAL FIFTY
U+02113 ‭ ℓ  GC=Ll SC=Common       SCRIPT SMALL L
U+1D425 ‭ 𝐥  GC=Ll SC=Common       MATHEMATICAL BOLD SMALL L
U+1D459 ‭ 𝑙  GC=Ll SC=Common       MATHEMATICAL ITALIC SMALL L
U+1D48D ‭ 𝒍  GC=Ll SC=Common       MATHEMATICAL BOLD ITALIC SMALL L
U+1D4C1 ‭ 𝓁  GC=Ll SC=Common       MATHEMATICAL SCRIPT SMALL L
U+1D4F5 ‭ 𝓵  GC=Ll SC=Common       MATHEMATICAL BOLD SCRIPT SMALL L
U+1D529 ‭ 𝔩  GC=Ll SC=Common       MATHEMATICAL FRAKTUR SMALL L
U+1D55D ‭ 𝕝  GC=Ll SC=Common       MATHEMATICAL DOUBLE-STRUCK SMALL L
U+1D591 ‭ 𝖑  GC=Ll SC=Common       MATHEMATICAL BOLD FRAKTUR SMALL L
U+1D5C5 ‭ 𝗅  GC=Ll SC=Common       MATHEMATICAL SANS-SERIF SMALL L
U+1D5F9 ‭ 𝗹  GC=Ll SC=Common       MATHEMATICAL SANS-SERIF BOLD SMALL L
U+1D62D ‭ 𝘭  GC=Ll SC=Common       MATHEMATICAL SANS-SERIF ITALIC SMALL L
U+1D661 ‭ 𝙡  GC=Ll SC=Common       MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL L
U+1D695 ‭ 𝚕  GC=Ll SC=Common       MATHEMATICAL MONOSPACE SMALL L
U+001C0 ‭ ǀ  GC=Lo SC=Latin        LATIN LETTER DENTAL CLICK
U+00399 ‭ Ι  GC=Lu SC=Greek        GREEK CAPITAL LETTER IOTA
U+1D6B0 ‭ 𝚰  GC=Lu SC=Common       MATHEMATICAL BOLD CAPITAL IOTA
U+1D6EA ‭ 𝛪  GC=Lu SC=Common       MATHEMATICAL ITALIC CAPITAL IOTA
U+1D724 ‭ 𝜤  GC=Lu SC=Common       MATHEMATICAL BOLD ITALIC CAPITAL IOTA
U+1D75E ‭ 𝝞  GC=Lu SC=Common       MATHEMATICAL SANS-SERIF BOLD CAPITAL IOTA
U+1D798 ‭ 𝞘  GC=Lu SC=Common       MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL IOTA
U+02C92 ‭ Ⲓ  GC=Lu SC=Coptic       COPTIC CAPITAL LETTER IAUDA
U+00406 ‭ І  GC=Lu SC=Cyrillic     CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
U+004C0 ‭ Ӏ  GC=Lu SC=Cyrillic     CYRILLIC LETTER PALOCHKA
U+005D5 ‭ ו  GC=Lo SC=Hebrew       HEBREW LETTER VAV
U+005DF ‭ ן  GC=Lo SC=Hebrew       HEBREW LETTER FINAL NUN
U+007CA ‭ ߊ  GC=Lo SC=Nko          NKO LETTER A
U+02D4F ‭ ⵏ  GC=Lo SC=Tifinagh     TIFINAGH LETTER YAN
U+0A4F2 ‭ ꓲ  GC=Lo SC=Lisu         LISU LETTER I

Nifty though that is, it gets even better. The datafiles include not just single-codepoint confusables, but also confusables that may in some cases require multiple code points. For example, here’s one such set, this time in file-native format:

#       C̦       С̡       Ç       Ҫ
        (‎ C̦ ‎) 0043 0326        LATIN CAPITAL LETTER C, COMBINING COMMA BELOW
←       (‎ С̡ ‎) 0421 0321        CYRILLIC CAPITAL LETTER ES, COMBINING PALATALIZED HOOK BELOW
←       (‎ Ç ‎) 00C7     LATIN CAPITAL LETTER C WITH CEDILLA    # →Ҫ→→С̡→
←       (‎ Ҫ ‎) 04AA     CYRILLIC CAPITAL LETTER ES WITH DESCENDER      # →С̡→

Isn’t that swell? The only hitch is that unless you use the ICU classes, you’ll have to roll your own from the UTS#39 datafiles.
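Rolling your own could look roughly like the Python sketch below. Several things here are assumptions: the SAMPLE lines are written in the semicolon-separated layout of the current confusables.txt (source ; target sequence ; type # comment) and should be checked against the real file, the function names are mine, and skeleton is a simplified version of the procedure UTS#39 describes (NFD, map each character to its prototype, NFD again). Two strings are treated as confusable when their skeletons match.

```python
import unicodedata

# Illustrative lines in the layout of confusables.txt; verify against the real file.
SAMPLE = """\
# source ; target ; type # comment
0049 ; 006C ; MA # ( I → l ) LATIN CAPITAL LETTER I → LATIN SMALL LETTER L
2160 ; 006C ; MA # ( Ⅰ → l ) ROMAN NUMERAL ONE → LATIN SMALL LETTER L
0399 ; 006C ; MA # ( Ι → l ) GREEK CAPITAL LETTER IOTA → LATIN SMALL LETTER L
04AA ; 0043 0326 ; MA # CYRILLIC CAPITAL LETTER ES WITH DESCENDER
"""

def parse_confusables(text: str) -> dict[str, str]:
    """Map each source character to its prototype (possibly several code points)."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        source = chr(int(fields[0], 16))
        target = "".join(chr(int(h, 16)) for h in fields[1].split())
        table[source] = target
    return table

def skeleton(s: str, table: dict[str, str]) -> str:
    """Simplified skeleton: NFD, replace by prototypes, NFD again."""
    nfd = unicodedata.normalize("NFD", s)
    mapped = "".join(table.get(ch, ch) for ch in nfd)
    return unicodedata.normalize("NFD", mapped)

def confusable(a: str, b: str, table: dict[str, str]) -> bool:
    return skeleton(a, table) == skeleton(b, table)

table = parse_confusables(SAMPLE)
print(confusable("\N{ROMAN NUMERAL ONE}", "I", table))           # True
print(confusable("\N{GREEK CAPITAL LETTER IOTA}", "l", table))   # True
print(confusable("A", "I", table))                               # False
```

To use the real data, read the downloaded file (opening it with encoding="utf-8-sig" copes with a leading byte order mark) and pass its text to parse_confusables instead of SAMPLE.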

Since there are no other language bindings that I am aware of, I’ve added to my ᴛᴏᴅᴏ list to create Perl bindings to mimic the ICU style of writing \p{X-Confusable=I} in the regex engine.

Note that you may also wish to consider both UTS#36 and UTS#39, which the ICU SpoofChecker class handles for you. It’s specifically for URI-type things (read: Internet identifiers, which use a restricted character set), not just any old arbitrary text.

tchrist answered Nov 14 '22 09:11


Yes. Look in UnicodeData.txt:

2160;ROMAN NUMERAL ONE;Nl;0;L;<compat> 0049;;;1;N;;;;2170;
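To read that record programmatically: the sixth semicolon-separated field is the decomposition mapping, and a leading tag in angle brackets such as <compat> marks it as a compatibility (rather than canonical) decomposition. A small Python sketch, with a helper name of my own choosing, that pulls this out and cross-checks it against the standard unicodedata module:

```python
import unicodedata

RECORD = "2160;ROMAN NUMERAL ONE;Nl;0;L;<compat> 0049;;;1;N;;;;2170;"

def parse_decomposition(record: str):
    """Return (code point, tag or None, decomposed string) for one UnicodeData.txt line."""
    fields = record.split(";")
    code_point = int(fields[0], 16)
    parts = fields[5].split()          # field 5 is the decomposition mapping
    tag = None
    if parts and parts[0].startswith("<"):
        tag = parts[0].strip("<>")     # a tag means compatibility decomposition
        parts = parts[1:]
    mapping = "".join(chr(int(h, 16)) for h in parts)
    return code_point, tag, mapping

cp, tag, mapping = parse_decomposition(RECORD)
print(f"U+{cp:04X} -> {mapping!r} (tag: {tag})")   # U+2160 -> 'I' (tag: compat)

# The same information as exposed by Python's own Unicode database.
print(unicodedata.decomposition(chr(cp)))          # <compat> 0049
```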
dan04 answered Nov 14 '22 07:11


The answer by @dan04 is the correct answer to the explicit question, but the indirect question “if characters that are visually indistinguishable have compatibility equivalence” has a more complicated answer.

As a rule, canonically equivalent characters or character sequences are supposed to look similar. They are, roughly speaking, different presentations of the same intuitive character as encoded characters. This however depends on several practical considerations, and the renderings might in fact be different.

On the other hand, characters can be visually indistinguishable, with identical renderings (glyphs) in practically every font, and still not be equivalent in any way. For example, any normal font that contains the capital Latin letter A, the capital Greek letter alpha, and the capital Cyrillic letter A has identical glyphs for them, but they are still completely distinct characters, with no equivalence mapping between them.
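This is easy to confirm with a few lines of Python (a sketch using the standard unicodedata module; the helper name distinct_under is mine). None of the four normalization forms maps any of the three letters onto another:

```python
import unicodedata

names = [
    "LATIN CAPITAL LETTER A",
    "GREEK CAPITAL LETTER ALPHA",
    "CYRILLIC CAPITAL LETTER A",
]
chars = [unicodedata.lookup(n) for n in names]

def distinct_under(form: str) -> bool:
    """True if all three letters stay different after normalizing to `form`."""
    normalized = {unicodedata.normalize(form, c) for c in chars}
    return len(normalized) == len(chars)

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, distinct_under(form))   # True for every form
```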

Compatibility equivalent characters may differ in presentation, and they often do, partly because their difference is often stylistic. But they need not differ.

Jukka K. Korpela answered Nov 14 '22 09:11