Unicode defines two kinds of equivalence: canonical equivalence and compatibility equivalence. The example in Unicode Standard Annex #15 for compatibility equivalence is SUPERSCRIPT ONE (U+00B9) and DIGIT ONE (U+0031). It doesn't discuss characters that are visually indistinguishable.
I am curious if characters that are visually indistinguishable have compatibility equivalence under the standard.
Thanks.
ᴇᴅɪᴛ: Added exactly what the original question is looking for at the bottom. This is really cool.
The answer to your question about ʀᴏᴍᴀɴ ɴᴜᴍᴇʀᴀʟ ᴏɴᴇ and ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ɪ is YES. Here’s a quick way to check:
$ perl -Mcharnames=:full -MUnicode::Normalize -le 'print
NFKD "\N{ROMAN NUMERAL ONE}" eq NFKD "\N{LATIN CAPITAL LETTER I}"'
1
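If you don't have Perl handy, roughly the same check can be sketched in Python using only the standard unicodedata module. The helper name nfkd_equal is my own invention for illustration, not part of any library:

```python
import unicodedata

def nfkd_equal(a, b):
    """Return True if a and b are identical after NFKD normalization."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)

# ROMAN NUMERAL ONE (U+2160) versus LATIN CAPITAL LETTER I (U+0049)
print(nfkd_equal("\N{ROMAN NUMERAL ONE}", "\N{LATIN CAPITAL LETTER I}"))  # True

# The UAX #15 example from the question: SUPERSCRIPT ONE versus DIGIT ONE
print(nfkd_equal("\N{SUPERSCRIPT ONE}", "\N{DIGIT ONE}"))                 # True
```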
However, the answer to your question as to whether characters that are visually indistinguishable have compatibility equivalence is most definitely NO!
For example, ᴄʜᴇʀᴏᴋᴇᴇ ʟᴇᴛᴛᴇʀ ɢᴏ (Ꭺ) looks like ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ᴀ (A), but is certainly not NFKD equivalent. Similarly with ɢʀᴇᴇᴋ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ᴀʟᴘʜᴀ (Α) and ᴄʏʀɪʟʟɪᴄ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ᴀ (А) not being NFKD equivalent. There are effectively uncountably many (well, I can’t count them :) such issues. The only code points that are NFKD-equiv to ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ᴀ, for example, are:
U+00041 A GC=Lu SC=Latin LATIN CAPITAL LETTER A
U+01D2C ᴬ GC=Lm SC=Latin MODIFIER LETTER CAPITAL A
U+024B6 Ⓐ GC=So SC=Common CIRCLED LATIN CAPITAL LETTER A
U+0FF21 Ａ GC=Lu SC=Latin FULLWIDTH LATIN CAPITAL LETTER A
U+1D400 𝐀 GC=Lu SC=Common MATHEMATICAL BOLD CAPITAL A
U+1D434 𝐴 GC=Lu SC=Common MATHEMATICAL ITALIC CAPITAL A
U+1D468 𝑨 GC=Lu SC=Common MATHEMATICAL BOLD ITALIC CAPITAL A
U+1D49C 𝒜 GC=Lu SC=Common MATHEMATICAL SCRIPT CAPITAL A
U+1D4D0 𝓐 GC=Lu SC=Common MATHEMATICAL BOLD SCRIPT CAPITAL A
U+1D504 𝔄 GC=Lu SC=Common MATHEMATICAL FRAKTUR CAPITAL A
U+1D538 𝔸 GC=Lu SC=Common MATHEMATICAL DOUBLE-STRUCK CAPITAL A
U+1D56C 𝕬 GC=Lu SC=Common MATHEMATICAL BOLD FRAKTUR CAPITAL A
U+1D5A0 𝖠 GC=Lu SC=Common MATHEMATICAL SANS-SERIF CAPITAL A
U+1D5D4 𝗔 GC=Lu SC=Common MATHEMATICAL SANS-SERIF BOLD CAPITAL A
U+1D608 𝘈 GC=Lu SC=Common MATHEMATICAL SANS-SERIF ITALIC CAPITAL A
U+1D63C 𝘼 GC=Lu SC=Common MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL A
U+1D670 𝙰 GC=Lu SC=Common MATHEMATICAL MONOSPACE CAPITAL A
U+1F130 🄰 GC=So SC=Common SQUARED LATIN CAPITAL LETTER A
Similarly, here are the codepoints that are NFKD equiv to the ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ɪ you were looking at:
U+00049 I GC=Lu SC=Latin LATIN CAPITAL LETTER I
U+01D35 ᴵ GC=Lm SC=Latin MODIFIER LETTER CAPITAL I
U+02110 ℐ GC=Lu SC=Common SCRIPT CAPITAL I
U+02111 ℑ GC=Lu SC=Common BLACK-LETTER CAPITAL I
U+02160 Ⅰ GC=Nl SC=Latin ROMAN NUMERAL ONE
U+024BE Ⓘ GC=So SC=Common CIRCLED LATIN CAPITAL LETTER I
U+0FF29 Ｉ GC=Lu SC=Latin FULLWIDTH LATIN CAPITAL LETTER I
U+1D408 𝐈 GC=Lu SC=Common MATHEMATICAL BOLD CAPITAL I
U+1D43C 𝐼 GC=Lu SC=Common MATHEMATICAL ITALIC CAPITAL I
U+1D470 𝑰 GC=Lu SC=Common MATHEMATICAL BOLD ITALIC CAPITAL I
U+1D4D8 𝓘 GC=Lu SC=Common MATHEMATICAL BOLD SCRIPT CAPITAL I
U+1D540 𝕀 GC=Lu SC=Common MATHEMATICAL DOUBLE-STRUCK CAPITAL I
U+1D574 𝕴 GC=Lu SC=Common MATHEMATICAL BOLD FRAKTUR CAPITAL I
U+1D5A8 𝖨 GC=Lu SC=Common MATHEMATICAL SANS-SERIF CAPITAL I
U+1D5DC 𝗜 GC=Lu SC=Common MATHEMATICAL SANS-SERIF BOLD CAPITAL I
U+1D610 𝘐 GC=Lu SC=Common MATHEMATICAL SANS-SERIF ITALIC CAPITAL I
U+1D644 𝙄 GC=Lu SC=Common MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL I
U+1D678 𝙸 GC=Lu SC=Common MATHEMATICAL MONOSPACE CAPITAL I
U+1F138 🄸 GC=So SC=Common SQUARED LATIN CAPITAL LETTER I
Notice there’s no ɢʀᴇᴇᴋ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ɪᴏᴛᴀ there, just as one example.
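Lists like the two above can be regenerated by brute force: normalize every code point and keep the ones whose NFKD form is the target letter. Here is a minimal sketch in Python, assuming only the standard unicodedata module; the function name nfkd_equivalents is hypothetical. The exact output depends on the Unicode version your Python ships with, so newer versions may list a few more code points than I show here.

```python
import sys
import unicodedata

def nfkd_equivalents(target):
    """Return all code points whose NFKD form is exactly `target`."""
    hits = []
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:      # skip surrogate code points
            continue
        if unicodedata.normalize("NFKD", chr(cp)) == target:
            hits.append(cp)
    return hits

for cp in nfkd_equivalents("I"):
    # Every hit is an assigned character, but pass a default name to be safe.
    print(f"U+{cp:05X} {chr(cp)} {unicodedata.name(chr(cp), '<unnamed>')}")
```

GREEK CAPITAL LETTER IOTA (U+0399) never shows up in that output, because it has no decomposition mapping to the Latin letter.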
You can’t use NFKD to find lookalikes, and some things that are NFKD equiv don’t look much alike. So you can’t do it that way in the general case. It’s not a problem you can even begin to look at without looking at actual fonts.
I believe ICU has an extended, non-standard property for this, like \p{X-Confusable=A}. I downloaded their datafiles for this, but haven’t played with it much yet.
It turns out that UTS #39, Unicode Security Mechanisms, has exactly what you are looking for. If you fetch its raw, plaintext datafiles, you will be able to determine which code points are potentially confusable with one another.
For example, in the text earlier in this message, I enumerated the code points that were NFKD equivalent to ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ɪ, and pointed out that many potential confusables were missing from that set. That’s because the NFKD mapping is not designed to detect confusables. However, the datafiles from UTS#39 very much are designed for just that very purpose.
To redo my ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ɪ enumeration, updating it to handle all code points that UTS#39 deems mutually confusable with it, we have these, formatted using unichars and sorted in order of the Unicode Collation Algorithm using ucsort:
U+0007C | GC=Sm SC=Common VERTICAL LINE
U+02223 ∣ GC=Sm SC=Common DIVIDES
U+0FFE8 ￨ GC=So SC=Common HALFWIDTH FORMS LIGHT VERTICAL
U+00031 1 GC=Nd SC=Common DIGIT ONE
U+1D7CF 𝟏 GC=Nd SC=Common MATHEMATICAL BOLD DIGIT ONE
U+1D7D9 𝟙 GC=Nd SC=Common MATHEMATICAL DOUBLE-STRUCK DIGIT ONE
U+1D7E3 𝟣 GC=Nd SC=Common MATHEMATICAL SANS-SERIF DIGIT ONE
U+1D7ED 𝟭 GC=Nd SC=Common MATHEMATICAL SANS-SERIF BOLD DIGIT ONE
U+1D7F7 𝟷 GC=Nd SC=Common MATHEMATICAL MONOSPACE DIGIT ONE
U+00049 I GC=Lu SC=Latin LATIN CAPITAL LETTER I
U+0FF29 Ｉ GC=Lu SC=Latin FULLWIDTH LATIN CAPITAL LETTER I
U+02160 Ⅰ GC=Nl SC=Latin ROMAN NUMERAL ONE
U+02110 ℐ GC=Lu SC=Common SCRIPT CAPITAL I
U+02111 ℑ GC=Lu SC=Common BLACK-LETTER CAPITAL I
U+1D408 𝐈 GC=Lu SC=Common MATHEMATICAL BOLD CAPITAL I
U+1D43C 𝐼 GC=Lu SC=Common MATHEMATICAL ITALIC CAPITAL I
U+1D470 𝑰 GC=Lu SC=Common MATHEMATICAL BOLD ITALIC CAPITAL I
U+1D4D8 𝓘 GC=Lu SC=Common MATHEMATICAL BOLD SCRIPT CAPITAL I
U+1D540 𝕀 GC=Lu SC=Common MATHEMATICAL DOUBLE-STRUCK CAPITAL I
U+1D574 𝕴 GC=Lu SC=Common MATHEMATICAL BOLD FRAKTUR CAPITAL I
U+1D5A8 𝖨 GC=Lu SC=Common MATHEMATICAL SANS-SERIF CAPITAL I
U+1D5DC 𝗜 GC=Lu SC=Common MATHEMATICAL SANS-SERIF BOLD CAPITAL I
U+1D610 𝘐 GC=Lu SC=Common MATHEMATICAL SANS-SERIF ITALIC CAPITAL I
U+1D644 𝙄 GC=Lu SC=Common MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL I
U+1D678 𝙸 GC=Lu SC=Common MATHEMATICAL MONOSPACE CAPITAL I
U+00196 Ɩ GC=Lu SC=Latin LATIN CAPITAL LETTER IOTA
U+0006C l GC=Ll SC=Latin LATIN SMALL LETTER L
U+0FF4C ｌ GC=Ll SC=Latin FULLWIDTH LATIN SMALL LETTER L
U+0217C ⅼ GC=Nl SC=Latin SMALL ROMAN NUMERAL FIFTY
U+02113 ℓ GC=Ll SC=Common SCRIPT SMALL L
U+1D425 𝐥 GC=Ll SC=Common MATHEMATICAL BOLD SMALL L
U+1D459 𝑙 GC=Ll SC=Common MATHEMATICAL ITALIC SMALL L
U+1D48D 𝒍 GC=Ll SC=Common MATHEMATICAL BOLD ITALIC SMALL L
U+1D4C1 𝓁 GC=Ll SC=Common MATHEMATICAL SCRIPT SMALL L
U+1D4F5 𝓵 GC=Ll SC=Common MATHEMATICAL BOLD SCRIPT SMALL L
U+1D529 𝔩 GC=Ll SC=Common MATHEMATICAL FRAKTUR SMALL L
U+1D55D 𝕝 GC=Ll SC=Common MATHEMATICAL DOUBLE-STRUCK SMALL L
U+1D591 𝖑 GC=Ll SC=Common MATHEMATICAL BOLD FRAKTUR SMALL L
U+1D5C5 𝗅 GC=Ll SC=Common MATHEMATICAL SANS-SERIF SMALL L
U+1D5F9 𝗹 GC=Ll SC=Common MATHEMATICAL SANS-SERIF BOLD SMALL L
U+1D62D 𝘭 GC=Ll SC=Common MATHEMATICAL SANS-SERIF ITALIC SMALL L
U+1D661 𝙡 GC=Ll SC=Common MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL L
U+1D695 𝚕 GC=Ll SC=Common MATHEMATICAL MONOSPACE SMALL L
U+001C0 ǀ GC=Lo SC=Latin LATIN LETTER DENTAL CLICK
U+00399 Ι GC=Lu SC=Greek GREEK CAPITAL LETTER IOTA
U+1D6B0 𝚰 GC=Lu SC=Common MATHEMATICAL BOLD CAPITAL IOTA
U+1D6EA 𝛪 GC=Lu SC=Common MATHEMATICAL ITALIC CAPITAL IOTA
U+1D724 𝜤 GC=Lu SC=Common MATHEMATICAL BOLD ITALIC CAPITAL IOTA
U+1D75E 𝝞 GC=Lu SC=Common MATHEMATICAL SANS-SERIF BOLD CAPITAL IOTA
U+1D798 𝞘 GC=Lu SC=Common MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL IOTA
U+02C92 Ⲓ GC=Lu SC=Coptic COPTIC CAPITAL LETTER IAUDA
U+00406 І GC=Lu SC=Cyrillic CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
U+004C0 Ӏ GC=Lu SC=Cyrillic CYRILLIC LETTER PALOCHKA
U+005D5 ו GC=Lo SC=Hebrew HEBREW LETTER VAV
U+005DF ן GC=Lo SC=Hebrew HEBREW LETTER FINAL NUN
U+007CA ߊ GC=Lo SC=Nko NKO LETTER A
U+02D4F ⵏ GC=Lo SC=Tifinagh TIFINAGH LETTER YAN
U+0A4F2 ꓲ GC=Lo SC=Lisu LISU LETTER I
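To give a rough idea of how an enumeration like this can be built, here is a small Python sketch that groups code points by the prototype they map to. I am assuming the "source ; target ; type # comment" layout used by recent versions of confusables.txt. The SAMPLE lines are hand-written in that style for illustration and are not copied from the real file, and both function names are hypothetical.

```python
from collections import defaultdict

# Hand-written sample lines in the assumed layout (not the real datafile).
SAMPLE = """\
# illustrative sample only
0031 ; 006C ; MA # DIGIT ONE -> LATIN SMALL LETTER L
0049 ; 006C ; MA # LATIN CAPITAL LETTER I -> LATIN SMALL LETTER L
0399 ; 006C ; MA # GREEK CAPITAL LETTER IOTA -> LATIN SMALL LETTER L
0410 ; 0041 ; MA # CYRILLIC CAPITAL LETTER A -> LATIN CAPITAL LETTER A
"""

def hex_to_str(field):
    """Turn a space-separated list of hex code points into a string."""
    return "".join(chr(int(h, 16)) for h in field.split())

def parse_confusables(lines):
    """Map each source string to its prototype (target) string."""
    table = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()   # drop comments
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 2:                    # blank or comment-only line
            continue
        table[hex_to_str(fields[0])] = hex_to_str(fields[1])
    return table

def confusable_sets(table):
    """Group sources (and the prototype itself) under each prototype."""
    sets = defaultdict(set)
    for source, target in table.items():
        sets[target].update((source, target))
    return dict(sets)

table = parse_confusables(SAMPLE.splitlines())
for proto, members in confusable_sets(table).items():
    print(repr(proto), sorted(members))
```

To run this against the real datafile, you would pass the lines of your downloaded copy to parse_confusables instead of SAMPLE; opening the file with encoding="utf-8-sig" takes care of a leading byte order mark if there is one.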
Nifty though that is, it gets even better. The datafiles include not just single-codepoint confusables, but also confusables that may in some cases require multiple code points. For example, here’s one such set, this time in file-native format:
# C̦ С̡ Ç Ҫ
( C̦ ) 0043 0326 LATIN CAPITAL LETTER C, COMBINING COMMA BELOW
← ( С̡ ) 0421 0321 CYRILLIC CAPITAL LETTER ES, COMBINING PALATALIZED HOOK BELOW
← ( Ç ) 00C7 LATIN CAPITAL LETTER C WITH CEDILLA # →Ҫ→→С̡→
← ( Ҫ ) 04AA CYRILLIC CAPITAL LETTER ES WITH DESCENDER # →С̡→
Isn’t that swell? The only hitch is that unless you use the ICU classes, you’ll have to roll your own from the UTS#39 datafiles.
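A rough idea of what rolling your own might look like: UTS #39 describes a "skeleton" operation that, roughly speaking, converts a string to NFD, replaces each character by its prototype from the datafile, and applies NFD again. Two strings count as confusable when their skeletons match. The Python sketch below is only that, a sketch. The PROTOTYPES table is a hypothetical mini-table modelled on the C-with-hook set shown above rather than loaded from the real file, and I skip refinements such as the handling of ignorable code points.

```python
import unicodedata

# Hypothetical mini-table modelled on the set shown above; a real
# implementation would load every mapping from the UTS #39 datafile.
PROTOTYPES = {
    "\u0421": "C",        # CYRILLIC CAPITAL LETTER ES
    "\u0321": "\u0326",   # COMBINING PALATALIZED HOOK BELOW
    "\u0327": "\u0326",   # COMBINING CEDILLA
    "\u04AA": "C\u0326",  # CYRILLIC CAPITAL LETTER ES WITH DESCENDER
}

def skeleton(text, prototypes=PROTOTYPES):
    """NFD, map each character to its prototype, then NFD again."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(prototypes.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)

def confusable(a, b):
    """True if the two strings share a skeleton."""
    return skeleton(a) == skeleton(b)

print(confusable("\u00C7", "\u04AA"))        # Ç versus Ҫ
print(confusable("\u0421\u0321", "\u00C7"))  # С̡ versus Ç
print(confusable("\u00C7", "C"))             # Ç versus plain C
```

Note that a single code point such as U+04AA can map to a two-code-point prototype, which is why the table maps strings to strings instead of characters to characters.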
Since there are no other language bindings that I am aware of, I’ve added to my ᴛᴏᴅᴏ list to create Perl bindings to mimic the ICU style of writing \p{X-Confusable=I} in the regex engine.
Note that you may also wish to consider both UTS#36 and UTS#39, which the ICU SpoofChecker class handles for you. It’s specifically for URI-type things (read: Internet identifiers, which use a restricted character set), not just any old arbitrary text.
Yes. Look in UnicodeData.txt:
2160;ROMAN NUMERAL ONE;Nl;0;L;<compat> 0049;;;1;N;;;;2170;
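The sixth semicolon-separated field of that line is the decomposition mapping, and the <compat> tag marks it as a compatibility (not canonical) decomposition. As a small sketch, the same information can be read either from the line itself or from Python's standard unicodedata module; the variable names are mine:

```python
import unicodedata

line = "2160;ROMAN NUMERAL ONE;Nl;0;L;<compat> 0049;;;1;N;;;;2170;"

# Field 5 (counting from 0) of a UnicodeData.txt record is the decomposition.
fields = line.split(";")
print(fields[5])                                           # <compat> 0049

# The same mapping as exposed by the unicodedata module.
print(unicodedata.decomposition("\N{ROMAN NUMERAL ONE}"))  # <compat> 0049

# A lookalike with no mapping at all yields an empty string.
print(repr(unicodedata.decomposition("\N{GREEK CAPITAL LETTER IOTA}")))
```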
The answer by @dan04 is the correct answer to the explicit question, but the indirect question “if characters that are visually indistinguishable have compatibility equivalence” has a more complicated answer.
As a rule, canonically equivalent characters or character sequences are supposed to look similar. They are, roughly speaking, different presentations of the same intuitive character as encoded characters. This however depends on several practical considerations, and the renderings might in fact be different.
On the other hand, characters can be visually indistinguishable, with identical renderings (glyphs) in every known font, and still not be equivalent in any way. For example, any normal font that contains the capital Latin letter A, the capital Greek letter alpha, and the capital Cyrillic letter A has identical glyphs for them, but they are still completely distinct characters, with no equivalence mapping between them.
Compatibility equivalent characters may differ in presentation, and they often do, partly because their difference is often stylistic. But they need not differ.