Let's say I have the characters Ú, Ù, Ü. All of them are glyphically similar to the English U. Is there some list or algorithm to do this:

Given a Ú or Ù or Ü, return the English U
Given an English U, return the list of all U-similar characters

I'm not sure whether the code point of a Unicode character is the same across all fonts. If it is, I suppose there could be some easy and efficient way to do this?

UPDATE: If you're using Ruby, the unicode-confusable gem is available for this and may help in some cases.



Find characters that are similar glyphically in Unicode?

2 Answers

It is very unclear what you are asking to do here.

There are characters whose canonical decompositions all start with the same base character: e, é, ê, ë, ē, ĕ, ė, ę, ě, ȅ, ȇ, ȩ, ḕ, ḗ, ḙ, ḛ, ḝ, ẹ, ẻ, ẽ, ế, ề, ể, ễ, ệ, e̳, … or s, ś, ŝ, ş, š, ș, ṡ, ṣ, ṥ, ṧ, ṩ, ….
There are characters whose compatibility decompositions all include a particular character: ᵉ, ₑ, ℯ, ⅇ, ⒠, ⓔ, ㋍, ㋎, ｅ, … or s, ſ, ˢ, ẛ, ₨, ℁, ⒮, ⓢ, ㎧, ㎨, ㎮, ㎯, ㎰, ㎱, ㎲, ㎳, ㏛, ﬅ, ﬆ, ｓ, … or R, ᴿ, ₨, ℛ, ℜ, ℝ, Ⓡ, ㏚, Ｒ, ….
There are characters that just happen to look alike in some fonts: ß and β and ϐ, or 3 and Ʒ and Ȝ and ȝ and ʒ and ӡ and ᴣ, or ɣ and ɤ and γ, or F and Ϝ and ϝ, or B and Β and В, or ∅ and ○ and 0 and O and ০ and ੦ and ౦ and ૦, or 1 and l and I and Ⅰ and ᛁ and | and ǀ and ∣, ….
Characters that are the same case-insensitively, like s and S and ſ, or ss and Ss and SS and ß and ẞ, ….
Characters that all have the same numeric value, like all these for the value 1: 1¹١۱߁१১੧૧୧௧౧౹౼೧൧๑໑༡၁႑፩១៱᠑᥇᧑᧚᪁᪑᭑᮱᱁᱑₁⅟ ① ⑴ ⒈ ⓵ ❶➀➊꘡꣑꤁꧑꩑꯱𐄇𐅂𐅘𐅙𐅚𐌠𐏑𐒡𐡘𐤖𐩀𐩽𐭘𐭸𐹠𒐕𒐞𒐬𒐴𒑏𒑘𝍠𝟏𝟙𝟣𝟭𝟷 🄂 Ⅰⅰꛦ㆒㈠㊀𑁒𑁧.
Characters that all have the same primary collation strength, like all these that are the same as d: DdÐðĎďĐđ◌ͩᴰᵈᶞ◌ᷘ◌ᷙḊḋḌḍḎḏḐḑḒḓⅅⅆⅮⅾ Ⓓ ⓓ ꝹꝺＤｄ𝐃𝐝𝐷𝑑𝑫𝒅𝒟𝒹𝓓𝓭𝔇𝔡𝔻𝕕𝕯𝖉𝖣𝖽𝗗𝗱𝘋𝘥𝘿𝙙𝙳𝚍 🄳 🅓 🅳 🇩 . Note that some of those are not accessible through any kind of decomposition, but only through the DUCET/UCA values; for example, the fairly common ð or the newish ꝺ can be equated to d only through a primary UCA strength comparison; same with ƶ and z, ȼ and c, etc.
Characters that are the same in certain locales, like æ and ae, or ä and ae, or ä and aa, or MacKinley and McKinley, …. Note that locale can make a really big difference, since in some locales c and ç are treated as the same character while in others they are not; similarly for n and ñ, or a and á and ã, ….
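A few of the categories above can be probed with Python's standard library alone. The following is only a rough sketch: the helper names `canonical_base` and `compatibility_form` are made up for illustration, and it covers just the canonical, compatibility, case-folding, and numeric-value cases. Collation strength, locale rules, and font look-alikes need other data (for example the UCA tables or the Unicode confusables list) and are not shown.

```python
import unicodedata as ud

def canonical_base(ch):
    """Return the first code point of the canonical (NFD) decomposition."""
    return ud.normalize('NFD', ch)[0]

def compatibility_form(ch):
    """Return the full compatibility (NFKD) decomposition, possibly several characters."""
    return ud.normalize('NFKD', ch)

print(canonical_base('é'))                 # canonical: é decomposes to e + combining acute
print(compatibility_form('ⓔ'))             # compatibility: circled e maps to e
print(compatibility_form('ﬆ'))             # compatibility: the st ligature maps to s + t
print('ſ'.casefold(), 'ß'.casefold())      # case folding: long s and sharp s
print(ud.numeric('①'), ud.numeric('١'))    # numeric value shared by unrelated digits
print(ud.normalize('NFKD', 'ð') == 'ð')    # ð has no decomposition at all
```

The last line illustrates the point made about ð: no normalization form relates it to d, so a decomposition-based approach cannot find it.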

Some of these can be handled. Some cannot. All require different approaches depending on different needs.

What is your real goal?

answered Oct 07 '22 07:10

tchrist

This won't work for all conditions, but one way to get rid of most accents is to convert the characters to their decomposed form, then throw away the combining accents:

import unicodedata as ud

s = 'U, Ù, Ú, Û, Ü, Ũ, Ū, Ŭ, Ů, Ű, Ų, Ư, Ǔ, Ǖ, Ǘ, Ǚ, Ǜ, Ụ, Ủ, Ứ, Ừ, Ử, Ữ, Ự'
# Decompose to NFD, then drop every non-ASCII code point (the combining marks included)
print(ud.normalize('NFD', s).encode('ascii', 'ignore').decode('ascii'))

Output

U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U
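One of the conditions where the ASCII-encoding trick falls short is text containing non-ASCII base characters (ð, β, and so on), because those are discarded along with the accents. A possible variant, sketched here under the assumption that you only want to remove combining marks, filters on the canonical combining class instead. The function name `strip_marks` is hypothetical.

```python
import unicodedata as ud

def strip_marks(text):
    """Drop code points with a non-zero canonical combining class; keep all others."""
    decomposed = ud.normalize('NFD', text)
    stripped = ''.join(ch for ch in decomposed if not ud.combining(ch))
    # Recompose whatever is left (e.g. Hangul syllables that NFD split into jamo)
    return ud.normalize('NFC', stripped)

print(strip_marks('Ù, Ú, Û, Ü'))   # U, U, U, U
print(strip_marks('ð, β, ñ'))      # ð, β, n
```

Characters such as ð that have no decomposition pass through unchanged, so this still does not map them to an ASCII letter.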

To find the accented characters for each letter, use something like:

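import string
import unicodedata as ud

def asc(ch):
    # Decompose to NFD, then drop every non-ASCII code point (the combining marks included)
    return ud.normalize('NFD', ch).encode('ascii', 'ignore').decode('ascii')

# Every code point in the Basic Multilingual Plane
U = ''.join(chr(i) for i in range(65536))
for c in string.ascii_letters:
    # All characters that reduce to the letter c, including c itself
    print(''.join(u for u in U if asc(u) == c))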

Output

aàáâãäåāăąǎǟǡǻȁȃȧḁạảấầẩẫậắằẳẵặ
bḃḅḇ
cçćĉċčḉ
dďḋḍḏḑḓ
eèéêëēĕėęěȅȇȩḕḗḙḛḝẹẻẽếềểễệ
fḟ
 :
etc.
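For the reverse lookup the question asks about (given U, list the U-like characters), scanning the whole range once per letter is wasteful. A minimal sketch of an alternative is to build the mapping in a single pass and then look letters up in a dictionary. The names `build_base_index` and `index` are illustrative, and grouping by the first code point of the NFD form is an assumption about what "similar" should mean here, not the only possible rule.

```python
import unicodedata as ud
from collections import defaultdict

def build_base_index(limit=0x10000):
    """Map each base character to the characters whose NFD form starts with it."""
    index = defaultdict(list)
    for cp in range(limit):
        ch = chr(cp)
        decomposed = ud.normalize('NFD', ch)
        if decomposed != ch:               # skip characters that do not decompose
            index[decomposed[0]].append(ch)
    return index

index = build_base_index()
print(''.join(index['U']))   # accented variants of U
print(''.join(index['u']))   # accented variants of u
```

Unlike the loop above, the plain letter itself is not included in its own list, and only canonical decompositions are considered.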

answered Oct 07 '22 09:10

Mark Tolonen



Tags:

unicode

glyph

Zabba
