I want to find visually identical characters for a specific character in Unicode. I know how to find canonical or compatibility decompositions of a character; but they do not give me what I want. I want to find characters that are visually identical (not similar), and their only difference can be their sizes.
for example I want : (s,S), or (S,S) (whose code points are different). I do not want (ß, β), or (e, é).
Any suggestions? Thanks.
Unicode supports more than a million code points, which are written with a "U" followed by a plus sign and the number in hex; for example, the word "Hello" is written U+0048 U+0065 U+006C U+006C U+006F (see hex chart).
The characters “U+” are an ASCIIfied version of the MULTISET UNION “⊎” U+228E character (the U-like union symbol with a plus sign inside it), which was meant to symbolize Unicode as the union of character sets. See Kenneth Whistler's explanation in the Unicode mailing list.
Homoglyphs are characters that resemble each other, such as the letter O and zero ('0'), the Latin letter “H” and the Cyrillic letter “H,” or the uppercase “I” (“I”) and the lowercase letter “l” (L), which look identical in a sans serif font (like Calibri).
Unicode. Unicode is a universal character set, ie. a standard that defines, in one place, all the characters needed for writing the majority of living languages in use on computers. It aims to be, and to a large extent already is, a superset of all other character sets that have been encoded.
For a particular character, you could start from annotations in the code charts in the Unicode standard. The annotations often refer to other characters for various reasons, including similarity or identity of shape. But the annotations are not meant to cover everything.
You could also draw your character at http://shapecatcher.com/ and ask it to recognize it. You often get a long list of visually similar alternatives.
As @TedHopp writes in his comment, visual identity is font-dependent. For example, “s” and “S” need not be identical in shape; in most fonts, they are not – the basic form is the same, but there are various differences in stroke width variation, curvature, serifs, etc. However, some characters can be expected to be visually identical in any font that contains them, such as Latin capital A, Greek capital alpha Α, and Cyrillic capital А.
You did not specify the purpose of the study, but you might be doing something that has been carried out to some extent by the Unicode Consortium. See UTR #6, Unicode Security Considerations, which also contains references to related work, including UTS #9, Unicode Security Mechanisms, which contains confusables.txt, Recommended confusable mapping for IDN (i.e., for a particular context, but it may be of interest for other purposes as well).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With