Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Unicode comparison of Cyrillic 'С' and Latin 'C'

I have a dataset which mixes use of unicode characters \u0421, 'С' and \u0043, 'C'. Is there some sort of unicode comparison which considers those two characters the same? So far I've tried several ICU collations, including the Russian one.

like image 275
Peter Graham Avatar asked Oct 14 '13 00:10

Peter Graham


People also ask

What is a Unicode character example?

The code point is a unique number for a character or some symbol such as an accent mark or ligature. Unicode supports more than a million code points, which are written with a "U" followed by a plus sign and the number in hex; for example, the word "Hello" is written U+0048 U+0065 U+006C U+006C U+006F (see hex chart).

What are Unicode values?

Unicode uses two encoding forms: 8-bit and 16-bit, based on the data type of the data that is being that is being encoded. The default encoding form is 16-bit, where each character is 16 bits (2 bytes) wide. Sixteen-bit encoding form is usually shown as U+hhhh, where hhhh is the hexadecimal code point of the character.

How many Unicode symbols are there?

Short answer: There are 1,111,998 possible Unicode characters. Longer answer: There are 17×216 – 2048 – 66 = 1,111,998 possible Unicode characters: seventeen 16-bit planes, with 2048 values reserved as surrogates, and 66 reserved as non-characters.

What is a Unicode alphabetic character?

The alphabetic characters are those UNICODE characters which are defined as letters by the UNICODE standard, e.g., the ASCII characters. ABCDEFGHIJKLMNOPQRSTUVWXYZ. abcdefghijklmnopqrstuvwxyz. and the international alphabetic characters from the character set: ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜßáíóúñѪºãõØøÀÃÕ


1 Answers

There is no Unicode comparison that treats characters as the same on the basis of visual identity of glyphs. However, Unicode Technical Standard #39, Unicode Security Mechanisms, deals with “confusables” – characters that may be confused with each other due to visual identity or similarity. It includes a data file of confusables as well as “intentionally confusable” pairs, i.e. “characters whose glyphs in any particular typeface would probably be designed to be identical in shape when using a harmonized typeface design”, which mainly consists of pairs of Latin and Cyrillic or Greek letters, like C and С. You would probably need to code your own use of this data, as ICU does not seem to have anything related to the confusable concept.

like image 100
Jukka K. Korpela Avatar answered Sep 19 '22 12:09

Jukka K. Korpela