Unicode comparison of Cyrillic 'С' and Latin 'C'

I have a dataset that mixes two Unicode characters: U+0421 ('С', Cyrillic capital letter Es) and U+0043 ('C', Latin capital letter C). Is there some sort of Unicode comparison that considers those two characters the same? So far I've tried several ICU collations, including the Russian one, without success.

275

asked Oct 14 '13 00:10

Peter Graham

1 Answer

There is no Unicode comparison that treats characters as the same on the basis of the visual identity of their glyphs, which is why no collation, the Russian one included, will equate these two letters. However, Unicode Technical Standard #39, Unicode Security Mechanisms, deals with “confusables”: characters that may be confused with each other due to visual identity or similarity. It includes a data file of confusables as well as “intentionally confusable” pairs, i.e. “characters whose glyphs in any particular typeface would probably be designed to be identical in shape when using a harmonized typeface design”. These are mainly pairs of Latin and Cyrillic or Greek letters, like C and С. ICU exposes this data separately from collation, through its spoof-checking API (`uspoof.h` in ICU4C, `SpoofChecker` in ICU4J), which can report whether two strings are confusable. If that is not available in your environment, you would need to code your own use of the data file.
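As a rough illustration of coding this yourself, here is a minimal Python sketch of the "skeleton" idea from UTS #39: map each character to a representative lookalike, then compare the mapped strings. The `LOOKALIKES` table and the function names `skeleton` and `looks_same` are my own, for illustration only. The table is a small hand-picked subset of Cyrillic-to-Latin pairs, not the real confusables data file, which you would want to load for any serious use.

```python
import unicodedata

# Hypothetical, hand-picked subset of Cyrillic -> Latin lookalikes.
# A real table should be built from the UTS #39 confusables data file.
LOOKALIKES = {
    "\u0410": "A",  # Cyrillic capital A
    "\u0415": "E",  # Cyrillic capital Ie
    "\u041E": "O",  # Cyrillic capital O
    "\u0420": "P",  # Cyrillic capital Er
    "\u0421": "C",  # Cyrillic capital Es
    "\u0425": "X",  # Cyrillic capital Ha
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043E": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
    "\u0445": "x",  # Cyrillic small ha
}
_TABLE = str.maketrans(LOOKALIKES)


def skeleton(text):
    """Decompose, replace known lookalikes, then decompose again."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", decomposed.translate(_TABLE))


def looks_same(a, b):
    """True if the two strings have the same skeleton under LOOKALIKES."""
    return skeleton(a) == skeleton(b)


if __name__ == "__main__":
    print(looks_same("\u0421", "C"))  # Cyrillic Es vs Latin C
    print("\u0421" == "C")            # plain comparison, for contrast
```

Comparing skeletons rather than raw strings leaves the stored data untouched, so you can still tell afterwards which script each value was originally written in.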

100

answered Sep 19 '22 12:09

Jukka K. Korpela

Tags: unicode, collation, normalization, unicode-normalization, accent-insensitive
