Canonical Unicode string form

I have a Unicode string encoded as, say, UTF-8. The same Unicode string can have several different byte representations. Is there a canonical (normalized) form of a Unicode string, so that such strings can be compared with, e.g., memcmp(3)? Can ICU or any other C/C++ library produce it?

Cartesius00 asked Mar 03 '26 13:03



1 Answer

You are probably looking for Unicode normalisation. There are four normal forms (NFC, NFD, NFKC and NFKD), and each one ensures that all strings it treats as equivalent end up with the same representation. Once two strings are in the same normal form and the same Unicode transformation format (such as UTF-8 or UTF-16), a byte-for-byte comparison tells you whether they are equivalent. In many cases, however, you need to take locale into account as well, so normalisation is a cheap way to get byte-wise equality but won't gain you much beyond that limited use case.
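As a rough illustration of the idea, here is a minimal sketch in Python using the standard library's `unicodedata.normalize`. The helper name `canonical_bytes` and the sample strings are made up for this example. The question asks about C/C++; ICU exposes the same normal forms through its `Normalizer2` API, and the principle is identical: normalise both strings to the same form, encode both the same way, then compare the bytes.

```python
import unicodedata


def canonical_bytes(text: str, form: str = "NFC") -> bytes:
    """Normalise text to the given Unicode normal form, then encode as UTF-8.

    Hypothetical helper for illustration; `form` is one of
    "NFC", "NFD", "NFKC" or "NFKD".
    """
    return unicodedata.normalize(form, text).encode("utf-8")


# Two canonically equivalent spellings of the same character:
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE (one code point)
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Without normalisation the UTF-8 byte sequences differ.
raw_equal = composed.encode("utf-8") == decomposed.encode("utf-8")

# After normalising both to the same form, a plain byte comparison
# (the equivalent of memcmp) reports them as equal.
normalized_equal = canonical_bytes(composed) == canonical_bytes(decomposed)

print(raw_equal, normalized_equal)
```

Which form to pick matters less than using the same one on both sides. NFC and NFD only unify canonically equivalent sequences, while NFKC and NFKD also fold compatibility variants (ligatures, for instance), which may or may not be what you want for an equality test.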

Joey answered Mar 06 '26 04:03
