Canonical Unicode string form

Question

I have a Unicode string encoded, say, as UTF-8. The same Unicode string can have several different byte representations. Is there a canonical (normalized) form of a Unicode string, or can one be created, so that such strings can be compared with, for example, memcmp(3)? Can ICU or any other C/C++ library do that?

Joey · Accepted Answer

You are probably looking for Unicode normalisation. There are four normal forms (NFC, NFD, NFKC and NFKD), and each one ensures that all strings that are equivalent under that form end up with the same representation. That gives you a cheap byte-for-byte comparison, provided both strings use the same Unicode transformation format (UTF-8 or UTF-16, for example) and the same normal form. In many cases you need to take the locale into account as well, though, so normalisation won't gain you much beyond that limited use case.

Tags: c++, c, unicode, collation, unicode-normalization

Cartesius00
