Sorting UTF-8 strings?

Tags:

c++

unicode

My std::strings are encoded in UTF-8, so the std::string < operator doesn't cut it. How can I compare two UTF-8-encoded std::strings?

Where it falls short is with accents: é sorts after z, which it should not.

Thanks

jmasterx asked Dec 09 '22 11:12


1 Answer

Comparing UTF-8 encoded strings byte by byte gives you a lexicographic ordering by code point. If that is not the ordering you want, you will need to decode your UTF-8 encoded strings into code points (UTF-32, or UTF-16 if you handle surrogate pairs; plain UCS-2 cannot represent code points above U+FFFF) and apply a suitable comparison function of your choosing.

To reiterate the point, the UTF-8 encoding mechanism is cleverly designed so that if you sort by looking at the numeric value of each encoded byte (treated as unsigned), you will get the same result as if you first decoded the string into Unicode and compared the numeric values of each code point. This is also why é (U+00E9) sorts after z (U+007A): its code point is simply larger.
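To illustrate the byte-order property, here is a minimal sketch. The helper names decode_utf8 and orders_agree are made up for this example, and the decoder assumes well-formed UTF-8 and performs no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 into code points.
// Assumes well-formed input; malformed sequences are NOT detected.
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        // The lead byte determines the sequence length (1 to 4 bytes).
        std::size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        // Keep only the payload bits of the lead byte.
        char32_t cp = len == 1 ? b : b & (0xFF >> (len + 1));
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Hypothetical check: does byte-wise order match code point order?
// std::string's operator< compares bytes as unsigned char values.
bool orders_agree(const std::string& a, const std::string& b) {
    return (a < b) == (decode_utf8(a) < decode_utf8(b));
}
```

For example, orders_agree("z", "\xC3\xA9") holds: both comparisons place "z" before "é", which is exactly the behaviour the question complains about.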

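Update: Your updated question indicates that you want a more complex comparison than a purely lexicographic sort, one that orders accented letters such as é alongside their base letters. For that, you will need to decode your UTF-8 strings and compare the decoded characters with a function that implements the ordering you want.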
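As a rough sketch of "decode, then compare", the code below folds a handful of accented Latin letters to their base letter before comparing. Everything here is an assumption for illustration: decode_utf8, fold, utf8_less and sort_utf8 are made-up names, the folding table covers only a few lowercase characters, and the input is assumed to be well-formed UTF-8. It is not real collation; for production use, a library that implements the Unicode Collation Algorithm (ICU, for instance) is the more realistic route.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decode well-formed UTF-8 into code points (no validation).
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        char32_t cp = len == 1 ? b : b & (0xFF >> (len + 1));
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Hypothetical, deliberately tiny folding table: a few accented
// lowercase letters are mapped to their unaccented base letter.
char32_t fold(char32_t cp) {
    switch (cp) {
        case 0xE0: case 0xE2: case 0xE4:            return U'a';  // à â ä
        case 0xE7:                                  return U'c';  // ç
        case 0xE8: case 0xE9: case 0xEA: case 0xEB: return U'e';  // è é ê ë
        case 0xEE: case 0xEF:                       return U'i';  // î ï
        case 0xF4: case 0xF6:                       return U'o';  // ô ö
        case 0xF9: case 0xFB: case 0xFC:            return U'u';  // ù û ü
        default:                                    return cp;
    }
}

// Compare by base letters first; break ties by raw code point so that
// the result is still a strict weak ordering usable with std::sort.
bool utf8_less(const std::string& a, const std::string& b) {
    std::u32string da = decode_utf8(a), db = decode_utf8(b);
    std::u32string fa = da, fb = db;
    std::transform(fa.begin(), fa.end(), fa.begin(), fold);
    std::transform(fb.begin(), fb.end(), fb.begin(), fold);
    if (fa != fb) return fa < fb;
    return da < db;
}

// Convenience wrapper: sort a vector of UTF-8 strings with utf8_less.
void sort_utf8(std::vector<std::string>& v) {
    std::sort(v.begin(), v.end(), utf8_less);
}
```

With this comparison, "été" sorts between "apple" and "zebra" rather than after "zebra". Case, other scripts, and language-specific rules are not handled; those are the parts a real collation library takes care of.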

Greg Hewgill answered Dec 22 '22 21:12