In a C program, I want to sort a list of valid UTF-8-encoded strings in Unicode code point order. No collation, no locale-awareness.
So I need a compare function. It's easy enough to write such a function that iterates over the Unicode characters. (I happen to be using GLib, so I'd iterate with g_utf8_next_char and compare the return values of g_utf8_get_char.)
But what I'm wondering, out of curiosity and possibly simplicity and efficiency, is: will a simple byte-for-byte strcmp (or g_strcmp0) actually do the same job? I'm thinking that it should, since UTF-8 encodes the most significant bits first, and a code point that needs encoding in N+1 bytes will have a larger initial byte than a code point that needs to be encoded in N bytes.
But maybe I'm missing something? Thanks in advance.
Yes, UTF-8 preserves codepoint order, so you can just use strcmp. That's one of the (many) beautiful points of UTF-8.
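As a minimal sketch of what that looks like in practice, assuming the strings live in an array of `const char *` and are all valid UTF-8: the names `cmp_utf8_bytes` and `sort_utf8` are made up for illustration and are not part of GLib or the C library. It relies on the C standard's guarantee that `strcmp` compares bytes as `unsigned char`, which is what makes lead bytes such as 0xC3 sort above ASCII.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator (hypothetical name). The array elements are
 * `const char *`, so each argument points at one such pointer. */
static int cmp_utf8_bytes(const void *a, const void *b)
{
    const char *sa = *(const char *const *)a;
    const char *sb = *(const char *const *)b;

    /* strcmp compares bytes as unsigned char, so for valid UTF-8
     * the byte order is the same as the code point order. */
    return strcmp(sa, sb);
}

/* Sort an array of valid, NUL-terminated UTF-8 strings in place
 * (hypothetical helper). */
static void sort_utf8(const char **strings, size_t n)
{
    assert(strings != NULL || n == 0);
    if (n > 1)
        qsort(strings, n, sizeof strings[0], cmp_utf8_bytes);
}
```

For example, sorting the four strings U+1F600, U+20AC, "z" and U+00E9 with `sort_utf8` should leave them in the order "z", U+00E9, U+20AC, U+1F600, because their first bytes are 0x7A, 0xC3, 0xE2 and 0xF0 respectively.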
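One caveat is that codepoints in Unicode are UTF-32 values, and some people who talk about collating Unicode strings in "codepoint" order are actually using the word "codepoint" incorrectly to mean "UTF-16 code unit". The two orders differ for characters outside the Basic Multilingual Plane: UTF-16 encodes them with surrogates (0xD800 to 0xDFFF), which sort before the characters U+E000 to U+FFFF even though the code points they encode are higher. If you want the order to match UTF-16 code unit collation, a good bit more work is involved.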
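To make the difference concrete, here is a small sketch that compares two single code points both ways. It assumes its inputs are valid Unicode scalar values (no surrogates, nothing above U+10FFFF), and the function names are invented for this example. It is not a full UTF-16-order comparator for UTF-8 strings, only an illustration of where the two orders disagree.

```c
#include <assert.h>
#include <stdint.h>

/* First UTF-16 code unit of a code point: the code point itself in
 * the BMP, otherwise the lead surrogate (hypothetical helper). */
static uint16_t utf16_first_unit(uint32_t cp)
{
    /* Assumption: cp is a valid scalar value. */
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000)
        return (uint16_t)cp;
    return (uint16_t)(0xD800 + ((cp - 0x10000) >> 10));
}

/* Three-way compare by code point value (this is the order that
 * strcmp gives on valid UTF-8). */
static int cmp_code_point(uint32_t a, uint32_t b)
{
    return (a > b) - (a < b);
}

/* Three-way compare of two code points by their UTF-16 code units. */
static int cmp_utf16_units(uint32_t a, uint32_t b)
{
    uint16_t ua = utf16_first_unit(a);
    uint16_t ub = utf16_first_unit(b);

    if (ua != ub)
        return (ua > ub) - (ua < ub);
    /* Same first unit: either the same BMP character or the same
     * lead surrogate, where the trail unit grows with the code point. */
    return (a > b) - (a < b);
}
```

With these definitions, `cmp_code_point(0xFFFD, 0x10000)` is negative while `cmp_utf16_units(0xFFFD, 0x10000)` is positive, since U+10000 starts with the code unit 0xD800, which is less than 0xFFFD.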