
Will strcmp compare utf-8 strings in code point order?

In a C program, I want to sort a list of valid UTF-8-encoded strings in Unicode code point order. No collation, no locale-awareness.

So I need a compare function. It's easy enough to write one that iterates over the Unicode characters. (I happen to be using GLib, so I'd iterate with g_utf8_next_char and compare the return values of g_utf8_get_char.)

But what I'm wondering, out of curiosity and possibly for simplicity and efficiency, is: will a simple byte-for-byte strcmp (or g_strcmp0) actually do the same job? I'm thinking that it should, since UTF-8 encodes the most significant bits first, and a code point that needs encoding in N+1 bytes will have a larger initial byte than a code point that needs to be encoded in N bytes.

But maybe I'm missing something? Thanks in advance.

skagedal asked Aug 20 '13 07:08



1 Answer

Yes, UTF-8 preserves codepoint order, so you can just use strcmp. That's one of the (many) beautiful points of UTF-8.
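As a minimal sketch of what that looks like in practice, assuming every string is valid, NUL-terminated UTF-8: the comparator below just forwards to strcmp, which the C standard defines as comparing bytes as unsigned char, so the lead bytes 0xC2 and up sort above ASCII as required. The names cmp_utf8_bytes and sort_utf8_strings are made up for illustration and are not part of GLib or the C library.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator (hypothetical name): each array element is a pointer to a
 * NUL-terminated string that is assumed to be valid UTF-8. */
static int cmp_utf8_bytes(const void *a, const void *b)
{
    const char *sa = *(const char *const *)a;
    const char *sb = *(const char *const *)b;

    /* strcmp compares bytes as unsigned char, which for valid UTF-8
     * gives the same result as comparing code point by code point. */
    return strcmp(sa, sb);
}

/* Hypothetical helper: sort an array of UTF-8 strings in code point order. */
static void sort_utf8_strings(const char **strings, size_t count)
{
    qsort(strings, count, sizeof *strings, cmp_utf8_bytes);
}
```

Calling sort_utf8_strings on an array holding "z", "A", and the UTF-8 encodings of U+00E9, U+FF21 and U+1F600 should leave them ordered by code point value, with the 1-byte sequences first and the 4-byte sequence last.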

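One caveat is that codepoints in Unicode are UTF-32 values, and some people who talk about collating Unicode strings in "codepoint" order are actually using the word "codepoint" incorrectly to mean "UTF-16 code unit". The two orders differ because UTF-16 encodes codepoints above U+FFFF as surrogate pairs, whose code units (0xD800 to 0xDFFF) sort below U+E000 through U+FFFF. If you want the order to match UTF-16 code unit collation, a good bit more work is involved.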
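For the UTF-16 case, here is one possible sketch of that extra work, not a definitive implementation. It assumes both inputs are valid, NUL-terminated UTF-8, decodes one code point at a time, and at the first difference remaps U+E000..U+FFFF above the supplementary range so the result matches UTF-16 code unit order. All three function names are hypothetical.

```c
#include <stdint.h>

/* Decode one code point from VALID UTF-8 at *p and advance *p past it.
 * No error checking: malformed input is outside this sketch's scope. */
static uint32_t utf8_decode(const unsigned char **p)
{
    const unsigned char *s = *p;
    uint32_t c;

    if (s[0] < 0x80) {                      /* 1 byte: U+0000..U+007F */
        c = s[0];
        *p = s + 1;
    } else if (s[0] < 0xE0) {               /* 2 bytes: U+0080..U+07FF */
        c = ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
        *p = s + 2;
    } else if (s[0] < 0xF0) {               /* 3 bytes: U+0800..U+FFFF */
        c = ((uint32_t)(s[0] & 0x0F) << 12) |
            ((uint32_t)(s[1] & 0x3F) << 6) |
            (uint32_t)(s[2] & 0x3F);
        *p = s + 3;
    } else {                                /* 4 bytes: U+10000..U+10FFFF */
        c = ((uint32_t)(s[0] & 0x07) << 18) |
            ((uint32_t)(s[1] & 0x3F) << 12) |
            ((uint32_t)(s[2] & 0x3F) << 6) |
            (uint32_t)(s[3] & 0x3F);
        *p = s + 4;
    }
    return c;
}

/* Map a code point to a key whose numeric order matches UTF-16 code unit
 * order: U+E000..U+FFFF is moved above all supplementary code points,
 * because in UTF-16 those sort after the surrogates 0xD800..0xDFFF. */
static uint32_t utf16_order_key(uint32_t c)
{
    return (c >= 0xE000 && c <= 0xFFFF) ? c + 0x110000 : c;
}

/* Compare two valid UTF-8 strings in UTF-16 code unit order.
 * Returns a negative, zero, or positive value like strcmp. */
static int utf8_cmp_utf16_order(const char *a, const char *b)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;

    while (*pa != 0 && *pb != 0) {
        uint32_t ca = utf8_decode(&pa);
        uint32_t cb = utf8_decode(&pb);
        if (ca != cb) {
            uint32_t ka = utf16_order_key(ca);
            uint32_t kb = utf16_order_key(cb);
            return (ka > kb) - (ka < kb);
        }
    }
    /* One string is a prefix of the other, or they are equal. */
    return (*pa != 0) - (*pb != 0);
}
```

The two orders only disagree when a code point in U+E000..U+FFFF meets one above U+FFFF: strcmp puts U+FF21 before U+1F600, while this comparator puts it after, as UTF-16 code unit comparison would.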

R.. GitHub STOP HELPING ICE answered Oct 01 '22 03:10
