
UTF8 processing in C

Tags:

c

unicode

utf-8

I have a basic understanding of UTF-8: code points are encoded with a variable number of bytes, so a "character" can take 8 bits, 16 bits, or more (up to 32 bits).

What I'm wondering is whether there is some sample code, a library, etc. in C that does for a UTF-8 string what the C standard library does for ordinary strings, e.g. tell the length of the string.

Thanks,

lang2 asked Jun 08 '12 11:06


1 Answer

GNU does have a Unicode string library, called libunistring, but it doesn’t handle anything nearly as well as ICU does.

For example, the GNU library doesn’t even give you access to collation, which is the basis for all string comparison. In contrast, ICU does. Another thing that ICU has and the GNU library doesn’t appear to have is Unicode regexes. For regexes alone, you might like to use Philip Hazel’s excellent PCRE library for C, which can be compiled with UTF-8 support.

However, it might be that the GNU library is enough for what you need. I don’t like its API much. Very messy. If you like C programming, you might try the Go programming language, which has excellent Unicode support. It’s a new language, but small and clean and fun to use.
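For the one operation the question names, the length of a UTF-8 string, a library is not strictly required. What follows is a minimal sketch, not a definitive implementation. It assumes the input is already valid UTF-8 and that "length" means the number of code points, not bytes or user-perceived characters. The name `utf8_codepoint_count` is hypothetical and not part of any library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts the code points in a NUL-terminated string.
   Assumption: the input is already valid UTF-8; nothing is validated here. */
size_t utf8_codepoint_count(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; ++p) {
        /* Continuation bytes have the form 10xxxxxx.
           Every other byte starts a new code point. */
        if ((*p & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

Called on `"h\xC3\xA9llo"` (the six bytes of "héllo"), this counts five code points where `strlen` reports six bytes. Malformed input, combining marks, and grapheme clusters are all outside what this sketch handles, which is where a real library earns its keep.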

On the other hand, the major interpreted languages — Perl, Python, and Ruby — all have varying support for Unicode that is better than you’ll ever get in C. Of those, Perl’s Unicode support is the most developed and robust.

Remember: it isn’t enough to support more characters. Without the rules that go with them, you don’t have Unicode. At most, you might have ISO 10646: a large character repertoire but no rules. My mantra is “Unicode isn’t just more characters; it’s more characters plus a whole bunch of rules for handling them.”
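One way to see why the rules matter is a small sketch using only the C standard library. The two byte strings below are both valid UTF-8 for what a reader sees as "é", yet byte-wise tools treat them as different. The names `precomposed`, `decomposed`, and `bytes_equal` are illustrative assumptions, not part of any library.

```c
#include <assert.h>
#include <string.h>

/* Two valid UTF-8 spellings of what a reader sees as one character, "é". */
const char precomposed[] = "\xC3\xA9";  /* U+00E9 LATIN SMALL LETTER E WITH ACUTE */
const char decomposed[]  = "e\xCC\x81"; /* U+0065 followed by U+0301 COMBINING ACUTE ACCENT */

/* Hypothetical helper: byte-for-byte equality, the only kind strcmp knows. */
int bytes_equal(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}
```

Here `bytes_equal(precomposed, decomposed)` is false, and `strlen` reports 2 bytes for one spelling and 3 for the other, even though Unicode defines the two sequences as canonically equivalent. Deciding that they match takes normalization, one of the rules that a library such as ICU implements and that plain C string functions know nothing about.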

tchrist answered Oct 12 '22 04:10