Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

UTF8 support on cross platform C application

I am developing a cross platform C (C89 standard) application which has to deal with UTF8 text. All I need is basic string manipulation functions like substr, first, last etc.

Question 1

Is there a UTF8 library that has the above functions implemented? I have already looked into ICU and it is too big for my requirement. I just need to support UTF8.

I have found a UTF8 decoder here. Following function prototypes are from that code.

void utf8_decode_init(char p[], int length);

int utf8_decode_next();

The initialization function takes a character array but utf8_decode_next() returns int. Why is that? How can I print the characters this function returns using standard functions like printf? The function is dealing with character data and how can that be assigned to a integer?

If the above decoder is not good for production code, do you have a better recommendation?

Question 2

I also got confused by reading articles that says, for unicode you need to use wchar_t. From my understanding this is not required as normal C strings can hold UTF8 values. I have verified this by looking at source code of SQLite and git. SQLite has the following typedef.

typedef unsigned char u8

Is my understanding correct? Also why is unsigned char required?

like image 806
Navaneeth K N Avatar asked Dec 21 '22 20:12

Navaneeth K N


1 Answers

  1. The utf_decode_next() function returns the next Unicode code point. Since Unicode is a 21-bit character set, it cannot return anything smaller than an int, and it can be argued that technically, it should be a long since an int could be a 16-bit quantity. Effectively, the function returns you a UTF-32 character.

    You would need to look at the C94 wide character extensions to C89 to print wide characters (wprintf(), <wctype.h>, <wchar.h>). However, wide characters alone are not guaranteed to be UTF-8 or even Unicode. You most probably cannot print the characters from utf8_decode_next() portably, but it depends on what your portability requirements are. The wider the range of systems you must port to, the less chance there is of it all working simply. To the extent you can write UTF-8 portably, you would send the UTF-8 string (not an array of the UTF-32 characters obtained from utf8_decode_next()) to one of the regular printing functions. One of the strengths of UTF-8 is that it can be manipulated by code that is largely ignorant of it.

  2. You need to understand that a 4-byte wchar_t can hold any Unicode codepoint in a single unit, but that UTF-8 can require between one and four 8-bit bytes (1-4 units of storage) to hold a single Unicode codepoint. On some systems, I believe wchar_t can be a 16-bit (short) integer. In this case, you are forced into using UTF-16, which encodes Unicode codepoints outside the Basic Multilingual Plane (BMP, code points U+0000 .. U+FFFF) using two storage units and surrogates.

    Using unsigned char makes life easier; plain char is often signed. Having negative numbers makes life more difficult than it need me (and, believe me, it is difficult enough without adding complexity).

like image 98
Jonathan Leffler Avatar answered Jan 01 '23 20:01

Jonathan Leffler