
Correct use of string storage in C and C++

Well-known software developers and companies (Joel Spolsky, Fog Creek Software) tend to use wchar_t for Unicode character storage when writing C or C++ code. When and how should one use char and wchar_t with respect to good coding practice?

I am particularly interested in POSIX compliance when writing software that leverages Unicode.

When using wchar_t, you can look up characters in an array of wide characters on a per-character or per-array-element basis:

/* C code fragment */
const wchar_t *overlord = L"ov€rlord";
if (overlord[2] == L'€')
    wprintf(L"Character comparison on a per-character basis.\n");

How can you compare Unicode characters (or the bytes that encode them) when using char?

So far my preferred way of comparing strings and characters of type char in C often looks like this:

/* C code fragment */
const char *mail[] = { "ov€[email protected]", "ov€[email protected]" };
/* "ov" occupies bytes 0-1, so the 3-byte euro sign occupies bytes 2, 3 and 4 */
if (mail[0][2] == mail[1][2] && mail[0][3] == mail[1][3] && mail[0][4] == mail[1][4])
    printf("%s\n%zu", *mail, strlen(*mail));

This method compares the bytes that encode a Unicode character. In UTF-8 the euro sign takes up 3 bytes, so one needs to compare three char array elements to know whether the Unicode characters match. For the solution to work, you often need to know the position of the character you want to compare and the number of bytes in its encoding. This does not look like a good way of handling Unicode at all. Is there a better way of comparing strings and character elements of type char?
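One possible alternative, as a minimal sketch that assumes both strings hold valid UTF-8: instead of hard-coding three index comparisons, derive the length of the encoded character from its lead byte and compare that many bytes. The helper names utf8_seq_len and utf8_same_char_at are hypothetical and not part of any standard library.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of bytes in the UTF-8 sequence that starts
   with the given lead byte, or 0 if the byte cannot start a sequence. */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;  /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx, e.g. the euro sign */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                             /* continuation or invalid byte */
}

/* Hypothetical helper: returns 1 if a and b hold the same encoded character
   at the given byte offset, 0 otherwise. Assumes valid UTF-8 input. */
int utf8_same_char_at(const char *a, const char *b, size_t offset)
{
    size_t n;

    if (offset >= strlen(a) || offset >= strlen(b))
        return 0;
    n = utf8_seq_len((unsigned char)a[offset]);
    if (n == 0)
        return 0;
    /* strncmp stops at a terminating NUL, so it never reads past either string */
    return strncmp(a + offset, b + offset, n) == 0;
}
```

With two strings that both start with "ov€", utf8_same_char_at(a, b, 2) compares the whole euro sign in one call. To test whole strings for equality, strcmp on the char data should be enough, provided both strings use the same encoding and normalization form.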

In addition, when using wchar_t, how can you read the contents of a file into an array? The function fread does not seem to produce valid results.
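A likely reason fread disappoints here is that it copies raw bytes without decoding them, so the bytes of a multibyte file do not line up with wchar_t elements. A minimal sketch of one alternative, assuming the file is encoded in the multibyte encoding of the current locale: read through the wide-character stream function fgetwc, which decodes one character per call. The helper name read_wide is hypothetical.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: reads up to cap - 1 wide characters from an open
   stream into buf and zero-terminates it. Returns the number of characters
   stored. Decoding follows the LC_CTYPE category of the current locale. */
size_t read_wide(FILE *fp, wchar_t *buf, size_t cap)
{
    size_t n = 0;
    wint_t wc;

    if (cap == 0)
        return 0;
    /* fgetwc returns WEOF at end of file or on a decoding error */
    while (n + 1 < cap && (wc = fgetwc(fp)) != WEOF)
        buf[n++] = (wchar_t)wc;
    buf[n] = L'\0';
    return n;
}
```

For non-ASCII input, the program would need to call setlocale(LC_ALL, "") from <locale.h> once at startup and run under a locale whose encoding matches the file (a UTF-8 locale for a UTF-8 file). The stream should be freshly opened, because a stream that has already been used with byte-oriented functions such as fread or fgets cannot be switched to wide orientation.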

user1254893 asked Mar 18 '12 10:03



1 Answer

If you know that you're dealing with Unicode, neither char nor wchar_t is appropriate, as their sizes and encodings are compiler/platform-defined. For example, wchar_t is 2 bytes on Windows (MSVC), but 4 bytes on Linux (GCC). The C11 and C++11 standards are a bit more rigorous, and define two new character types (char16_t and char32_t) with associated literal prefixes (u8, u, and U) for creating UTF-8, UTF-16, and UTF-32 strings.
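As a minimal sketch of the C11 types and prefixes mentioned above, assuming a compiler and C library that provide <uchar.h>: the same text is stored once as UTF-8 in char and once as UTF-32 in char32_t, where each array element is one code point. The names overlord8, overlord32, u32_len and third_is_euro are hypothetical; \u20AC is the universal character name for the euro sign, used so that the example does not depend on the encoding of the source file.

```c
#include <stddef.h>
#include <uchar.h>

/* UTF-8: 2 + 3 + 5 bytes of text plus the terminating zero = 11 bytes */
const char overlord8[] = u8"ov\u20ACrlord";

/* UTF-32: 8 code points plus the terminating zero = 9 elements */
const char32_t overlord32[] = U"ov\u20ACrlord";

/* Hypothetical helper: number of code points before the terminating zero. */
size_t u32_len(const char32_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Hypothetical helper: per-element comparison works because every element
   of a UTF-32 string is a complete code point. */
int third_is_euro(const char32_t *s)
{
    return u32_len(s) > 2 && s[2] == U'\u20AC';
}
```

The per-element comparison in third_is_euro mirrors the wchar_t fragment in the question, but with a type whose width does not change between Windows and Linux.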

If you need to store and manipulate Unicode characters, you should use a library that is designed for the job, as neither the pre-C11 nor the pre-C++11 language standards were written with Unicode in mind. There are a few to choose from, but ICU is quite popular (and supports C, C++, and Java).

一二三 answered Oct 26 '22 22:10
