Correct use of string storage in C and C++

Question

Popular software developers and companies (Joel Spolsky, Fog Creek Software) tend to use wchar_t for Unicode character storage when writing C or C++ code. When and how should one use char and wchar_t with respect to good coding practices?

I am particularly interested in POSIX compliance when writing software that leverages Unicode.

When using wchar_t, you can look up characters in an array of wide characters on a per-character or per-array-element basis:

/* C code fragment; needs <wchar.h> */
const wchar_t *overlord = L"ov€rlord";
if (overlord[2] == L'€')
    wprintf(L"Character comparison on a per-character basis.\n");

How can you compare Unicode bytes (or characters) when using char?

So far my preferred way of comparing strings and characters of type char in C often looks like this:

/* C code fragment; needs <stdio.h> and <string.h> */
const char *mail[] = { "ov€rlord@masters.lt", "ov€rlord@masters.lt" };
/* In UTF-8 the euro sign occupies bytes 2, 3 and 4 of each string. */
if (mail[0][2] == mail[1][2] && mail[0][3] == mail[1][3] && mail[0][4] == mail[1][4])
    printf("%s\n%zu", *mail, strlen(*mail));

This method scans for the byte equivalent of a Unicode character. In UTF-8 the Euro symbol € takes up 3 bytes, so one needs to compare three char array elements to know whether the Unicode characters match. For this to work you need to know in advance how many bytes the character or string you want to compare encodes to. This does not look like a good way of handling Unicode at all. Is there a better way of comparing strings and character elements of type char?

In addition, when using wchar_t, how can you read the contents of a file into an array? The function fread does not seem to produce valid results.

一二三 · Accepted Answer

If you know that you're dealing with Unicode, neither char nor wchar_t is appropriate, as their sizes are compiler/platform-defined. For example, wchar_t is 2 bytes on Windows (MSVC), but 4 bytes on Linux (GCC). The C11 and C++11 standards are a bit more rigorous, and define two new character types (char16_t and char32_t) with associated literal prefixes for creating UTF-{8, 16, 32} strings.
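As a rough illustration of those prefixes, here is a minimal C11 sketch. It assumes a compiler that provides <uchar.h> and encodes u"" and U"" literals as UTF-16 and UTF-32 (GCC and Clang on Linux do). The array names and the helper third_is_euro are made up for this example, and the euro sign is written as \u20AC so the result does not depend on the encoding of the source file.

```c
#include <assert.h>   /* static_assert (C11) */
#include <uchar.h>    /* char16_t, char32_t (C11) */

/* The same text in the three encodings; names are illustrative only. */
const char     utf8[]  = u8"ov\u20ACrlord";  /* UTF-8 code units  */
const char16_t utf16[] = u"ov\u20ACrlord";   /* UTF-16 code units */
const char32_t utf32[] = U"ov\u20ACrlord";   /* UTF-32 code units */

/* sizeof counts the terminating null, so subtract one element.
   The euro sign needs 3 code units in UTF-8 but only 1 in UTF-16/32. */
static_assert(sizeof utf8 - 1 == 10, "8 characters, 10 UTF-8 bytes");
static_assert(sizeof utf16 / sizeof utf16[0] - 1 == 8, "8 UTF-16 units");
static_assert(sizeof utf32 / sizeof utf32[0] - 1 == 8, "8 UTF-32 units");

/* With UTF-32, one array element is one code point, so the comparison
   from the question works on a single element. Returns 1 on a match. */
int third_is_euro(void)
{
    return utf32[2] == U'\u20AC';
}
```

Note that in UTF-16 a character outside the Basic Multilingual Plane still takes two elements (a surrogate pair), so only char32_t gives one element per code point.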

If you need to store and manipulate Unicode characters, you should use a library that is designed for the job, as neither the pre-C11 nor the pre-C++11 language standard was written with Unicode in mind. There are a few to choose from, but ICU is quite popular (and supports C, C++, and Java).
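To give an idea of the work such a library takes off your hands, here is a hand-written sketch of decoding UTF-8 held in char into char32_t code points, so that a character such as € can be compared as a single element. This is not ICU's API: utf8_decode and utf8_to_utf32 are hypothetical helpers written for this answer, and the input is assumed to be a null-terminated UTF-8 string.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>   /* char32_t (C11) */

/* Hypothetical helper: decodes one UTF-8 sequence from s (n bytes left).
   Stores the code point in *out and returns the bytes consumed,
   or 0 if the input is malformed or truncated. */
size_t utf8_decode(const unsigned char *s, size_t n, char32_t *out)
{
    /* Smallest code point that legitimately needs each length. */
    static const char32_t min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    char32_t cp;

    if (n == 0) return 0;
    if (s[0] < 0x80)                { len = 1; cp = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; }
    else return 0;                  /* stray continuation or bad lead byte */

    if (len > n) return 0;          /* sequence cut off */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates and values past U+10FFFF. */
    if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    *out = cp;
    return len;
}

/* Hypothetical helper: decodes a whole string into dst (capacity cap).
   Returns the number of code points written, or (size_t)-1 on error. */
size_t utf8_to_utf32(const char *src, char32_t *dst, size_t cap)
{
    const unsigned char *p = (const unsigned char *)src;
    size_t left = strlen(src), count = 0;

    while (left > 0) {
        char32_t cp;
        size_t used = utf8_decode(p, left, &cp);
        if (used == 0 || count == cap) return (size_t)-1;
        dst[count++] = cp;
        p += used;
        left -= used;
    }
    return count;
}
```

After decoding "ov€rlord@masters.lt" this way, element 2 of the output equals 0x20AC, so the three-byte comparison from the question becomes a single comparison. Bytes read from a UTF-8 file with fread could be passed through the same routine. The sketch only handles decoding; it does nothing about normalization or characters built from several code points, which is where a library like ICU earns its keep.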

Tags:

c++

c

posix

character-encoding

unicode

user1254893

