
What encoding does c32rtomb convert to?

Tags: c++, c++11, unicode

The functions c32rtomb and mbrtoc32 from <cuchar>/<uchar.h> are described in the C Unicode TR (draft) as performing conversions between UTF-32¹ and "multibyte characters".

(...) If s is not a null pointer, the c32rtomb function determines the number of bytes needed to represent the multibyte character that corresponds to the wide character given by c32 (including any shift sequences), and stores the multibyte character representation in the array whose first element is pointed to by s. (...)

What is this "multibyte character representation"? I'm actually interested in the behaviour of the following program:

#include <cassert>
#include <cuchar>
#include <string>

int main() {
    std::u32string u32 = U"this is a wide string";
    std::string narrow  = "this is a wide string";
    std::string converted(1000, '\0');
    char* ptr = &converted[0];
    std::mbstate_t state {};
    for(auto u : u32) {
        ptr += std::c32rtomb(ptr, u, &state);
    }
    converted.resize(ptr - &converted[0]);
    assert(converted == narrow);
}

Is the assertion in it guaranteed to hold¹?


¹ Working under the assumption that __STDC_UTF_32__ is defined.

Asked by R. Martinho Fernandes, Oct 24 '12 at 08:10


2 Answers

For the assertion to be guaranteed to hold, the multibyte encoding used by c32rtomb() must be the same as the encoding used for string literals, at least for the characters actually used in the string.

C99 7.11.1.1/2 specifies that setlocale() with the category LC_CTYPE affects the behavior of the character handling functions and the multibyte and wide character functions. I don't see any explicit statement that the effect is to set the multibyte and wide character encodings used, but that is the intent.

Since the program never calls setlocale(), the multibyte encoding used by c32rtomb() is the multibyte encoding of the default "C" locale.

C++11 2.14.3/2 specifies that the execution encoding, wide execution encoding, UTF-16, and UTF-32 are used for the corresponding character and string literals. Therefore std::string narrow uses the execution encoding to represent that string.

So is the "C" locale encoding of this string the same as the execution encoding of this string?

C99 7.11.1.1/3 specifies that the "C" locale provides "the minimal environment" for C translation. Such an environment would include not only character sets, but also the specific character codes used. So I believe this means not only that the "C" locale must support the characters required in translation (i.e., the basic character set), but additionally that those characters in the "C" locale must use the same character codes.

All of the characters in your string literals are members of the basic character set, and therefore converting the char32_t representation to the char "C" locale representation must produce the same sequence of values as the compiler produces for the char string literal; the assertion must hold true.

I don't see any suggestion that anything beyond the basic character set is supported in a compatible way between the execution encoding and the "C" locale, so if your string literal used any characters outside the basic character set then there would not be any guarantee that the assertion would hold. Even stipulating extended characters that exist in both the execution character set and the "C" locale, I don't see any requirement that the representations match each other.
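A minimal sketch of the comparison discussed above, with the error handling the question's program omits: c32rtomb() returns (size_t)-1 when a code point has no representation in the current locale's multibyte encoding, so adding the return value to a pointer unchecked is unsafe. The helper name narrow_from_utf32 is hypothetical and not part of any library.

```cpp
#include <climits>   // MB_LEN_MAX
#include <cstddef>   // std::size_t
#include <cuchar>    // std::c32rtomb
#include <cwchar>    // std::mbstate_t
#include <string>

// Hypothetical helper (not from the question): converts UTF-32 to the
// multibyte encoding of the current C locale. Returns false if any code
// point cannot be represented in that encoding.
bool narrow_from_utf32(const std::u32string& in, std::string& out) {
    out.clear();
    std::mbstate_t state{};
    char buf[MB_LEN_MAX];  // upper bound on MB_CUR_MAX for every locale
    for (char32_t c : in) {
        std::size_t n = std::c32rtomb(buf, c, &state);
        if (n == static_cast<std::size_t>(-1)) {
            return false;  // encoding error; errno is set to EILSEQ
        }
        out.append(buf, n);
    }
    return true;
}
```

Called without a prior setlocale(), narrow_from_utf32(U"this is a wide string", s) is the case the reasoning above covers: every character is in the basic character set, so s should compare equal to the narrow literal.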

Answered by bames53, Nov 10 '22 at 11:11


The TR linked in the question says

At most MB_CUR_MAX bytes are stored.

where MB_CUR_MAX is defined (in C99) as

a positive integer expression with type size_t that is the maximum number of bytes in a multibyte character for the extended character set specified by the current locale

I believe this is sufficient evidence that the intent of the TR was to produce the multibyte characters as defined by the currently installed C locale: UTF-8 for en_US.utf8, GB18030 for zh_CN.gb18030, etc.
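The locale dependence can be observed with a sketch like the following, which encodes one code point under a named locale. The helper encode_in_locale is hypothetical, and locale names such as "en_US.utf8" are platform-specific assumptions; the function reports failure when the locale is not installed.

```cpp
#include <climits>   // MB_LEN_MAX
#include <clocale>   // std::setlocale, LC_CTYPE
#include <cstddef>   // std::size_t
#include <cuchar>    // std::c32rtomb
#include <cwchar>    // std::mbstate_t
#include <string>

// Hypothetical helper: encodes one UTF-32 code point using the multibyte
// encoding of the named locale, then restores the previous LC_CTYPE locale.
// Returns false if the locale is unavailable or the code point cannot be
// represented in it.
bool encode_in_locale(const char* locale_name, char32_t c, std::string& out) {
    // Copy the current locale name; the returned pointer may be
    // invalidated by the next setlocale() call.
    const char* current = std::setlocale(LC_CTYPE, nullptr);
    std::string saved = current ? current : "C";

    if (std::setlocale(LC_CTYPE, locale_name) == nullptr) {
        return false;  // locale not installed on this system
    }

    char buf[MB_LEN_MAX];
    std::mbstate_t state{};
    std::size_t n = std::c32rtomb(buf, c, &state);

    std::setlocale(LC_CTYPE, saved.c_str());  // restore previous locale

    if (n == static_cast<std::size_t>(-1)) {
        return false;  // not representable in this locale's encoding
    }
    out.assign(buf, n);
    return true;
}
```

On a system where a UTF-8 locale named "en_US.utf8" exists, encoding U+20AC this way would be expected to yield the three bytes E2 82 AC, while the same call under "C" may simply fail because that locale need not support characters beyond the basic character set.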

Answered by Cubbi, Nov 10 '22 at 10:11