The functions c32rtomb and mbrtoc32 from <cuchar>/<uchar.h> are described in the C Unicode TR (draft) as performing conversions between UTF-32 [1] and "multibyte characters".
(...) If s is not a null pointer, the c32rtomb function determines the number of bytes needed to represent the multibyte character that corresponds to the wide character given by c32 (including any shift sequences), and stores the multibyte character representation in the array whose first element is pointed to by s. (...)
What is this "multibyte character representation"? I'm actually interested in the behaviour of the following program:
#include <cassert>
#include <cuchar>
#include <string>

int main() {
    std::u32string u32 = U"this is a wide string";
    std::string narrow = "this is a wide string";
    std::string converted(1000, '\0');
    char* ptr = &converted[0];
    std::mbstate_t state {};
    for(auto u : u32) {
        ptr += std::c32rtomb(ptr, u, &state);
    }
    converted.resize(ptr - &converted[0]);
    assert(converted == narrow);
}
Is the assertion in it guaranteed to hold? [1]
[1] Working under the assumption that __STDC_UTF_32__ is defined.
For the assertion to be guaranteed to hold, the multibyte encoding used by c32rtomb() must be the same as the encoding used for string literals, at least for the characters actually used in the string.
C99 7.11.1.1/2 specifies that setlocale() with the category LC_CTYPE affects the behavior of the character handling functions and the multibyte and wide character functions. I don't see any explicit acknowledgement that the effect is to set the multibyte and wide character encodings used; however, that is the intent.
The program never calls setlocale(), so the multibyte encoding used by c32rtomb() is the multibyte encoding of the default "C" locale.
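To make the locale dependence concrete, here is a minimal sketch. The helper name encode_in_locale is hypothetical (it is not part of any standard header), and the locale name "en_US.utf8" used in the note afterwards is an assumption about what is installed on a given system. The helper switches LC_CTYPE, encodes a single code point with c32rtomb, and restores the previous locale. It returns an empty string when the locale is unavailable or the character cannot be encoded.

```cpp
#include <climits>   // MB_LEN_MAX
#include <clocale>   // std::setlocale, LC_CTYPE
#include <cstddef>   // std::size_t
#include <cuchar>    // std::c32rtomb
#include <cwchar>    // std::mbstate_t
#include <string>

// Hypothetical helper: encode one code point using the multibyte encoding
// of the named locale. Returns an empty string if the locale cannot be
// selected or the character has no representation in it.
std::string encode_in_locale(const char* locale_name, char32_t c) {
    // Remember the current LC_CTYPE setting so it can be restored.
    const char* current = std::setlocale(LC_CTYPE, nullptr);
    const std::string saved = current ? current : "C";

    if (std::setlocale(LC_CTYPE, locale_name) == nullptr) {
        return std::string();  // locale not available on this system
    }

    char buf[MB_LEN_MAX];      // large enough for any supported locale
    std::mbstate_t state{};    // initial conversion state
    const std::size_t n = std::c32rtomb(buf, c, &state);

    std::setlocale(LC_CTYPE, saved.c_str());

    if (n == static_cast<std::size_t>(-1)) {
        return std::string();  // encoding error in this locale
    }
    return std::string(buf, n);
}
```

Calling encode_in_locale("C", U'a') should give the single byte for 'a'. On a system where a UTF-8 locale such as "en_US.utf8" is installed, encode_in_locale("en_US.utf8", U'\u00e9') would be expected to give the two bytes 0xC3 0xA9. Where that locale is missing, the sketch returns an empty string instead.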
C++11 2.14.3/2 specifies that the execution encoding, wide execution encoding, UTF-16, and UTF-32 are used for the corresponding character and string literals. Therefore std::string narrow uses the execution encoding to represent that string.
So is the "C" locale encoding of this string the same as the execution encoding of this string?
C99 7.11.1.1/3 specifies that the "C" locale provides "the minimal environment" for C translation. Such an environment would include not only character sets, but also the specific character codes used. So I believe this means not only that the "C" locale must support the characters required in translation (i.e., the basic character set), but additionally that those characters in the "C" locale must use the same character codes.
All of the characters in your string literals are members of the basic character set, and therefore converting the char32_t representation to the char "C" locale representation must produce the same sequence of values as the compiler produces for the char string literal; the assertion must hold true.
I don't see any suggestion that anything beyond the basic character set is supported in a compatible way between the execution encoding and the "C" locale, so if your string literal used any characters outside the basic character set then there would not be any guarantee that the assertion would hold. Even stipulating extended characters that exist in both the execution character set and the "C" locale, I don't see any requirement that the representations match each other.
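One practical consequence: c32rtomb returns (size_t)-1 when a character cannot be represented in the current locale, and the program in the question adds that return value to the pointer unconditionally. A sketch of a checked conversion follows. The name to_multibyte is hypothetical, and throwing std::runtime_error is just one possible way to report the failure.

```cpp
#include <climits>    // MB_LEN_MAX
#include <cstddef>    // std::size_t
#include <cuchar>     // std::c32rtomb
#include <cwchar>     // std::mbstate_t
#include <stdexcept>  // std::runtime_error
#include <string>

// Hypothetical helper: convert a UTF-32 string to the multibyte encoding
// of the current LC_CTYPE locale, one code point at a time.
// Throws if any character cannot be represented in that encoding.
std::string to_multibyte(const std::u32string& in) {
    std::string out;
    std::mbstate_t state{};   // initial conversion state
    char buf[MB_LEN_MAX];     // scratch space for one multibyte character

    for (char32_t c : in) {
        const std::size_t n = std::c32rtomb(buf, c, &state);
        if (n == static_cast<std::size_t>(-1)) {
            throw std::runtime_error(
                "character not representable in the current locale");
        }
        out.append(buf, n);
    }
    return out;
}
```

With the default "C" locale and a string drawn only from the basic character set, to_multibyte(U"this is a wide string") should compare equal to the narrow literal, by the argument above. For characters outside the basic character set, the result (or the exception) depends on the locale and the implementation.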
The TR linked in the question says:
"At most MB_CUR_MAX bytes are stored."
MB_CUR_MAX is defined (in C99) as:
"a positive integer expression with type size_t that is the maximum number of bytes in a multibyte character for the extended character set specified by the current locale"
I believe this is sufficient evidence that the intent of the TR was to produce the multibyte characters as defined by the currently installed C locale: UTF-8 for en_US.utf8, GB18030 for zh_CN.gb18030, etc.
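Since MB_CUR_MAX is tied to the current locale, its value can be observed changing as the locale changes. The following is a small sketch. The helper name mb_cur_max_in is hypothetical, and which locale names are installed varies from system to system. It returns 0 when the requested locale cannot be selected.

```cpp
#include <climits>   // MB_LEN_MAX
#include <clocale>   // std::setlocale, LC_CTYPE
#include <cstddef>   // std::size_t
#include <cstdlib>   // MB_CUR_MAX
#include <string>

// Hypothetical helper: report MB_CUR_MAX as seen under the named locale.
// Returns 0 if the locale cannot be selected on this system.
std::size_t mb_cur_max_in(const char* locale_name) {
    // Remember the current LC_CTYPE setting so it can be restored.
    const char* current = std::setlocale(LC_CTYPE, nullptr);
    const std::string saved = current ? current : "C";

    if (std::setlocale(LC_CTYPE, locale_name) == nullptr) {
        return 0;
    }
    // MB_CUR_MAX is evaluated at run time against the current locale.
    const std::size_t result = MB_CUR_MAX;

    std::setlocale(LC_CTYPE, saved.c_str());
    return result;
}
```

Comparing mb_cur_max_in("C") with the value for a UTF-8 or GB18030 locale, where one is installed, shows the bound that c32rtomb works under in each case. The compile-time constant MB_LEN_MAX is an upper limit across all supported locales, which is why it is a safe size for a fixed output buffer.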