Update 2022 Jul 28
Two years on, the char8_t definition (or lack of one) is now called a "C++20 defect" and there is a rush to fix it. Finally.
Update 2020 Aug 25
The question seems somewhat irrelevant in the light of this:
// GCC 10.2, clang 10.0.1 -std=c++20
int main(int argc, char ** argv)
{
    char32_t single_glyph_32 = U'ア' ;
    char16_t single_glyph_16 = u'ア' ;
    // gcc: error: character constant too long for its type
    // clang: error: character too large for enclosing character literal type
    char8_t single_glyph_8 = u8'ア' ;
    return 42;
}
A single char8_t seems capable of holding just a tiny portion of UTF-8 glyphs: only those that encode to one code unit. Thus there is not much point in using it or trying to printf it.
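To illustrate why the compilers reject the literal, here is a minimal sketch that counts the UTF-8 code units in a u8 string literal. The helper name code_units is hypothetical (it is not a standard facility), and it is written as a template on the assumption that it should compile whether u8 literals are arrays of char (before C++20) or of char8_t (C++20).

```cpp
#include <cstddef>

// Hypothetical helper: number of UTF-8 code units in a string literal,
// not counting the terminating zero. Templated on the element type so it
// accepts both const char[N] (pre-C++20) and const char8_t[N] (C++20).
template <class CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    static_assert(sizeof(CharT) == 1, "expects a one-byte code unit type");
    return N - 1;
}

// 'A' fits in one code unit, so u8'A' is a valid character literal.
static_assert(code_units(u8"A") == 1, "ASCII needs one code unit");

// U+30A2 (the katakana letter used above) needs three code units,
// so it cannot be stored in a single one-byte code unit.
static_assert(code_units(u8"\u30A2") == 3, "U+30A2 needs three code units");
```

The universal character name \u30A2 is used instead of the glyph itself so that the result does not depend on the source file encoding.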
Asked Nov 15 '19 at 14:04
And also for char8_t?
I assume there is some C++20 decision somewhere, but I could not find it.
There is also P1428, but that document does not mention anything about the printf() family vs. char8_t * or char8_t.
The "use std::cout" advice might be an answer. Unfortunately, that no longer compiles.
// does not compile under C++20
// error : overload resolution selected deleted operator '<<'
// see P1423, proposal 7
std::cout << u8"A2";
std::cout << char8_t ('A');
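One possible workaround, sketched here under assumptions rather than taken from any of the papers mentioned, is to copy the UTF-8 bytes into a std::string and stream that instead. The helper name to_narrow is hypothetical. No transcoding takes place, so whether the output displays correctly depends on the encoding the receiving side expects.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: copy a zero-terminated sequence of one-byte code
// units into a std::string, byte for byte. Nothing is transcoded.
// Templated so it accepts const char* (pre-C++20 u8 literals) as well as
// const char8_t* (C++20 u8 literals).
template <class CharT>
std::string to_narrow(const CharT* text) {
    static_assert(sizeof(CharT) == 1, "expects a one-byte code unit type");
    assert(text != nullptr);
    // Reading any object through const char* is permitted by the aliasing rules.
    return std::string(reinterpret_cast<const char*>(text));
}
```

With this in place, std::cout << to_narrow(u8"A2"); compiles under C++20, because the argument is an ordinary std::string rather than a char8_t sequence.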
For C 2.x and char8_t, please start from here.
Update
I have done some more tests with a single element from a u8 sequence, and that indeed does not work. Passing char8_t * to printf("%s") does work, but passing char8_t to printf("%c") is an accident waiting to happen.
Please see https://wandbox.org/permlink/6NQtkKeZ9JUFw4Sd. The problem is that, as per the current status quo, char8_t is not implemented, while char8_t * is. Let me repeat: there is no implemented type to hold a single element from a char8_t * sequence.
If you want a single u8 glyph, you need to code it as a u8 string:
char8_t const * single_glyph = u8"ア";
And it seems that, at present, the closest thing to a sure way of printing the above is:
// works with warnings
std::printf("%s", single_glyph ) ;
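The warning comes from the mismatch between %s, which expects a char *, and the char8_t const * argument. A minimal sketch of one way to make the conversion explicit follows. The wrapper name print_u8 is hypothetical, and the sketch assumes that passing the raw UTF-8 bytes through unchanged is acceptable on the target system.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical wrapper: print a zero-terminated UTF-8 sequence with %s by
// viewing its code units as plain char. Returns whatever printf returns,
// which on success is the number of bytes written. Nothing is transcoded.
template <class CharT>
int print_u8(const CharT* text) {
    static_assert(sizeof(CharT) == 1, "expects a one-byte code unit type");
    assert(text != nullptr);
    // The explicit cast gives %s an argument of the type it expects.
    return std::printf("%s", reinterpret_cast<const char*>(text));
}
```

For example, print_u8(single_glyph) writes the three bytes of the glyph. Whether they show up as ア depends on the encoding the console or terminal is configured for.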
To start reading on this subject, these two papers are probably required, in that order.
My primary development environment is Visual Studio 2019, with both MSVC and Clang 8.0.1 as delivered with VS, using /std:c++latest. The dev machine is Windows 10 [Version 10.0.18362.476].
The /Zc:char8_t compiler option enables the char8_t type keyword as specified in the C++20 standard. It causes the compiler to generate u8 prefixed character or string literals as const char8_t or const char8_t[N] types, respectively, instead of as const char or const char[N] types.
The %s means, "insert the first argument, a string, right here." The %d indicates that the second argument (an integer) should be placed there. There are different %-codes for different variable types, as well as options to limit the length of the variables and whatnot.
%s tells printf that the corresponding argument is to be treated as a string (in C terms, a 0-terminated sequence of char); the type of the corresponding argument must be char *. %d tells printf that the corresponding argument is to be treated as an integer value; the type of the corresponding argument must be int.
% indicates a format escape sequence used for formatting the variables passed to printf(). So you have to escape it to print the % character.
I'm the author of the char8_t P0482 and P1423 proposals for C++ (accepted for C++20) and the N2653 proposal for C (accepted for C23).
Let's think about what the following should do:
printf("Hello %s\n", u8"Jöel");
std::cout << "Hello " << u8"Jöel" << "\n";
Actually, let's take a further step back. What encoding is expected on the receiver side of standard output? There are a few possibilities. If standard out is connected to a console/terminal, then the expected encoding is the one that the console/terminal is configured for. On a Windows system in the United States, this is likely to be CP437. On a UNIX/Linux system, this is likely UTF-8. On a z/OS system in the United States, this is likely EBCDIC code page 037. If standard out has been redirected, then the expected encoding is likely locale dependent. On a Windows system in the United States, that would mean the Active Code Page (ACP), likely Windows 1252. On UNIX/Linux and z/OS, it would likely be the same as the console/terminal (Windows is the odd system here that has different defaults for console encoding vs locale encoding).
Back to that example code. What is the expected or desired behavior for that UTF-8 encoded ö character (U+00F6, {LATIN SMALL LETTER O WITH DIAERESIS}, encoded as 0xC3 0xB6)? For Windows writing to the console, for the character to display properly, the encoded sequence would need to be transcoded to 0x94, while for Windows where locale dependent output is expected, it would need to be transcoded to 0xF6. For UNIX/Linux, the sequence should probably be passed through. For z/OS, it may need to be transcoded to 0xCC. But on all of these systems, these defaults are configurable (e.g., via the LANG environment variable).
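To make the transcoding step concrete, here is a small sketch of the Windows-1252 case only. Both function names are hypothetical. The sketch relies on the fact that Windows-1252 coincides with Unicode for U+0000..U+007F and U+00A0..U+00FF. It does not attempt CP437 or EBCDIC code page 037, which would need real lookup tables (0x94 and 0xCC are the values quoted above for ö).

```cpp
// Decode a well-formed two-byte UTF-8 sequence (110xxxxx 10xxxxxx) into a
// code point. No validation is performed; the caller is assumed to have
// checked the byte patterns already.
constexpr char32_t decode_two_byte_utf8(unsigned char lead, unsigned char cont) {
    return static_cast<char32_t>(((lead & 0x1Fu) << 6) | (cont & 0x3Fu));
}

// Map a code point to its Windows-1252 byte for the ranges where the two
// encodings coincide. Returns -1 for everything else, where a real
// implementation would consult a table or apply an error policy.
constexpr int to_windows_1252(char32_t cp) {
    return (cp < 0x80 || (cp >= 0xA0 && cp <= 0xFF)) ? static_cast<int>(cp) : -1;
}

// The ö from the example: 0xC3 0xB6 decodes to U+00F6, which is byte 0xF6
// in Windows-1252.
static_assert(decode_two_byte_utf8(0xC3, 0xB6) == 0x00F6, "decodes to U+00F6");
static_assert(to_windows_1252(0x00F6) == 0xF6, "same value in Windows-1252");
```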
Assuming that transcoding to a run-time determined encoding is the desired behavior, how should transcoding errors be handled? For example, what should happen if the target encoding lacks representation for ö? What if an ill-formed UTF-8 sequence is present? Should printf stop and report an error? Should std::cout throw an exception? Or should an implementation defined character such as U+FFFD {REPLACEMENT CHARACTER} or ? be substituted?
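As an illustration of just one of these policies, substitution with ?, here is a rough sketch that converts UTF-8 to ASCII. The function name to_ascii_lossy is hypothetical and this is not a proposed design. It treats unrepresentable code points and ill-formed bytes the same way, and it does not detect overlong encodings or surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical policy: copy ASCII through unchanged and emit one '?' for
// every non-ASCII code point or ill-formed byte.
std::string to_ascii_lossy(const std::string& utf8) {
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        if (lead < 0x80) {                  // ASCII: representable as is
            out.push_back(static_cast<char>(lead));
            ++i;
            continue;
        }
        // Number of continuation bytes announced by the lead byte.
        // Zero means a stray continuation byte or an invalid lead byte.
        std::size_t extra = 0;
        if ((lead & 0xE0) == 0xC0) extra = 1;
        else if ((lead & 0xF0) == 0xE0) extra = 2;
        else if ((lead & 0xF8) == 0xF0) extra = 3;
        // Consume the continuation bytes that are actually present.
        std::size_t consumed = 1;
        while (consumed <= extra && i + consumed < utf8.size() &&
               (static_cast<unsigned char>(utf8[i + consumed]) & 0xC0) == 0x80) {
            ++consumed;
        }
        out.push_back('?');                 // substitution policy
        i += consumed;
    }
    assert(out.size() <= utf8.size());      // never longer than the input
    return out;
}
```

The other policies raised above (stopping with an error, throwing an exception) would replace the push_back('?') line. The sketch only shows that a choice has to be made somewhere.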
What should happen if std::cout is imbued with a std::codecvt facet? Presumably that facet will expect incoming text to be in a particular encoding. Should UTF-8 text be transcoded to one of the execution character set, the locale dependent encoding, or the console/terminal encoding before being presented to the facet? If so, which one? Should the implementation have to be aware of whether the stream is connected to a console/terminal? What if the programmer wants to override the default and, for example, always write UTF-8?
These are rather difficult questions that we don't have good answers for. std::u8out has been suggested as a way to explicitly opt in to UTF-8, but it doesn't solve the problems of expected standard output encoding, issues with codecvt facets, and other iostreams problems like implicit locale dependent formatting.
Personally, in order to provide good Unicode support going forward, I think we're going to have to invest in a replacement for iostreams that 1) provides byte output with text support layered on top, 2) is encoding aware (in the text layer), 3) is locale independent (but with explicit opt-in support for locale dependent formatting like that provided by std::format), 4) is more performant than iostreams.
SG16 would like to hear your thoughts and suggestions. See https://github.com/sg16-unicode/sg16 for contact information.
EDIT: As of 2022-05-22, there is a paper, N2983, making its way through WG14 that seeks to add length modifiers to the formatted I/O functions for char8_t, char16_t, and char32_t characters and strings.
What is the printf() formatting character for char8_t *?
There is no format specifier that will print char8_t* as a string. Using %s is technically undefined behavior because of a type mismatch, and clang will warn you about it (https://godbolt.org/z/xcs9Wj):
printf("%s", u8"Привет, мир!");
...: warning: format specifies type 'char *' but the argument has type 'const char8_t *' [-Wformat]
  printf("%s", u8"Привет, мир!");
          ~~   ^~~~~~~~~~~~~~~~
          %s
So the only thing you can do is to print such a string as a pointer with %p, which is not very useful.
iostreams don't work with char8_t strings either. For example, this doesn't compile in C++20:
std::cout << u8"Привет, мир!";
On most platforms normal char strings are already UTF-8, and on Windows with MSVC you can compile with /utf-8, which will give you Unicode support on major operating systems.
For portable Unicode output you can use the {fmt} library, for example (https://godbolt.org/z/3ejsaG):
#include <fmt/core.h>

int main() {
    fmt::print("Привет, мир!");
}
prints:
Привет, мир!
Disclaimer: I'm the author of {fmt}.
printf is not defined by C++20 itself; C++20 includes the C standard library by reference. It will likely reference C18, but that's substantially equal to C11 (no new features; just fixes for defect reports).