
What is the printf() formatting character for char8_t *?

Tags:

c++

utf-8

c++20

Update 2022 Jul 28

P2513R4, char8_t Compatibility and Portability Fix, Draft Proposal, 2022-06-17

Two years later, the char8_t definition (or lack of one) is now called a "C++20 defect", and there is a rush to fix it. Finally.

Update 2020 Aug 25

The question seems somewhat irrelevant in the light of this:

// GCC 10.2, clang 10.0.1  -std=c++20

int main(int argc, char ** argv) 
{
    char32_t single_glyph_32 = U'ア' ;
    char16_t single_glyph_16 = u'ア' ;
    // gcc:   error: character constant too long for its type
    // clang: error: character too large for enclosing character literal type
    char8_t single_glyph_8 = u8'ア' ;

    return 42;
}

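char8_t seems capable of holding just a tiny portion of UTF-8 glyphs: a single char8_t is one UTF-8 code unit, so a u8 character literal can only represent code points that encode as one code unit (U+0000 to U+007F). Thus there is not much point in using it or trying to printf it.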
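The compiler errors above follow from how UTF-8 works: only code points up to U+007F fit in a single 8-bit code unit, and ア (U+30A2) needs three. As a rough sketch (the helper name `utf8_code_units` is invented for illustration), the number of code units a code point needs can be computed like this:

```cpp
#include <cstddef>

// Hypothetical helper: number of UTF-8 code units (bytes) needed to encode
// one Unicode code point. Returns 0 for values that cannot be encoded
// (surrogates and anything above U+10FFFF).
constexpr std::size_t utf8_code_units(char32_t cp) {
    if (cp <= 0x7F) return 1;                     // ASCII: fits in one char8_t
    if (cp <= 0x7FF) return 2;                    // e.g. U+00F6
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   // surrogates are not scalar values
    if (cp <= 0xFFFF) return 3;                   // e.g. U+30A2
    if (cp <= 0x10FFFF) return 4;
    return 0;
}

static_assert(utf8_code_units(U'A') == 1, "ASCII is one code unit");
static_assert(utf8_code_units(U'\u30A2') == 3, "U+30A2 needs three code units");
```

Any code point for which this returns more than 1 cannot be written as a `u8'...'` character literal and has to be stored in a `u8"..."` string instead.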

Asked Nov 15 '19 at 14:04

And also for char8_t?

I assume there is some C++20 decision, somewhere, but I could not find it. There is also P1428, but that document does not mention anything about the printf() family vs. char8_t * or char8_t.

The usual "use std::cout" advice might be an answer. Unfortunately, that no longer compiles.

// does not compile under C++20
// error : overload resolution selected deleted operator '<<'
// see P1423, proposal 7
std::cout <<  u8"A2";
std::cout <<  char8_t ('A');

For C 2.x and char8_t

Please start from here.

Update

I have done some more tests with a single element from a u8 sequence. And that indeed does not work. char8_t * to printf("%s") does work, but char8_t to printf("%c") is an accident waiting to happen.

Please see https://wandbox.org/permlink/6NQtkKeZ9JUFw4Sd. The problem is that, as things currently stand, char8_t is not implemented, while char8_t * is. Let me repeat: there is no implemented type to hold a single element from a char8_t * sequence.

If you want a single u8 glyph, you need to code it as a u8 string:

char8_t const * single_glyph = u8"ア";

At present, the closest thing to a reliable way to print the above seems to be:

// works with warnings
std::printf("%s", single_glyph ) ;

To start reading on this subject, these two papers are probably required:

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2231.htm
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1423r2.html

In that order.

My primary development environment is Visual Studio 2019, with both MSVC and Clang 8.0.1 as delivered with VS, using /std:c++latest. The dev machine runs Windows 10 [Version 10.0.18362.476].

961

asked Nov 15 '19 14:11

Chef Gladiator

3 Answers

I'm the author of the char8_t P0482 and P1423 proposals for C++ (accepted for C++20) and the N2653 proposal for C (accepted for C23).

Let's think about what the following should do:

printf("Hello %s\n", u8"Jöel");
std::cout << "Hello " << u8"Jöel" << "\n";

Actually, let's take a further step back. What encoding is expected on the receiver side of standard output? There are a few possibilities. If standard out is connected to a console/terminal, then the expected encoding is the one that the console/terminal is configured for. On a Windows system in the United States, this is likely to be CP437. On a UNIX/Linux system, this is likely UTF-8. On a z/OS system in the United States, this is likely EBCDIC code page 037. If standard out has been redirected, then the expected encoding is likely locale dependent. On a Windows system in the United States, that would mean the Active Code Page (ACP), likely Windows 1252. On UNIX/Linux and z/OS, it would likely be the same as the console/terminal (Windows is the odd system here that has different defaults for console encoding vs locale encoding).

Back to that example code. What is the expected or desired behavior for that UTF-8 encoded ö character (U+00F6, {LATIN SMALL LETTER O WITH DIAERESIS}, encoded as 0xC3 0xB6)? For Windows writing to the console, for the character to display properly, the encoded sequence would need to be transcoded to 0x94 while for Windows where locale dependent output is expected, it would need to be transcoded to 0xF6. For UNIX/Linux, the sequence should probably be passed through. For z/OS, it may need to be transcoded to 0xCC. But on all of these systems, these defaults are configurable (e.g., via the LANG environment variable).
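To make the byte values above concrete, here is a small sketch of the decode-then-re-encode step for this one character. The function and enum names are invented for illustration, the decoder handles only two-byte sequences, and the target mappings cover only ö using the values quoted above. A real program would rely on a full conversion library.

```cpp
#include <cstdint>
#include <optional>

// Decodes a two-byte UTF-8 sequence (U+0080..U+07FF). Returns no value if
// the bytes are not a well-formed two-byte sequence.
constexpr std::optional<char32_t> decode_two_byte_utf8(std::uint8_t lead,
                                                       std::uint8_t trail) {
    const bool lead_ok  = lead >= 0xC2 && lead <= 0xDF;  // 0xC0 and 0xC1 would be overlong
    const bool trail_ok = (trail & 0xC0) == 0x80;        // continuation byte 10xxxxxx
    if (!lead_ok || !trail_ok) return std::nullopt;
    return static_cast<char32_t>(((lead & 0x1F) << 6) | (trail & 0x3F));
}

enum class Target { windows_1252, cp437, ebcdic_037 };

// Byte for U+00F6 in each target encoding. The sketch covers this single
// character only and returns no value for anything else.
constexpr std::optional<std::uint8_t> encode_o_diaeresis(char32_t cp, Target t) {
    if (cp != 0x00F6) return std::nullopt;
    switch (t) {
        case Target::windows_1252: return std::uint8_t{0xF6};
        case Target::cp437:        return std::uint8_t{0x94};
        case Target::ebcdic_037:   return std::uint8_t{0xCC};
    }
    return std::nullopt;
}
```

Decoding 0xC3 0xB6 yields U+00F6, and the second function then picks a different single byte depending on where the output is going, which is exactly the decision `printf` or `std::cout` would have to make at run time.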

Assuming that transcoding to a run-time determined encoding is the desired behavior, how should transcoding errors be handled? For example, what should happen if the target encoding lacks representation for ö? What if an ill-formed UTF-8 sequence is present? Should printf stop and report an error? Should std::cout throw an exception? Or should an implementation defined character such as U+FFFD {REPLACEMENT CHARACTER} or ? be substituted?
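One of the options listed above, substituting a replacement character, could look like the following sketch. It assumes ISO 8859-1 as the target encoding (chosen only because its byte values equal the code points U+0000 to U+00FF). The function name and the default `?` are illustrative, and a stricter policy could report an error instead of substituting.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: transcodes UTF-8 to ISO 8859-1, substituting
// `replacement` for ill-formed input and for code points above U+00FF.
inline std::string utf8_to_latin1_lossy(std::string_view utf8,
                                        char replacement = '?') {
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char b0 = static_cast<unsigned char>(utf8[i]);
        if (b0 < 0x80) {  // ASCII passes through unchanged
            out.push_back(static_cast<char>(b0));
            ++i;
            continue;
        }
        // Expected sequence length from the lead byte (0 means invalid lead).
        const std::size_t len = (b0 >= 0xC2 && b0 <= 0xDF) ? 2
                              : (b0 >= 0xE0 && b0 <= 0xEF) ? 3
                              : (b0 >= 0xF0 && b0 <= 0xF4) ? 4 : 0;
        // Count the continuation bytes (10xxxxxx) that actually follow.
        std::size_t have = 1;
        while (have < len && i + have < utf8.size() &&
               (static_cast<unsigned char>(utf8[i + have]) & 0xC0) == 0x80) {
            ++have;
        }
        if (len == 2 && have == 2 && b0 <= 0xC3) {
            // U+0080..U+00FF: representable in ISO 8859-1.
            const unsigned char b1 = static_cast<unsigned char>(utf8[i + 1]);
            out.push_back(static_cast<char>(((b0 & 0x1F) << 6) | (b1 & 0x3F)));
        } else {
            out.push_back(replacement);  // ill-formed or not representable
        }
        i += have;  // always advances by at least one byte
    }
    return out;
}
```

Called on the UTF-8 bytes of "Jöel", this produces the four bytes 4A F6 65 6C, while a character such as ア or a truncated sequence becomes a single `?`. Silent substitution is the most forgiving policy, but it also hides data loss from the caller, which is part of why the choice is hard to standardize.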

What should happen if std::cout is imbued with a std::codecvt facet? Presumably that facet will expect incoming text to be in a particular encoding. Should UTF-8 text be transcoded to one of the execution character set, the locale dependent encoding, or the console/terminal encoding before being presented to the facet? If so, which one? Should the implementation have to be aware of whether the stream is connected to a console/terminal? What if the programmer wants to override the default and, for example, always write UTF-8?

These are rather difficult questions that we don't have good answers for. std::u8out has been suggested, as a way to explicitly opt-in to UTF-8, but doesn't solve the problems of expected standard output encoding, issues with codecvt facets, and other iostreams problems like implicit locale dependent formatting.

Personally, in order to provide good Unicode support going forward, I think we're going to have to invest in a replacement for iostreams that 1) provides byte output with text support layered on top, 2) is encoding aware (in the text layer), 3) is locale independent (but with explicit opt-in support for locale dependent formatting like that provided by std::format), 4) is more performant than iostreams.

SG16 would like to hear your thoughts and suggestions. See https://github.com/sg16-unicode/sg16 for contact information.

EDIT: As of 2022-05-22, there is a paper, N2983, making its way through WG14 that seeks to add length modifiers to the formatted I/O functions for char8_t, char16_t, and char32_t characters and strings.

183

answered Nov 05 '22 12:11

Tom Honermann

What is the printf() formatting character for char8_t *?

There is no format specifier that will print char8_t* as a string. Using %s is technically undefined behavior because of a type mismatch, and clang will warn you about it (https://godbolt.org/z/xcs9Wj):

printf("%s", u8"Привет, мир!");

...: warning: format specifies type 'char *' but the argument has type 'const char8_t *' [-Wformat]
  printf("%s", u8"Привет, мир!");
          ~~   ^~~~~~~~~~~~~~~~
          %s

So the only thing you can do with such a string as-is is to print it as a pointer with %p, which is not very useful.

iostreams don't work with char8_t strings either. For example this doesn't compile in C++20:

std::cout << u8"Привет, мир!";

On most platforms, normal char strings are already UTF-8, and on Windows with MSVC you can compile with /utf-8, which gives you Unicode support on all major operating systems.
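Whether a plain `char` literal is actually UTF-8 depends on the compiler's execution character set. One quick way to check a given build, sketched here with an invented function name, is to compare a literal written with a universal character name against the expected UTF-8 bytes:

```cpp
#include <cstring>

// Returns true if narrow string literals are encoded as UTF-8 in this build.
// U+00F6 is 0xC3 0xB6 in UTF-8; with another execution character set
// (for example Windows-1252, where it is the single byte 0xF6) the
// comparison fails.
inline bool narrow_literals_are_utf8() {
    return std::strcmp("\u00F6", "\xC3\xB6") == 0;
}
```

GCC and Clang use UTF-8 as the execution character set by default, so the function is expected to return true there. With MSVC the result depends on whether /utf-8 (or an equivalent execution character set option) is in effect.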

For portable Unicode output you can use the {fmt} library, for example (https://godbolt.org/z/3ejsaG):

#include <fmt/core.h>

int main() {
  fmt::print("Привет, мир!");
}

prints:

Привет, мир!

Disclaimer: I'm the author of {fmt}.

answered Nov 05 '22 11:11

vitaut

printf is not defined by C++20 itself; C++20 includes the C standard library by reference. It will likely reference C18, but that's substantially equal to C11 (no new features; just fixes defect reports).

answered Nov 05 '22 12:11

MSalters
