
What is the printf() formatting character for char8_t *?

Tags:

c++

utf-8

c++20

Update 2022 Jul 28

P2513R4, char8_t Compatibility and Portability Fix, Draft Proposal, 2022-06-17

Two years later, the char8_t definition (or lack of one) is being called a "C++20 defect" and there is a rush to fix it. Finally.

Update 2020 Aug 25

The question seems somewhat irrelevant in the light of this:

// GCC 10.2, clang 10.0.1  -std=c++20

int main(int argc, char ** argv) 
{
    char32_t single_glyph_32 = U'ア' ;
    char16_t single_glyph_16 = u'ア' ;
    // gcc:   error: character constant too long for its type
    // clang: error: character too large for enclosing character literal type
    char8_t single_glyph_8 = u8'ア' ;

    return 42;
}

char8_t seems capable of handling just a tiny portion of UTF-8 glyphs, since a single char8_t holds only one UTF-8 code unit. Thus there is not much point in using it or trying to printf it.
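To make the compiler errors above concrete, here is a minimal sketch of why the character literal fails while a string literal would not. The helper name utf8_sequence_length is hypothetical (it is not a standard function), and the literal is written as a universal character name so the result does not depend on the source file encoding.

```cpp
#include <cstddef>

// Illustrative helper (hypothetical name, not a standard function):
// number of code units in the UTF-8 sequence introduced by `lead`,
// or 0 if `lead` is a continuation byte or not a valid lead byte.
constexpr std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;                  // 0xxxxxxx: ASCII
    if (lead >= 0xC2 && lead <= 0xDF) return 2; // two-unit sequence
    if (lead >= 0xE0 && lead <= 0xEF) return 3; // three-unit sequence
    if (lead >= 0xF0 && lead <= 0xF4) return 4; // four-unit sequence
    return 0;                                   // continuation or invalid
}

// U+30A2 is KATAKANA LETTER A. sizeof counts the terminating null,
// hence the "- 1".
static_assert(sizeof(u8"\u30A2") - 1 == 3,
              "U+30A2 occupies three UTF-8 code units");
static_assert(sizeof(u8"a") - 1 == 1,
              "an ASCII character occupies one code unit");
static_assert(utf8_sequence_length(0xE3) == 3,
              "0xE3 introduces a three-unit sequence");
```

A u8 character literal must fit in one code unit, which is why only the ASCII range is accepted there.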

Asked Nov 15 '19 at 14:04

And also for char8_t?

I assume there is some C++20 decision somewhere, but I could not find it. There is also P1428, but that document does not mention anything about the printf() family vs. char8_t * or char8_t.

The advice to use std::cout might be an answer. Unfortunately, that does not compile anymore.

// does not compile under C++20
// error : overload resolution selected deleted operator '<<'
// see P1423, proposal 7
std::cout <<  u8"A2";
std::cout <<  char8_t ('A');

For C 2.x and char8_t

Please start from here.

Update

I have done some more tests with a single element from a u8 sequence, and that indeed does not work. Passing char8_t * to printf("%s") does work, but passing char8_t to printf("%c") is an accident waiting to happen.

Please see https://wandbox.org/permlink/6NQtkKeZ9JUFw4Sd. The problem is that, as per the current status quo, char8_t is not implemented, while char8_t * is. Let me repeat: there is no implemented type to hold a single element from a char8_t * sequence.

If you want a single u8 glyph, you need to code it as a u8 string:

char8_t const * single_glyph = u8"ア";

And at present, the closest thing to a sure way to print the above seems to be:

// works with warnings
std::printf("%s", single_glyph ) ;

To start reading on this subject, these two papers are probably required:

  1. http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2231.htm
  2. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1423r2.html

In that order.


My primary development environment is Visual Studio 2019, with both MSVC and Clang 8.0.1 as delivered with VS, using /std:c++latest. The dev machine is Windows 10 [Version 10.0.18362.476].

asked Nov 15 '19 14:11 by Chef Gladiator



3 Answers

I'm the author of the char8_t P0482 and P1423 proposals for C++ (accepted for C++20) and the N2653 proposal for C (accepted for C23).

Let's think about what the following should do:

printf("Hello %s\n", u8"Jöel");
std::cout << "Hello " << u8"Jöel" << "\n";

Actually, let's take a further step back. What encoding is expected on the receiver side of standard output? There are a few possibilities. If standard out is connected to a console/terminal, then the expected encoding is the one that the console/terminal is configured for. On a Windows system in the United States, this is likely to be CP437. On a UNIX/Linux system, this is likely UTF-8. On a z/OS system in the United States, this is likely EBCDIC code page 037. If standard out has been redirected, then the expected encoding is likely locale dependent. On a Windows system in the United States, that would mean the Active Code Page (ACP), likely Windows 1252. On UNIX/Linux and z/OS, it would likely be the same as the console/terminal (Windows is the odd system here that has different defaults for console encoding vs locale encoding).

Back to that example code. What is the expected or desired behavior for that UTF-8 encoded ö character (U+00F6, {LATIN SMALL LETTER O WITH DIAERESIS}, encoded as 0xC3 0xB6)? For Windows writing to the console, for the character to display properly, the encoded sequence would need to be transcoded to 0x94 while for Windows where locale dependent output is expected, it would need to be transcoded to 0xF6. For UNIX/Linux, the sequence should probably be passed through. For z/OS, it may need to be transcoded to 0xCC. But on all of these systems, these defaults are configurable (e.g., via the LANG environment variable).
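As a small illustration of the first step any such transcoding needs, the sketch below decodes the two-byte sequence 0xC3 0xB6 back to its code point. The function name decode_two_byte_utf8 is hypothetical, and the function assumes a well-formed two-byte sequence without checking it.

```cpp
#include <cstdint>

// Hypothetical helper: decode a two-byte UTF-8 sequence (lead byte 110xxxxx,
// continuation byte 10xxxxxx) into a code point. Assumes the input is
// well-formed; no validation is performed.
constexpr std::uint32_t decode_two_byte_utf8(unsigned char lead,
                                             unsigned char cont) {
    return (static_cast<std::uint32_t>(lead & 0x1F) << 6) |
           static_cast<std::uint32_t>(cont & 0x3F);
}

// 0xC3 0xB6 is the UTF-8 encoding of U+00F6 (ö).
static_assert(decode_two_byte_utf8(0xC3, 0xB6) == 0x00F6,
              "0xC3 0xB6 decodes to U+00F6");
```

For Windows-1252 the resulting code point happens to equal the target byte (0xF6). The CP437 and EBCDIC code page 037 values (0x94 and 0xCC) have no such arithmetic relationship and would need a lookup table.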

Assuming that transcoding to a run-time determined encoding is the desired behavior, how should transcoding errors be handled? For example, what should happen if the target encoding lacks representation for ö? What if an ill-formed UTF-8 sequence is present? Should printf stop and report an error? Should std::cout throw an exception? Or should an implementation defined character such as U+FFFD {REPLACEMENT CHARACTER} or ? be substituted?
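To show what one of these policies could look like in practice, here is a minimal sketch of the "substitute a character" option, using ISO-8859-1 as an example target. The function name utf8_to_latin1 and the default '?' substitute are assumptions for illustration only; this is not a proposal for how printf or std::cout should behave.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: transcode UTF-8 to ISO-8859-1. Anything that has no
// ISO-8859-1 representation, and any ill-formed sequence, becomes one
// `substitute` character.
std::string utf8_to_latin1(const std::string& in, char substitute = '?') {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        if (b0 < 0x80) {                       // ASCII passes through
            out.push_back(static_cast<char>(b0));
            ++i;
            continue;
        }
        // How many continuation bytes does the lead byte announce?
        std::size_t expected = 0;
        if ((b0 & 0xE0) == 0xC0) expected = 1;
        else if ((b0 & 0xF0) == 0xE0) expected = 2;
        else if ((b0 & 0xF8) == 0xF0) expected = 3;

        // How many continuation bytes (10xxxxxx) actually follow?
        std::size_t found = 0;
        while (found < expected && i + 1 + found < in.size() &&
               (static_cast<unsigned char>(in[i + 1 + found]) & 0xC0) == 0x80) {
            ++found;
        }

        if (expected == 1 && found == 1 && (b0 == 0xC2 || b0 == 0xC3)) {
            // U+0080..U+00FF: the code point is the ISO-8859-1 byte.
            const unsigned char b1 = static_cast<unsigned char>(in[i + 1]);
            out.push_back(static_cast<char>(((b0 & 0x1F) << 6) | (b1 & 0x3F)));
        } else {
            out.push_back(substitute);         // unrepresentable or ill-formed
        }
        i += 1 + found;                        // skip what was consumed
    }
    return out;
}
```

Stopping with an error or throwing an exception would replace the substitute branch; the decoding logic would stay the same.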

What should happen if std::cout is imbued with a std::codecvt facet? Presumably that facet will expect incoming text to be in a particular encoding. Should UTF-8 text be transcoded to one of the execution character set, the locale dependent encoding, or the console/terminal encoding before being presented to the facet? If so, which one? Should the implementation have to be aware of whether the stream is connected to a console/terminal? What if the programmer wants to override the default and, for example, always write UTF-8?

These are rather difficult questions that we don't have good answers for. std::u8out has been suggested as a way to explicitly opt in to UTF-8, but it doesn't solve the problems of expected standard output encoding, issues with codecvt facets, and other iostreams problems like implicit locale dependent formatting.

Personally, in order to provide good Unicode support going forward, I think we're going to have to invest in a replacement for iostreams that 1) provides byte output with text support layered on top, 2) is encoding aware (in the text layer), 3) is locale independent (but with explicit opt-in support for locale dependent formatting like that provided by std::format), 4) is more performant than iostreams.

SG16 would like to hear your thoughts and suggestions. See https://github.com/sg16-unicode/sg16 for contact information.

EDIT: As of 2022-05-22, there is a paper, N2983, making its way through WG14 that seeks to add length modifiers to the formatted I/O functions for char8_t, char16_t, and char32_t characters and strings.

answered Nov 05 '22 12:11 by Tom Honermann


What is the printf() formatting character for char8_t *?

There is no format specifier that will print a char8_t* as a string. Using %s is technically undefined behavior because of the type mismatch, and clang will warn you about it (https://godbolt.org/z/xcs9Wj):

printf("%s", u8"Привет, мир!");
...: warning: format specifies type 'char *' but the argument has type 'const char8_t *' [-Wformat]
  printf("%s", u8"Привет, мир!");
          ~~   ^~~~~~~~~~~~~~~~
          %s

So the only thing you can do is to print such a string as a pointer with %p, which is not very useful.

iostreams don't work with char8_t strings either. For example, this doesn't compile in C++20:

std::cout << u8"Привет, мир!";

On most platforms normal char strings are already UTF-8, and on Windows with MSVC you can compile with /utf-8, which will give you Unicode support on major operating systems.

For portable Unicode output you can use the {fmt} library, for example (https://godbolt.org/z/3ejsaG):

#include <fmt/core.h>

int main() {
  fmt::print("Привет, мир!");
}

prints:

Привет, мир!

Disclaimer: I'm the author of {fmt}.

answered Nov 05 '22 11:11 by vitaut


printf is not defined by C++20 itself; C++20 includes the C standard library by reference. It will likely reference C18, but that's substantially equal to C11 (no new features; just fixes for defect reports).

answered Nov 05 '22 12:11 by MSalters