
Printing UTF-8 strings with printf - wide vs. multibyte string literals

In statements like these, where both are entered into the source code with the same encoding (UTF-8) and the locale is set up properly, is there any practical difference between them?

printf("ο Δικαιοπολις εν αγρω εστιν\n");
printf("%ls", L"ο Δικαιοπολις εν αγρω εστιν\n");

And consequently is there any reason to prefer one over the other when doing output? I imagine the second performs a fair bit worse, but does it have any advantage (or disadvantage) over a multibyte literal?

EDIT: There are no issues with these strings printing. But I'm not using the wide string functions, because I want to be able to use printf etc. as well. So the question is are these ways of printing any different (given the situation outlined above), and if so, does the second one have any advantage?

EDIT2: Following the comments below, I now know that this program works -- which I thought wasn't possible:

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main() {
    setlocale(LC_ALL, "");
    wprintf(L"ο Δικαιοπολις εν αγρω εστιν\n");  // wide output
    freopen(NULL, "w", stdout);                 // lets me switch
    printf("ο Δικαιοπολις εν αγρω εστιν\n");    // byte output
}

EDIT3: I've done some further research by looking at what's going on with the two types. Take a simpler string:

wchar_t *wides = L"£100 π";
char *mbs = "£100 π";

The compiler is generating different code. The wide string is:

.string "\243"
.string ""
.string ""
.string "1"
.string ""
.string ""
.string "0"
.string ""
.string ""
.string "0"
.string ""
.string ""
.string " "
.string ""
.string ""
.string "\300\003"
.string ""
.string ""
.string ""
.string ""
.string ""

While the second is:

.string "\302\243100 \317\200" 

And looking at the Unicode encodings, the second is plain UTF-8. The wide character representation is UTF-32. I realise this is going to be implementation-dependent.

So perhaps the wide character representation of literals is more portable? My system will not print UTF-16/UTF-32 encodings directly, so it is being automatically converted to UTF-8 for output.

teppic asked Mar 20 '13 15:03


1 Answer

printf("ο Δικαιοπολις εν αγρω εστιν\n"); 

prints a narrow string literal (an array of char in which non-ASCII characters are stored as multibyte sequences). Although the output may look correct, other problems can arise when working with non-ASCII characters like these. For example:

char str[] = "αγρω";
printf("%zu %zu\n", sizeof(str), strlen(str));

outputs 9 8, since each of these characters is represented by 2 chars in UTF-8 (sizeof also counts the terminating null).

With the L prefix, the literal consists of wide characters (an array of wchar_t), and the %ls format specifier causes these wide characters to be converted to multibyte characters (UTF-8 in a UTF-8 locale). Note that in this case the locale must be set appropriately, otherwise the conversion may fail or produce invalid output:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void) {
    setlocale(LC_ALL, "");
    printf("%ls", L"ο Δικαιοπολις εν αγρω εστιν");
    return 0;
}

While some things get more complicated when working with wide characters, others become much simpler and more straightforward. For example:

wchar_t str[] = L"αγρω";
printf("%zu %zu", sizeof(str) / sizeof(wchar_t), wcslen(str));

will output 5 4 as one would naturally expect.

Once you decide to work with wide strings, wprintf can be used to print wide characters directly. It's also worth noting that on the Windows console, the translation mode of stdout must be explicitly set to one of the Unicode modes by calling _setmode:

#include <stdio.h>
#include <wchar.h>

#include <io.h>
#include <fcntl.h>
#ifndef _O_U16TEXT
  #define _O_U16TEXT 0x20000
#endif

int main() {
    _setmode(_fileno(stdout), _O_U16TEXT);
    // %ls is the standard specifier for a wide string argument
    wprintf(L"%ls\n", L"ο Δικαιοπολις εν αγρω εστιν");
    return 0;
}
LihO answered Sep 26 '22 06:09