Unicode problems in C++ but not C

Tags: c++, c, unicode, utf-8

I'm trying to write Unicode strings to the screen in C++ on Windows. I changed my console font to Lucida Console and set the console output code page to CP_UTF8 (65001).

I run the following code:

#include <stdio.h>  //notice this header file..
#include <windows.h>
#include <iostream>

int main()
{
    SetConsoleOutputCP(CP_UTF8);
    const char text[] = "Россия";
    printf("%s\n", text);
}

It prints out just fine!

However, if I do:

#include <cstdio>  //the C++ version of the header..
#include <windows.h>
#include <iostream>

int main()
{
    SetConsoleOutputCP(CP_UTF8);
    const char text[] = "Россия";
    printf("%s\n", text);
}

it prints: ������������

I have NO clue why..

Another thing is when I do:

#include <windows.h>
#include <cstdint>   // std::uint32_t
#include <iostream>
#include <string>    // std::string

int main()
{
    std::uint32_t oldcodepage = GetConsoleOutputCP();
    SetConsoleOutputCP(CP_UTF8);

    std::string text = u8"Россия";
    std::cout<<text<<"\n";

    SetConsoleOutputCP(oldcodepage);
}

I get the same output as above (non-working output).

Using printf on the std::string, it works fine though:

#include <stdio.h>
#include <windows.h>
#include <cstdint>   // std::uint32_t
#include <iostream>
#include <string>    // std::string

int main()
{
    std::uint32_t oldcodepage = GetConsoleOutputCP();
    SetConsoleOutputCP(CP_UTF8);

    std::string text = u8"Россия";
    printf("%s\n", text.c_str());

    SetConsoleOutputCP(oldcodepage);
}

but only if I use stdio.h and NOT cstdio.

Any ideas how I can use std::cout? How can I use cstdio as well? Why does this happen? Isn't cstdio just the C++ version of stdio.h?

EDIT: I've just tried:

#include <iostream>
#include <io.h>
#include <fcntl.h>

int main()
{
    _setmode(_fileno(stdout), _O_U8TEXT);
    std::wcout << L"Россия" << std::endl;
}

and yes, it works, but only if I use std::wcout and wide strings. I would really like to avoid wide strings, and the only solution I see so far is the C printf.

So the question still stands..

asked Jan 26 '14 23:01 by Brandon


1 Answer

Although you've set your console to expect UTF-8 output, I suspect that your compiler is treating string literals as being in some other character set. I don't know why the C compiler acts differently.

The good news is that C++11 includes some support for UTF-8, and that Microsoft has implemented the relevant portions of the Standard. The code is a little hairy, but you'll want to look into std::wstring_convert (converts to and from UTF-8) and the <cuchar> header.

You can use those functions to convert to UTF-8, and assuming your console is expecting UTF-8, things should work correctly.

Personally, when I need to debug something like this, I often direct the output to a text file. Text editors seem to handle Unicode better than the Windows console. In my case, I often output the code points correctly, but have the console set up incorrectly so that I still end up printing garbage.


I can tell you that this worked for me on both Linux (using Clang) and Windows (using GCC 4.7.3 and Clang 3.5; you need to add "-std=c++11" to the command line to compile with GCC or Clang):

#include <cstdio>

int main()
{
    const char text[] = u8"Россия";
    std::printf("%s\n", text);
}

Using Visual C++ (2012, but I believe it would also work with 2010), I had to use:

#include <codecvt>
#include <cstdio>
#include <locale>
#include <string>

int main()
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    auto text = converter.to_bytes(L"Россия");
    std::printf("%s\n", text.c_str());
}

answered Oct 21 '22 20:10 by Max Lybbert