Characters supported in C++

There seems to be a problem when I'm writing words with foreign characters (French, for example).

For example, if I ask for input for an std::string or a char[] like this:

std::string s;
std::cin>>s;  //if we input the string "café"
std::cout<<s<<std::endl;  //outputs "café"

Everything is fine.

However, if the string is hard-coded:

std::string s="café";
std::cout<<s<<std::endl; //outputs "cafÚ"

What is going on? What characters are supported by C++ and how do I make it work right? Does it have something to do with my operating system (Windows 10)? My IDE (VS 15)? Or with C++?

Asked Jan 23 '17 by Tom Dorone


3 Answers

In a nutshell, if you want to pass/receive Unicode text to/from the console on Windows 10 (in fact, any version of Windows), you need to use wide strings, i.e. std::wstring. Windows itself doesn't support UTF-8 encoding. This is a fundamental OS limitation.

The entire Win32 API, on which things like console and file system access are based, only works with Unicode characters under the UTF-16 encoding, and the C/C++ runtimes provided in Visual Studio don't offer any kind of translation layer to make this API UTF-8 compatible. This doesn't mean you can't use UTF-8 encoding internally, it just means that when you hit the Win32 API, or a C/C++ runtime feature that uses it, you'll need to convert between UTF-8 and UTF-16 encoding. It sucks, but it's just where we are right now.

Some people might direct you to a series of tricks that purport to make the console work with UTF-8. Don't go this route; you'll run into a lot of problems. Only wide-character strings are properly supported for Unicode console access.
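
As a rough sketch of what wide-character console I/O can look like (assuming the Visual C++ runtime, where _setmode and _O_U16TEXT come from <io.h> and <fcntl.h>):

#include <fcntl.h>    // _O_U16TEXT
#include <io.h>       // _setmode, _fileno
#include <cstdio>     // stdout, stdin
#include <iostream>
#include <string>

int main()
{
    // Put the standard streams into UTF-16 mode so the console round-trips
    // Unicode correctly (specific to the Visual C++ runtime).
    _setmode(_fileno(stdout), _O_U16TEXT);
    _setmode(_fileno(stdin), _O_U16TEXT);

    // \u00e9 is 'é'; the escape keeps the source file pure ASCII and
    // sidesteps the source-encoding issues discussed further down.
    std::wstring s = L"caf\u00e9";
    std::wcout << s << std::endl;

    std::wstring input;
    std::wcin >> input;              // accented input survives the round trip
    std::wcout << input << std::endl;
    return 0;
}

Note that once a stream is in UTF-16 mode, narrow output (std::cout, printf) on that same stream will fail in the Visual C++ debug runtime, so pick one width and stick with it.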

Edit: Because UTF-8/UTF-16 string conversion is non-trivial, and there also isn't much help provided for this in C++, here are some conversion functions I prepared earlier:

///////////////////////////////////////////////////////////////////////////////////////////////////
std::wstring UTF8ToUTF16(const std::string& stringUTF8)
{
    // Convert the encoding of the supplied string
    std::wstring stringUTF16;
    size_t sourceStringPos = 0;
    size_t sourceStringSize = stringUTF8.size();
    stringUTF16.reserve(sourceStringSize);
    while (sourceStringPos < sourceStringSize)
    {
        // Determine the number of code units required for the next character
        static const unsigned int codeUnitCountLookup[] = { 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 4 };
        unsigned int codeUnitCount = codeUnitCountLookup[(unsigned char)stringUTF8[sourceStringPos] >> 4];

        // Ensure that the requested number of code units are left in the source string
        if ((sourceStringPos + codeUnitCount) > sourceStringSize)
        {
            break;
        }

        // Convert the encoding of this character
        switch (codeUnitCount)
        {
        case 1:
        {
            stringUTF16.push_back((wchar_t)stringUTF8[sourceStringPos]);
            break;
        }
        case 2:
        {
            unsigned int unicodeCodePoint = (((unsigned int)stringUTF8[sourceStringPos] & 0x1F) << 6) |
                                            ((unsigned int)stringUTF8[sourceStringPos + 1] & 0x3F);
            stringUTF16.push_back((wchar_t)unicodeCodePoint);
            break;
        }
        case 3:
        {
            unsigned int unicodeCodePoint = (((unsigned int)stringUTF8[sourceStringPos] & 0x0F) << 12) |
                                            (((unsigned int)stringUTF8[sourceStringPos + 1] & 0x3F) << 6) |
                                            ((unsigned int)stringUTF8[sourceStringPos + 2] & 0x3F);
            stringUTF16.push_back((wchar_t)unicodeCodePoint);
            break;
        }
        case 4:
        {
            unsigned int unicodeCodePoint = (((unsigned int)stringUTF8[sourceStringPos] & 0x07) << 18) |
                                            (((unsigned int)stringUTF8[sourceStringPos + 1] & 0x3F) << 12) |
                                            (((unsigned int)stringUTF8[sourceStringPos + 2] & 0x3F) << 6) |
                                            ((unsigned int)stringUTF8[sourceStringPos + 3] & 0x3F);
            wchar_t convertedCodeUnit1 = 0xD800 | (((unicodeCodePoint - 0x10000) >> 10) & 0x03FF);
            wchar_t convertedCodeUnit2 = 0xDC00 | ((unicodeCodePoint - 0x10000) & 0x03FF);
            stringUTF16.push_back(convertedCodeUnit1);
            stringUTF16.push_back(convertedCodeUnit2);
            break;
        }
        }

        // Advance past the converted code units
        sourceStringPos += codeUnitCount;
    }

    // Return the converted string to the caller
    return stringUTF16;
}

///////////////////////////////////////////////////////////////////////////////////////////////////
std::string UTF16ToUTF8(const std::wstring& stringUTF16)
{
    // Convert the encoding of the supplied string
    std::string stringUTF8;
    size_t sourceStringPos = 0;
    size_t sourceStringSize = stringUTF16.size();
    stringUTF8.reserve(sourceStringSize * 2);
    while (sourceStringPos < sourceStringSize)
    {
        // Check if a surrogate pair is used for this character
        bool usesSurrogatePair = (((unsigned int)stringUTF16[sourceStringPos] & 0xF800) == 0xD800);

        // Ensure that the requested number of code units are left in the source string
        if (usesSurrogatePair && ((sourceStringPos + 2) > sourceStringSize))
        {
            break;
        }

        // Decode the character from UTF-16 encoding
        unsigned int unicodeCodePoint;
        if (usesSurrogatePair)
        {
            unicodeCodePoint = 0x10000 + ((((unsigned int)stringUTF16[sourceStringPos] & 0x03FF) << 10) | ((unsigned int)stringUTF16[sourceStringPos + 1] & 0x03FF));
        }
        else
        {
            unicodeCodePoint = (unsigned int)stringUTF16[sourceStringPos];
        }

        // Encode the character into UTF-8 encoding
        if (unicodeCodePoint <= 0x7F)
        {
            stringUTF8.push_back((char)unicodeCodePoint);
        }
        else if (unicodeCodePoint <= 0x07FF)
        {
            char convertedCodeUnit1 = (char)(0xC0 | (unicodeCodePoint >> 6));
            char convertedCodeUnit2 = (char)(0x80 | (unicodeCodePoint & 0x3F));
            stringUTF8.push_back(convertedCodeUnit1);
            stringUTF8.push_back(convertedCodeUnit2);
        }
        else if (unicodeCodePoint <= 0xFFFF)
        {
            char convertedCodeUnit1 = (char)(0xE0 | (unicodeCodePoint >> 12));
            char convertedCodeUnit2 = (char)(0x80 | ((unicodeCodePoint >> 6) & 0x3F));
            char convertedCodeUnit3 = (char)(0x80 | (unicodeCodePoint & 0x3F));
            stringUTF8.push_back(convertedCodeUnit1);
            stringUTF8.push_back(convertedCodeUnit2);
            stringUTF8.push_back(convertedCodeUnit3);
        }
        else
        {
            char convertedCodeUnit1 = (char)(0xF0 | (unicodeCodePoint >> 18));
            char convertedCodeUnit2 = (char)(0x80 | ((unicodeCodePoint >> 12) & 0x3F));
            char convertedCodeUnit3 = (char)(0x80 | ((unicodeCodePoint >> 6) & 0x3F));
            char convertedCodeUnit4 = (char)(0x80 | (unicodeCodePoint & 0x3F));
            stringUTF8.push_back(convertedCodeUnit1);
            stringUTF8.push_back(convertedCodeUnit2);
            stringUTF8.push_back(convertedCodeUnit3);
            stringUTF8.push_back(convertedCodeUnit4);
        }

        // Advance past the converted code units
        sourceStringPos += (usesSurrogatePair) ? 2 : 1;
    }

    // Return the converted string to the caller
    return stringUTF8;
}

I was in charge of the unenviable task of converting a 6-million-line legacy Windows app to support Unicode, when it had only been written to support ASCII (in fact its development pre-dates Unicode) and used std::string and char[] internally to store strings. Since changing all the internal string storage buffers was simply not possible, we needed to adopt UTF-8 internally and convert between UTF-8 and UTF-16 when hitting the Win32 API. These are the conversion functions we used.
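
As an illustration of that pattern (a hypothetical call site, not taken from the original code base), internal storage stays UTF-8 and the conversion happens only at the Win32 boundary:

#include <windows.h>
#include <string>

// Assumes the UTF8ToUTF16 helper shown above is visible here. Win32 "W"
// entry points such as MessageBoxW expect UTF-16, so we convert on the way in.
void ShowMessage(const std::string& utf8Text, const std::string& utf8Caption)
{
    std::wstring text = UTF8ToUTF16(utf8Text);
    std::wstring caption = UTF8ToUTF16(utf8Caption);
    MessageBoxW(nullptr, text.c_str(), caption.c_str(), MB_OK);
}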

I would strongly recommend sticking with what's supported for new Windows development, which means wide strings. That said, there's no reason you can't base the core of your program on UTF-8 strings, but it will make things more tricky when interacting with Windows and various aspects of the C/C++ runtimes.

Edit 2: I've just re-read the original question, and I can see I didn't answer it very well. Let me give some more info that will specifically answer your question.

What's going on? When you use std::string with std::cin/std::cout in a C++ program on Windows, the console I/O is done using MBCS encoding. This is a deprecated mode in which characters are encoded using the code page currently selected on the machine. Values encoded under these code pages are not Unicode, and cannot be shared with another system that has a different code page selected, or even with the same system if the code page is changed.

It works perfectly in your test because you're capturing the input under the current code page and displaying it back under that same code page. If you try capturing that input and saving it to a file, inspection will show it's not Unicode. Load it back with a different code page selected in your OS, and the text will appear corrupted. You can only interpret the text if you know which code page it was encoded with. Since these legacy code pages are regional, and none of them can represent all text characters, it's effectively impossible to share text universally across different machines. MBCS pre-dates the development of Unicode, and it was precisely because of these kinds of issues that Unicode was invented. Unicode is basically the "one code page to rule them all".

You might be wondering why UTF-8 isn't a selectable "legacy" code page on Windows. A lot of us are wondering the same thing. Suffice to say, it isn't. As such, you shouldn't rely on MBCS encoding, because you can't get Unicode support when using it. Your only option for Unicode support on Windows is to use std::wstring and call the UTF-16 Win32 APIs.
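
If you want to see which code pages are in play on a given machine, a small diagnostic along these lines (a sketch using standard Win32 calls) will show them:

#include <windows.h>
#include <iostream>

int main()
{
    // GetACP: the ANSI ("MBCS") code page used by narrow Win32 calls.
    // GetConsoleCP / GetConsoleOutputCP: the code pages the console uses for
    // input and output, typically an OEM page such as 850 or 437.
    std::cout << "ANSI code page:           " << GetACP() << '\n';
    std::cout << "Console input code page:  " << GetConsoleCP() << '\n';
    std::cout << "Console output code page: " << GetConsoleOutputCP() << '\n';
    return 0;
}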

As for your example of the hard-coded string, first understand that putting non-ASCII text into your source file puts you into the realm of compiler-specific behaviour. In Visual Studio, you can actually specify the encoding of the source file (under File->Advanced Save Options). In your case, the text comes out different from what you'd expect because it's (most likely) encoded in UTF-8 in the source file, but as mentioned, the console output is being done using MBCS encoding on your currently selected code page, which isn't UTF-8. Historically, you would have been advised to avoid any non-ASCII characters in source files and to escape them using the \x notation. Today, there are C++11 string literal prefixes (u8, u, U, L) that guarantee particular encodings. You could try using these if you need this ability. I have no practical experience using them, so I can't advise if there are any issues with this approach.
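
For illustration (a minimal sketch; again, I haven't battle-tested these), the C++11 encoding prefixes combined with universal-character-names look like this, and they keep the source file pure ASCII so the source-file encoding no longer matters:

#include <string>

// \u00e9 is the Unicode code point for 'é'.
std::string    utf8 = u8"caf\u00e9";  // UTF-8 (5 bytes). Note: in C++20, u8
                                      // literals become char8_t and no longer
                                      // convert to std::string directly.
std::wstring   wide = L"caf\u00e9";   // wchar_t: UTF-16 on Windows
std::u16string u16  = u"caf\u00e9";   // char16_t, UTF-16
std::u32string u32  = U"caf\u00e9";   // char32_t, UTF-32

Whether the console then displays any of these correctly still depends on the code-page issues described above.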

Answered by Roger Sanders


The problem originates with Windows itself. It uses one character encoding (UTF-16) for most internal operations, another (Windows-1252) for default file encoding, and yet another (Code Page 850 in your case) for console I/O. Your source file is encoded in Windows-1252, where é equates to the single byte '\xe9'. When you display this same code in Code Page 850, it becomes Ú. Using u8"é" produces a two byte sequence "\xc3\xa9", which prints on the console as ├®.

Probably the easiest solution is to avoid putting non-ASCII literals in your code altogether and use the hex code for the character you require. This won't be a pretty or portable solution though.

std::string s="caf\x82";

A better solution would be to use wide (UTF-16) strings and encode them for output using WideCharToMultiByte.
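
A rough sketch of that approach (ToConsoleEncoding is just an illustrative helper name, not a standard function): convert a wide string to the console's current code page with WideCharToMultiByte before printing it.

#include <windows.h>
#include <iostream>
#include <string>

// Convert a UTF-16 string to the console's current output code page.
// Characters the code page cannot represent are replaced with a default.
std::string ToConsoleEncoding(const std::wstring& wide)
{
    UINT codePage = GetConsoleOutputCP();
    int size = WideCharToMultiByte(codePage, 0, wide.c_str(), -1,
                                   nullptr, 0, nullptr, nullptr);
    if (size <= 0)
    {
        return std::string();
    }
    std::string narrow(size, '\0');
    WideCharToMultiByte(codePage, 0, wide.c_str(), -1,
                        &narrow[0], size, nullptr, nullptr);
    narrow.resize(size - 1);  // drop the trailing null written by the API
    return narrow;
}

int main()
{
    std::cout << ToConsoleEncoding(L"caf\u00e9") << std::endl;  // prints café
    return 0;
}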

Answered by Mark Ransom


What characters are supported by C++

The C++ standard does not specify which characters are supported. It is implementation-specific.

Does it have something to do with...

... C++?

No.

... My IDE?

No, although an IDE might have an option to edit a source file in particular encoding.

... my operating system?

This may have an influence.

This is influenced by several things.

  • What is the encoding of the source file.
  • What is the encoding that the compiler uses to interpret the source file.
    • Is it the same as the encoding of the file, or different (it should be the same or it might not work correctly).
    • The native encoding of your operating system probably influences what character encoding your compiler expects by default.
  • What encoding does the terminal that runs the program support.
    • Is it the same as the encoding of the file, or different (it should be the same or it might not work correctly without conversion).
  • Whether the character encoding in use is wide. (By wide, I mean that a code unit is wider than CHAR_BIT.) A wide source/compiler encoding forces a conversion into another, narrow encoding, because you use a narrow string literal and a narrow stream operator. In that case, you'll need to figure out both the native narrow and the native wide character encodings expected by the compiler. The compiler converts the input string into the narrow encoding, and if the narrow encoding has no representation for a character in the input encoding, it might not work correctly.

An example:

Source file is encoded in UTF-8. Compiler expects UTF-8. The terminal expects UTF-8. In this case, what you see is what you get.
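
To make that concrete, here is a minimal sketch of the all-UTF-8 case (the compiler switches and the chcp command are typical examples for MSVC/GCC and the Windows console, not requirements of the language):

// This file is saved as UTF-8, the compiler is told to read and emit UTF-8
// (e.g. MSVC's /utf-8 switch, or GCC's -finput-charset=UTF-8
// -fexec-charset=UTF-8), and the terminal is set to a UTF-8 code page
// (e.g. "chcp 65001" on Windows). When all three agree, the bytes pass
// through unchanged and what you see is what you get.
#include <iostream>
#include <string>

int main()
{
    std::string s = "caf\u00e9";  // stored in the executable as UTF-8 bytes
    std::cout << s << std::endl;  // displays correctly only if the terminal
                                  // also interprets the output as UTF-8
    return 0;
}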

Answered by eerorika