There seems to be a problem when I write words containing foreign characters (French, for example).
For example, if I read input into an std::string or a char[] like this:
std::string s;
std::cin>>s; //if we input the string "café"
std::cout<<s<<std::endl; //outputs "café"
Everything is fine.
However, if the string is hard-coded:
std::string s="café";
std::cout<<s<<std::endl; //outputs "cafÚ"
What is going on? What characters are supported by C++ and how do I make it work right? Does it have something to do with my operating system (Windows 10)? My IDE (VS 15)? Or with C++?
In a nutshell, if you want to pass/receive Unicode text to/from the console on Windows 10 (in fact, any version of Windows), you need to use wide strings, i.e., std::wstring. Windows itself doesn't support UTF-8 encoding. This is a fundamental OS limitation.
The entire Win32 API, on which things like console and file system access are based, only works with Unicode characters under the UTF-16 encoding, and the C/C++ runtimes provided in Visual Studio don't offer any kind of translation layer to make this API UTF-8 compatible. This doesn't mean you can't use UTF-8 encoding internally, it just means that when you hit the Win32 API, or a C/C++ runtime feature that uses it, you'll need to convert between UTF-8 and UTF-16 encoding. It sucks, but it's just where we are right now.
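If you'd rather lean on the OS for that conversion, a minimal sketch using the Win32 MultiByteToWideChar/WideCharToMultiByte calls looks something like this (error handling omitted; hand-rolled alternatives are given further below):
#include <string>
#include <Windows.h>

// Sketch only: convert UTF-8 to UTF-16 via the Win32 API (no error handling)
std::wstring ToUTF16(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    int count = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), nullptr, 0);
    std::wstring utf16(count, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &utf16[0], count);
    return utf16;
}

// Sketch only: convert UTF-16 back to UTF-8 via the Win32 API (no error handling)
std::string ToUTF8(const std::wstring& utf16)
{
    if (utf16.empty()) return std::string();
    int count = WideCharToMultiByte(CP_UTF8, 0, utf16.data(), (int)utf16.size(), nullptr, 0, nullptr, nullptr);
    std::string utf8(count, '\0');
    WideCharToMultiByte(CP_UTF8, 0, utf16.data(), (int)utf16.size(), &utf8[0], count, nullptr, nullptr);
    return utf8;
}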
Some people might direct you to a series of tricks that purport to make the console work with UTF-8. Don't go this route; you'll run into a lot of problems. Only wide-character strings are properly supported for Unicode console access.
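To make that concrete, here is a minimal sketch of wide-character console I/O in a Visual Studio build; the _setmode call with _O_U16TEXT tells the CRT to exchange UTF-16 with the console directly:
#include <cstdio>
#include <fcntl.h>   // _O_U16TEXT
#include <io.h>      // _setmode
#include <iostream>
#include <string>

int main()
{
    // Switch stdin/stdout to UTF-16 mode before doing any console I/O
    _setmode(_fileno(stdin), _O_U16TEXT);
    _setmode(_fileno(stdout), _O_U16TEXT);

    std::wstring s;
    std::wcin >> s;                // typing "café" works
    std::wcout << s << std::endl;  // and it prints back as "café"
    return 0;
}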
Edit: Because UTF-8/UTF-16 string conversion is non-trivial, and there also isn't much help provided for this in C++, here are some conversion functions I prepared earlier:
///////////////////////////////////////////////////////////////////////////////////////////////////
#include <string> // needed for std::string / std::wstring

std::wstring UTF8ToUTF16(const std::string& stringUTF8)
{
    // Convert the encoding of the supplied string
    std::wstring stringUTF16;
    size_t sourceStringPos = 0;
    size_t sourceStringSize = stringUTF8.size();
    stringUTF16.reserve(sourceStringSize);
    while (sourceStringPos < sourceStringSize)
    {
        // Determine the number of code units required for the next character
        static const unsigned int codeUnitCountLookup[] = { 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 4 };
        unsigned int codeUnitCount = codeUnitCountLookup[(unsigned char)stringUTF8[sourceStringPos] >> 4];

        // Ensure that the requested number of code units are left in the source string
        if ((sourceStringPos + codeUnitCount) > sourceStringSize)
        {
            break;
        }

        // Convert the encoding of this character
        switch (codeUnitCount)
        {
        case 1:
        {
            stringUTF16.push_back((wchar_t)stringUTF8[sourceStringPos]);
            break;
        }
        case 2:
        {
            unsigned int unicodeCodePoint = (((unsigned int)stringUTF8[sourceStringPos] & 0x1F) << 6) |
                                            ((unsigned int)stringUTF8[sourceStringPos + 1] & 0x3F);
            stringUTF16.push_back((wchar_t)unicodeCodePoint);
            break;
        }
        case 3:
        {
            unsigned int unicodeCodePoint = (((unsigned int)stringUTF8[sourceStringPos] & 0x0F) << 12) |
                                            (((unsigned int)stringUTF8[sourceStringPos + 1] & 0x3F) << 6) |
                                            ((unsigned int)stringUTF8[sourceStringPos + 2] & 0x3F);
            stringUTF16.push_back((wchar_t)unicodeCodePoint);
            break;
        }
        case 4:
        {
            unsigned int unicodeCodePoint = (((unsigned int)stringUTF8[sourceStringPos] & 0x07) << 18) |
                                            (((unsigned int)stringUTF8[sourceStringPos + 1] & 0x3F) << 12) |
                                            (((unsigned int)stringUTF8[sourceStringPos + 2] & 0x3F) << 6) |
                                            ((unsigned int)stringUTF8[sourceStringPos + 3] & 0x3F);
            wchar_t convertedCodeUnit1 = 0xD800 | (((unicodeCodePoint - 0x10000) >> 10) & 0x03FF);
            wchar_t convertedCodeUnit2 = 0xDC00 | ((unicodeCodePoint - 0x10000) & 0x03FF);
            stringUTF16.push_back(convertedCodeUnit1);
            stringUTF16.push_back(convertedCodeUnit2);
            break;
        }
        }

        // Advance past the converted code units
        sourceStringPos += codeUnitCount;
    }

    // Return the converted string to the caller
    return stringUTF16;
}
///////////////////////////////////////////////////////////////////////////////////////////////////
std::string UTF16ToUTF8(const std::wstring& stringUTF16)
{
    // Convert the encoding of the supplied string
    std::string stringUTF8;
    size_t sourceStringPos = 0;
    size_t sourceStringSize = stringUTF16.size();
    stringUTF8.reserve(sourceStringSize * 2);
    while (sourceStringPos < sourceStringSize)
    {
        // Check if a surrogate pair is used for this character
        bool usesSurrogatePair = (((unsigned int)stringUTF16[sourceStringPos] & 0xF800) == 0xD800);

        // Ensure that the requested number of code units are left in the source string
        if (usesSurrogatePair && ((sourceStringPos + 2) > sourceStringSize))
        {
            break;
        }

        // Decode the character from UTF-16 encoding
        unsigned int unicodeCodePoint;
        if (usesSurrogatePair)
        {
            unicodeCodePoint = 0x10000 + ((((unsigned int)stringUTF16[sourceStringPos] & 0x03FF) << 10) |
                                          ((unsigned int)stringUTF16[sourceStringPos + 1] & 0x03FF));
        }
        else
        {
            unicodeCodePoint = (unsigned int)stringUTF16[sourceStringPos];
        }

        // Encode the character into UTF-8 encoding
        if (unicodeCodePoint <= 0x7F)
        {
            stringUTF8.push_back((char)unicodeCodePoint);
        }
        else if (unicodeCodePoint <= 0x07FF)
        {
            char convertedCodeUnit1 = (char)(0xC0 | (unicodeCodePoint >> 6));
            char convertedCodeUnit2 = (char)(0x80 | (unicodeCodePoint & 0x3F));
            stringUTF8.push_back(convertedCodeUnit1);
            stringUTF8.push_back(convertedCodeUnit2);
        }
        else if (unicodeCodePoint <= 0xFFFF)
        {
            char convertedCodeUnit1 = (char)(0xE0 | (unicodeCodePoint >> 12));
            char convertedCodeUnit2 = (char)(0x80 | ((unicodeCodePoint >> 6) & 0x3F));
            char convertedCodeUnit3 = (char)(0x80 | (unicodeCodePoint & 0x3F));
            stringUTF8.push_back(convertedCodeUnit1);
            stringUTF8.push_back(convertedCodeUnit2);
            stringUTF8.push_back(convertedCodeUnit3);
        }
        else
        {
            char convertedCodeUnit1 = (char)(0xF0 | (unicodeCodePoint >> 18));
            char convertedCodeUnit2 = (char)(0x80 | ((unicodeCodePoint >> 12) & 0x3F));
            char convertedCodeUnit3 = (char)(0x80 | ((unicodeCodePoint >> 6) & 0x3F));
            char convertedCodeUnit4 = (char)(0x80 | (unicodeCodePoint & 0x3F));
            stringUTF8.push_back(convertedCodeUnit1);
            stringUTF8.push_back(convertedCodeUnit2);
            stringUTF8.push_back(convertedCodeUnit3);
            stringUTF8.push_back(convertedCodeUnit4);
        }

        // Advance past the converted code units
        sourceStringPos += (usesSurrogatePair) ? 2 : 1;
    }

    // Return the converted string to the caller
    return stringUTF8;
}
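As a rough usage sketch (the window handle below is just a placeholder), the idea is to keep UTF-8 internally and convert only when you hit a UTF-16 Win32 API such as SetWindowTextW:
std::string utf8Title = "caf\xc3\xa9";                   // "café" stored internally as UTF-8 bytes
std::wstring utf16Title = UTF8ToUTF16(utf8Title);        // convert at the Win32 boundary
SetWindowTextW(someWindowHandle, utf16Title.c_str());    // someWindowHandle is a placeholder HWND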
I was in charge of the unenviable task of converting a 6 million line legacy Windows app to support Unicode, when it was only written to support ASCII (in fact its development pre-dates Unicode), where we used std::string and char[] internally to store strings. Since changing all the internal string storage buffers was simply not possible, we needed to adopt UTF-8 internally and convert between UTF-8 and UTF-16 when hitting the Win32 API. These are the conversion functions we used.
I would strongly recommend sticking with what's supported for new Windows development, which means wide strings. That said, there's no reason you can't base the core of your program on UTF-8 strings, but it will make things more tricky when interacting with Windows and various aspects of the C/C++ runtimes.
Edit 2: I've just re-read the original question, and I can see I didn't answer it very well. Let me give some more info that will specifically answer your question.
What's going on? When you use std::string with std::cin/std::cout in C++ on Windows, the console I/O is done using MBCS encoding. This is a deprecated mode under which the characters are encoded using the currently selected code page on the machine. Values encoded under these code pages are not Unicode, and cannot be shared with another system that has a different code page selected, or even with the same system if the code page is changed.
It works perfectly in your test because you're capturing the input under the current code page and displaying it back under the same code page. If you try capturing that input and saving it to a file, inspection will show it's not Unicode. Load it back with a different code page selected in your OS, and the text will appear corrupted. You can only interpret text if you know which code page it was encoded in. Since these legacy code pages are regional, and none of them can represent all text characters, it's effectively impossible to share text universally across different machines. MBCS pre-dates the development of Unicode, and it was specifically because of these kinds of issues that Unicode was invented. Unicode is basically the "one code page to rule them all".
You might be wondering why UTF-8 isn't a selectable "legacy" code page on Windows. A lot of us are wondering the same thing. Suffice to say, it isn't. As such, you shouldn't rely on MBCS encoding, because you can't get Unicode support when using it. Your only option for Unicode support on Windows is using std::wstring and calling the UTF-16 Win32 APIs.
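If you want to see which code pages are actually in play on your machine, a small sketch like this (using the GetACP and GetConsoleOutputCP calls) will print them:
#include <iostream>
#include <Windows.h>

int main()
{
    std::cout << "ANSI code page: " << GetACP() << std::endl;                        // typically 1252 on Western systems
    std::cout << "Console output code page: " << GetConsoleOutputCP() << std::endl;  // often 850 or 437
    return 0;
}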
As for your example with the hard-coded string, first of all understand that encoding non-ASCII text into your source file puts you into the realm of compiler-specific behaviour. In Visual Studio, you can actually specify the encoding of the source file (under File->Advanced Save Options). In your case, the text is coming out differently from what you'd expect because the encoding of the literal in your source file doesn't match the code page being used for console output; as mentioned, that output is done using MBCS encoding with your currently selected code page. Historically, you would have been advised to avoid any non-ASCII characters in source files and to escape them using the \x notation. Today, there are C++11 string literal prefixes that guarantee particular encoding forms. You could try using these if you need this ability. I have no practical experience using them, so I can't advise if there are any issues with this approach.
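For reference, this is roughly what those literal prefixes look like (a sketch only; how the bytes appear on the console still depends on how they're interpreted when printed):
const char* a = "caf\xe9";        // narrow string; \xe9 forces the single byte 0xE9
const char* b = u8"caf\u00e9";    // guaranteed UTF-8: 63 61 66 C3 A9 (type becomes const char8_t* in C++20)
const wchar_t* c = L"caf\u00e9";  // wide string, UTF-16 on Windows: 0063 0061 0066 00E9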
The problem originates with Windows itself. It uses one character encoding (UTF-16) for most internal operations, another (Windows-1252) for default file encoding, and yet another (Code Page 850 in your case) for console I/O. Your source file is encoded in Windows-1252, where é equates to the single byte '\xe9'. When you display this same code in Code Page 850, it becomes Ú. Using u8"é" produces the two-byte sequence "\xc3\xa9", which prints on the console as ├®.
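You can verify those byte values yourself with a small dump routine; this sketch assumes the source file is saved as Windows-1252 (as described above) and a pre-C++20 compiler, where u8 literals are still arrays of plain char:
#include <cstdio>

static void dumpBytes(const char* s)
{
    // Print each byte of the literal in hex
    for (; *s != '\0'; ++s)
        std::printf("%02X ", (unsigned char)*s);
    std::printf("\n");
}

int main()
{
    dumpBytes("café");    // 63 61 66 E9    (Windows-1252 'é')
    dumpBytes(u8"café");  // 63 61 66 C3 A9 (UTF-8 'é')
    return 0;
}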
Probably the easiest solution is to avoid putting non-ASCII literals in your code altogether and use the hex code for the character you require. This won't be a pretty or portable solution though.
std::string s="caf\x82";
A better solution would be to use UTF-16 (wide) strings and convert them using WideCharToMultiByte.
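A sketch of that approach (converting a wide literal to the console's current output code page before printing) might look like this:
#include <iostream>
#include <string>
#include <Windows.h>

int main()
{
    std::wstring wide = L"caf\u00e9";
    // Convert the UTF-16 string to the console's current output code page
    int size = WideCharToMultiByte(GetConsoleOutputCP(), 0, wide.data(), (int)wide.size(), nullptr, 0, nullptr, nullptr);
    std::string narrow(size, '\0');
    WideCharToMultiByte(GetConsoleOutputCP(), 0, wide.data(), (int)wide.size(), &narrow[0], size, nullptr, nullptr);
    std::cout << narrow << std::endl; // shows "café" as long as the code page can represent it
    return 0;
}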
What characters are supported by C++
The C++ standard does not specify which characters are supported; it is implementation-specific.
Does it have something to do with...
... C++?
No.
... My IDE?
No, although an IDE might have an option to save a source file in a particular encoding.
... my operating system?
It can have an influence. The result depends on several things agreeing with each other.
An example:
Source file is encoded in UTF-8. Compiler expects UTF-8. The terminal expects UTF-8. In this case, what you see is what you get.
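For example, on Windows one way to set up that all-UTF-8 scenario (a sketch only, and it can still be fragile, particularly for input) is to save the source as UTF-8, compile with MSVC's /utf-8 switch, and switch the console output code page before printing:
#include <iostream>
#include <Windows.h>

int main()
{
    SetConsoleOutputCP(CP_UTF8);        // tell the console to interpret output as UTF-8
    std::cout << "café" << std::endl;   // with /utf-8 and a UTF-8 source file this literal is stored as UTF-8
    return 0;
}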