
How to create a UTF-8 string literal in Visual C++ 2008

Tags: c++, utf-8, visual-c++

In VC++ 2003, I could just save the source file as UTF-8 and all strings were used as is. In other words, the following code would print the strings as is to the console. If the source file was saved as UTF-8 then the output would be UTF-8.

    printf("Chinese (Traditional)");
    printf("中国語 (繁体)");
    printf("중국어 (번체)");
    printf("Chinês (Tradicional)");

I have saved the file in UTF-8 format with the UTF-8 BOM. However compiling with VC2008 results in:

    warning C4566: character represented by universal-character-name '\uC911' cannot be represented in the current code page (932)
    warning C4566: character represented by universal-character-name '\uAD6D' cannot be represented in the current code page (932)
    etc.

The characters causing these warnings are corrupted. The ones that do fit the locale (in this case 932 = Japanese) are converted to the locale encoding, i.e. Shift-JIS.
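One way to confirm what the compiler actually stored is to dump the bytes of a literal at run time. The sketch below is only an illustration: `HexDump` is a hypothetical helper name, not something from the original code. If a literal containing U+C911 was left alone, it should come out as the UTF-8 bytes `EC A4 91`; any other sequence means the literal was re-encoded or replaced.

```cpp
#include <cstdio>
#include <string>

// Hypothetical diagnostic helper (not part of the original program):
// renders each byte of `bytes` as two uppercase hex digits separated by
// spaces, so the bytes the compiler emitted for a literal can be inspected.
std::string HexDump(const std::string& bytes) {
    std::string out;
    char buf[4];
    for (std::string::size_type i = 0; i < bytes.size(); ++i) {
        // Cast through unsigned char so bytes >= 0x80 are not sign-extended.
        std::snprintf(buf, sizeof(buf), "%02X",
                      static_cast<unsigned char>(bytes[i]));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

Calling `HexDump` on each literal from the `printf` calls and printing the result shows which strings survived compilation byte for byte.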

I cannot find a way to get VC++ 2008 to compile this for me. Note that it doesn't matter what locale I use in the source file. There doesn't appear to be a locale that says "I know what I'm doing, so don't f$%##ng change my string literals". In particular, the useless UTF-8 pseudo-locale doesn't work.

#pragma setlocale(".65001")  => error C2175: '.65001' : invalid locale

Neither does "C":

#pragma setlocale("C")  => see warnings above (in particular locale is still 932)

It appears that VC2008 forces all characters into the specified (or default) locale, and that locale cannot be UTF-8. I do not want to change the file to use escape strings like "\xbf\x11..." because the same source is compiled using gcc which can quite happily deal with UTF-8 files.

Is there any way to specify that compilation of the source file should leave string literals untouched?

To ask it differently: what compiler flags can I use to specify backward compatibility with VC2003 when compiling the source file? That is, do not change the string literals; use them byte for byte as they are.

Update

Thanks for the suggestions, but I want to avoid wchar. Since this app deals with strings in UTF-8 exclusively, using wchar would then require me to convert all strings back into UTF-8 which should be unnecessary. All input, output and internal processing is in UTF-8. It is a simple app that works fine as is on Linux and when compiled with VC2003. I want to be able to compile the same app with VC2008 and have it work.

For this to happen, I need VC2008 to not try to convert it to my local machine's locale (Japanese, 932). I want VC2008 to be backward compatible with VC2003. I want a locale or compiler setting that says strings are used as is, essentially as opaque arrays of char, or as UTF-8. It looks like I might be stuck with VC2003 and gcc though, VC2008 is trying to be too smart in this instance.


asked Mar 27 '09 06:03

brofield

2 Answers

Update:

I've decided that there is no guaranteed way to do this. The solution that I present below works for English version VC2003, but fails when compiling with Japanese version VC2003 (or perhaps it is Japanese OS). In any case, it cannot be depended on to work. Note that even declaring everything as L"" strings didn't work (and is painful in gcc as described below).

Instead I believe that you just need to bite the bullet and move all text into a data file and load it from there. I am now storing and accessing the text in INI files via SimpleIni (cross-platform INI-file library). At least there is a guarantee that it works as all text is out of the program.

Original:

I'm answering this myself since only Evan appeared to understand the problem. The answers regarding what Unicode is and how to use wchar_t are not relevant for this problem, as this is not about internationalization, nor about a misunderstanding of Unicode or character encodings. I appreciate your attempts to help, though; apologies if I wasn't clear enough.

The problem is that I have source files that need to be cross-compiled under a variety of platforms and compilers. The program does UTF-8 processing. It doesn't care about any other encodings. I want to have string literals in UTF-8 like currently works with gcc and vc2003. How do I do it with VC2008? (i.e. backward compatible solution).

This is what I have found:

gcc (v4.3.2 20081105):

string literals are used as is (raw strings)
supports UTF-8 encoded source files
source files must not have a UTF-8 BOM

vc2003:

string literals are used as is (raw strings)
supports UTF-8 encoded source files
source files may or may not have a UTF-8 BOM (it doesn't matter)

vc2005+:

string literals are massaged by the compiler (no raw strings)
char string literals are re-encoded to a specified locale
UTF-8 is not supported as a target locale
source files must have a UTF-8 BOM

So, the simple answer is that for this particular purpose, VC2005+ is broken and does not supply a backward compatible compile path. The only way to get Unicode strings into the compiled program is via UTF-8 + BOM + wchar which means that I need to convert all strings back to UTF-8 at time of use.

There isn't any simple cross-platform method of converting wchar to UTF-8. For instance, what size and encoding is the wchar in? On Windows, UTF-16. On other platforms, it varies. See the ICU project for some details.
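To make the size question concrete, here is a hedged sketch of a hand-rolled conversion. It assumes wchar_t holds UTF-16 when it is 2 bytes wide and UTF-32 otherwise, which matches Windows and typical Linux toolchains but is not guaranteed everywhere. `AppendUtf8` and `WideToUtf8` are hypothetical names, and the code does not validate its input (lone surrogates and out-of-range values are encoded as given).

```cpp
#include <cstddef>
#include <string>

// Append one Unicode code point to `out` as 1 to 4 UTF-8 bytes.
static void AppendUtf8(std::string& out, unsigned long cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: assumes UTF-16 when wchar_t is 2 bytes wide,
// UTF-32 otherwise. No validation of malformed input is attempted.
std::string WideToUtf8(const std::wstring& w) {
    std::string out;
    for (std::size_t i = 0; i < w.size(); ++i) {
        unsigned long cp = static_cast<unsigned long>(w[i]);
        // With 16-bit wchar_t, combine a surrogate pair into one code point.
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF &&
            i + 1 < w.size()) {
            unsigned long lo = static_cast<unsigned long>(w[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
        }
        AppendUtf8(out, cp);
    }
    return out;
}
```

Returning a `std::string` rather than a pointer into a static buffer also avoids fixed-size and thread-safety concerns, at the cost of an allocation per call.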

In the end I decided that I will avoid the conversion cost on all compilers other than vc2005+ with source like the following.

    #if defined(_MSC_VER) && _MSC_VER > 1310
    // Visual C++ 2005 and later require the source files in UTF-8, and all strings
    // to be encoded as wchar_t, otherwise the strings will be converted into the
    // local multibyte encoding and cause errors. To use a wchar_t as UTF-8, these
    // strings then need to be converted back to UTF-8. This function is just a rough
    // example of how to do this.
    # include <windows.h>  // WideCharToMultiByte, CP_UTF8
    # define utf8(str)  ConvertToUTF8(L##str)
    const char * ConvertToUTF8(const wchar_t * pStr) {
        // Single shared buffer: each call overwrites the previous result.
        static char szBuf[1024];
        WideCharToMultiByte(CP_UTF8, 0, pStr, -1, szBuf, sizeof(szBuf), NULL, NULL);
        return szBuf;
    }
    #else
    // Visual C++ 2003 and gcc will use the string literals as is, so the files
    // should be saved as UTF-8. gcc requires the files to not have a UTF-8 BOM.
    # define utf8(str)  str
    #endif

Note that this code is just a simplified example. Production use would need to clean it up in a variety of ways (thread-safety, error checking, buffer size checks, etc).

This is used like the following code. It compiles cleanly and works correctly in my tests on gcc, vc2003, and vc2008:

    std::string mText;
    mText = utf8("Chinese (Traditional)");
    mText = utf8("中国語 (繁体)");
    mText = utf8("중국어 (번체)");
    mText = utf8("Chinês (Tradicional)");


answered Oct 14 '22 03:10

brofield

While it is probably better to use wide strings and then convert to UTF-8 as needed, I think your best bet is, as you mentioned, to use hex escapes in the strings. Suppose you wanted code point \uC911; you could just do this:

const char *str = "\xEC\xA4\x91";

I believe this will work just fine. It just isn't very readable, so if you do this, please add a comment to explain.
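Since escaped bytes are hard to read, a small check can confirm that an escape sequence really encodes the intended character. The following is a minimal sketch, with `DecodeFirstCodePoint` as a hypothetical helper name; it decodes only the first UTF-8 sequence and does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <string>

// Hypothetical checker: decodes the first UTF-8 sequence in `s` and returns
// its code point, or -1 if the string is empty, truncated, or malformed.
// Overlong forms and surrogate values are not rejected in this sketch.
long DecodeFirstCodePoint(const std::string& s) {
    if (s.empty()) return -1;
    unsigned char b0 = static_cast<unsigned char>(s[0]);
    std::size_t extra;  // number of continuation bytes expected
    long cp;
    if (b0 < 0x80)                { extra = 0; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; }
    else return -1;  // a continuation byte or invalid lead byte
    if (s.size() < extra + 1) return -1;
    for (std::size_t i = 1; i <= extra; ++i) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) return -1;  // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }
    return cp;
}
```

One thing to watch with this style: a hex escape consumes every hex digit that follows it, so an escape followed directly by a character such as `1` or `a` needs the literal split in two, for example `"\xEC\xA4\x91" "1"`.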

answered Oct 14 '22 03:10

Evan Teran
