Does VC have a compile option like '-fexec-charset' in GCC to set the execution character set?

GCC has three compile options, -finput-charset, -fexec-charset and -fwide-exec-charset, to specify the particular encodings involved in a "compile chain", like the following:

+--------+   -finput-charset     +----------+    -fexec-charset (or)    +-----+
| source | ------------------->  | compiler |  -----------------------> | exe |
+--------+                       +----------+    -fwide-exec-charset    +-----+

Reference: GCC compiler options
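An illustrative invocation (the file name and charsets are just examples):

    gcc -finput-charset=UTF-8 -fexec-charset=UTF-8 -fwide-exec-charset=UTF-32LE main.c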

I found a question about -finput-charset here: Specification of source charset encoding in MSVC++, like gcc “-finput-charset=CharSet”. But I want to know whether VC has a compiler option like -fexec-charset in GCC to specify the execution character set.

I found a seemingly related option in Visual Studio: Project Properties/Configuration Properties/General/Character Set, whose value is Use Unicode Character Set. Does it do the same thing as -fexec-charset in GCC? If so, how can I set the execution character set to UTF-8?

Why do I want to set the execution encoding?

I'm writing an application in C++ which needs to communicate with a db server, and the charset of the tables is utf8. When I run my tests, they catch exceptions thrown around insertion operations on the db tables; the exceptions report that incorrect string values were encountered. I suppose that's caused by a wrong encoding, right? BTW, are there any other ways to handle this issue?

asked May 12 '14 by Ggicci

3 Answers

AFAIK, VC++ doesn't have a command-line flag to let you specify a UTF-8 execution character set. However, it does (sporadically) support the undocumented

#pragma execution_character_set("utf-8")

referred to here.

To get the effect of a command-line flag with this pragma, you can write the pragma in a header file, say preinclude.h, and pre-include this header in every compilation by passing the flag /FI preinclude.h. See this documentation for how to set this flag from the IDE.
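A minimal sketch of this setup (the header name is just an example):

    // preinclude.h -- force UTF-8 encoded narrow string literals
    #pragma execution_character_set("utf-8")

Then compile each translation unit with the header force-included:

    cl /FI preinclude.h /c main.cpp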

The pragma was supported in VC++ 2010, then forgotten in VC++ 2012, and is supported again in VC++ 2013.

answered Oct 16 '22 by Mike Kinghan


Visual Studio 2015 Update 2 and later support setting the execution character set:

You can use the option /utf-8, which combines the options /source-charset:utf-8 and /execution-charset:utf-8. From the link above:

In those cases where BOM-less UTF-8 files already exist or where changing to a BOM is a problem, use the /source-charset:utf-8 option to correctly read these files.

Use of /execution-charset or /utf-8 can help when targeting code between Linux and Windows as Linux commonly uses BOM-less UTF-8 files and a UTF-8 execution character set.
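An illustrative command line, together with its effect on a narrow literal:

    cl /utf-8 main.cpp

    // With /utf-8 the source file is read as UTF-8 and this narrow
    // literal is stored as UTF-8 bytes in the compiled binary.
    const char* s = "héllo";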

Project Properties/Configuration Properties/General/Character Set only sets the Unicode/MBCS preprocessor macros, not the source character set or the execution character set.

answered Oct 16 '22 by Roi Danton


It should be noted that the pragma execution_character_set only seems to apply to character string literals ("Hello World") and not to wide character string literals (L"Hello World").
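A short illustration of that distinction (string contents are just examples):

    #pragma execution_character_set("utf-8")

    const char*    narrow = "héllo";  // affected: stored as UTF-8 bytes
    const wchar_t* wide   = L"héllo"; // unaffected: always UTF-16LE on Windows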

I did some experiments to find out how source and execution character sets are implemented in MSVC. I ran them with Visual Studio 2015 on a Windows system where CP_ACP is 1252, and I summarize the results as follows:

Character literals

  • If MSVC determines the source file to be a Unicode file, that is, one encoded in UTF-8 or UTF-16, it converts characters to CP_ACP. If a Unicode character is not within the range of CP_ACP, MSVC issues warning C4566 ("character represented by universal-character-name '\U0001D575' cannot be represented in the current code page (1252)"); see the snippet after this list. MSVC assumes the execution character set of the compiled software is the compiler's CP_ACP. That implies you should compile the software under the CP_ACP of the target environment, i.e. if you want to execute the software on a Windows system with code page 1252, you should compile it under code page 1252 and not execute it on a system with any other code page. In practice it might work if your literals are ASCII encoded (C0 Controls and Basic Latin Unicode block), since most common SBCS code pages extend this encoding. However, there are some which do not, especially DBCS code pages.

  • If MSVC determines that the source file is not a Unicode file, it interprets the source file according to CP_ACP and assumes that the execution character set is CP_ACP. As with Unicode files, you should compile the software under the CP_ACP of the target environment, and you have the same problems.
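A small snippet illustrating the conversion behaviour described above, assuming the source file is saved as UTF-8 with BOM and the compiler runs under code page 1252:

    const char* ok  = "é";          // representable: stored as byte 0xE9 (CP 1252)
    const char* bad = "\U0001D575"; // warning C4566: character cannot be
                                    // represented in the current code page (1252)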

All "ANSI" Windows API functions (e.g. CreateFileA) interpret strings of type LPSTR according to CP_ACP or CP_THREAD_ACP (which defaults to CP_ACP). It's not easy to find out which functions use CP_ACP or CP_THREAD_ACP so it's best to never change CP_THREAD_ACP.

Wide character literals

The execution character set for wide character literals is always Unicode, and the encoding is UTF-16LE. All wide character Windows API functions (e.g. CreateFile) interpret strings of type LPWSTR as UTF-16LE strings. That also implies that wcslen does not return the number of Unicode characters but the number of wchar_t units of a wide character string (see the sketch after the list below). UTF-16 is also different from UCS-2 in some cases.

  • If MSVC determines the source file to be a Unicode file, it converts the characters to UTF-16LE.
  • If MSVC determines that the source file is not a Unicode file, it reads the file according to CP_ACP and extends the characters to two bytes without interpreting them. That is, if a character is encoded as 0xFF in CP_ACP, it will be written as 0x00 0xFF regardless of whether the CP_ACP character 0xFF is the Unicode character U+00FF.
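A sketch of the wcslen point mentioned above (wchar_t is 2 bytes on Windows):

    #include <wchar.h>

    // U+1D575 lies outside the Basic Multilingual Plane, so UTF-16
    // encodes it as a surrogate pair occupying two wchar_t units.
    const wchar_t* s = L"\U0001D575";
    // wcslen(s) == 2: two wchar_t units, but only one Unicode character.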

I haven't had the chance to repeat my experiments on a DBCS Windows system because I don't speak the languages that usually use such code pages. Perhaps somebody can repeat the experiments on such a system.

For me the conclusion of the experiment is that you should avoid character literals, even if you use the execution_character_set pragma.

The pragma just changes how character string literals are encoded in the binary but does not change the execution character set of the libraries you use or the kernel. If you wanted to use the execution_character_set pragma consistently, you would have to recompile Windows and all other libraries you use with the pragma, which is of course impossible. So I would recommend against using it. It might work on some systems, since UTF-8 works with most character string functions in the CRT and CP_ACP usually includes ASCII, but you should check whether these assumptions really hold in your target environment and whether the required effort of this misuse is really worth it. Moreover, the pragma seems to be undocumented and might not work in future releases.

Otherwise you have to compile separate binaries for all code pages that are in use on your target systems. The only way to avoid multiple binaries is to externalize all strings to resources, which are UTF-16LE encoded, and convert the strings to CP_ACP if required. In this case you have to save the resource scripts (.rc files) as UTF-8, invoke rc with /c65001 (UTF-16LE does not work), and include the strings for all code pages in use on your target systems; a sketch follows.
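A sketch of that workflow; the string ID, symbol names and file names are illustrative:

    // strings.rc -- saved as UTF-8, compiled with: rc /c65001 strings.rc
    #include "resource.h"   // e.g. #define IDS_GREETING 101
    STRINGTABLE
    BEGIN
        IDS_GREETING "héllo"
    END

    // C++ side: load the UTF-16LE string from the resources at runtime.
    #include <windows.h>
    #include "resource.h"

    wchar_t buf[256];
    LoadStringW(GetModuleHandleW(nullptr), IDS_GREETING, buf, 256);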

If you can't externalize the strings to resources, I would advise encoding your files in a Unicode encoding, such as UTF-8 or UTF-16LE, using wide character literals, and compiling with UNICODE and _UNICODE defined. It's not advisable to use string and character literals anyhow; prefer resources. Use WideCharToMultiByte and MultiByteToWideChar for functions which expect strings that are encoded according to CP_ACP or some other code page.
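A minimal conversion helper using those APIs (error handling omitted; the function name is illustrative):

    #include <windows.h>
    #include <string>

    // Convert a UTF-8 encoded string to UTF-16LE for the wide Windows APIs.
    std::wstring Utf8ToWide(const std::string& utf8)
    {
        // First call computes the required buffer size; the result
        // includes the terminating null because cbMultiByte is -1.
        int len = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, nullptr, 0);
        std::wstring wide(len, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &wide[0], len);
        wide.resize(len - 1);  // drop the trailing null the API wrote
        return wide;
    }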

MSVC's source encoding detection heuristic works best when a BOM is present (even in UTF-8).

I'm not an expert on Asian languages, but I have read that Han unification in Unicode is controversial. So using Unicode might not be the solution to all problems, and there might be cases where it doesn't meet the requirements, but I would say that for the majority of languages Unicode is what works best under Windows.

It's a mistake on Microsoft's part not to be explicit about this and not to document the behaviour of their compilers and operating system.

answered Oct 16 '22 by user3998276