GCC has three compile options, -finput-charset, -fexec-charset and -fwide-exec-charset, to specify the encodings involved in a "compile chain", like the following:
+--------+   -finput-charset   +----------+   -fexec-charset (or)   +-----+
| source | ------------------> | compiler | ----------------------> | exe |
+--------+                     +----------+   -fwide-exec-charset   +-----+
Reference: GCC compiler options
I found a question about -finput-charset here: Specification of source charset encoding in MSVC++, like gcc “-finput-charset=CharSet”. But I want to know whether VC has a compiler option like GCC's -fexec-charset to specify the execution character set.
I found a seemingly related option in Visual Studio: Project Properties/Configuration Properties/General/Character Set, whose value is Use Unicode Character Set. Does it do the same thing as -fexec-charset in GCC? If so, I want to set the execution character set to UTF-8. How can I do that?
I'm writing an application in C++ which needs to communicate with a DB server, and the charset of the tables is utf8. When I run the tests I built, they catch exceptions thrown around insert operations on the DB tables. The exceptions report incorrect string values. I suppose this is caused by the wrong encoding, right? BTW, are there any other ways to handle this issue?
That set is what the C standard calls the source character set. It must be isomorphic with ISO 10646, also known as Unicode. CPP uses the UTF-8 encoding of Unicode.
The execution character set is the encoding used for the text of your program that is input to the compilation phase after all preprocessing steps. This character set is used for the internal representation of any string or character literals in the compiled code.
The default is UTF-8. charset can be any encoding supported by the system's iconv library routine.
AFAIK, VC++ doesn't have a command-line flag that lets you specify a UTF-8 execution character set. However, it does (sporadically) support the undocumented pragma
#pragma execution_character_set("utf-8")
referred to here.
To get the effect of a command-line flag with this pragma, you can write the pragma in a header file, say preinclude.h, and pre-include this header in every compilation by passing the flag /FI preinclude.h. See this documentation for how to set this flag from the IDE.
The pragma was supported in VC++ 2010, then forgotten in VC++ 2012, and is supported again in VC++ 2013.
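As a rough sketch of the pre-include approach described above, the header could look like the following. The include guard name and the `kSample` array are made up for illustration; only the pragma and the file name preinclude.h come from the answer. The pragma is guarded so that other compilers, which do not know it, simply skip it.

```cpp
// preinclude.h: the file name is just the example name used in the answer.
#ifndef PREINCLUDE_H  // hypothetical guard name
#define PREINCLUDE_H

#if defined(_MSC_VER)
// MSVC-specific and undocumented: asks the compiler to encode the narrow
// string literals that follow as UTF-8.
#pragma execution_character_set("utf-8")
#endif

#endif // PREINCLUDE_H

// Stand-in for a translation unit compiled with "/FI preinclude.h".
// U+00FC is spelled as a universal-character-name so that the source file
// itself stays pure ASCII and the source character set does not matter.
static const char kSample[] = "\u00FC";
```

If the pragma is honoured, `sizeof(kSample)` should be 3 (two UTF-8 bytes plus the terminating NUL) rather than 2, which is what a single-byte code page such as 1252 would give.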
Visual Studio 2015 Update 2 and later support setting the execution character set:
You can use the option /utf-8, which combines the options /source-charset:utf-8 and /execution-charset:utf-8. From the link above:
In those cases where BOM-less UTF-8 files already exist or where changing to a BOM is a problem, use the /source-charset:utf-8 option to correctly read these files.
Use of /execution-charset or /utf-8 can help when targeting code between Linux and Windows as Linux commonly uses BOM-less UTF-8 files and a UTF-8 execution character set.
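One way to check whether such a flag actually took effect is to look at the bytes of a narrow literal at compile time. This is only a sketch: the function name `execution_charset_is_utf8` is invented here, and the check assumes that U+00E9 encoded as the two bytes 0xC3 0xA9 is sufficient evidence of a UTF-8 execution character set.

```cpp
// U+00E9 is written as a universal-character-name so the source file stays
// pure ASCII and the result depends only on the execution character set.
constexpr bool execution_charset_is_utf8() {
    return sizeof("\u00E9") == 3  // two UTF-8 bytes + terminating NUL
        && static_cast<unsigned char>("\u00E9"[0]) == 0xC3
        && static_cast<unsigned char>("\u00E9"[1]) == 0xA9;
}

// Fails the build instead of producing wrongly encoded strings at run time.
static_assert(execution_charset_is_utf8(),
              "narrow string literals are not UTF-8; on MSVC try /utf-8 "
              "or /execution-charset:utf-8");
```

With GCC's default -fexec-charset (UTF-8) the assertion passes. With MSVC on a code page 1252 system and no /utf-8, the literal should occupy a single byte and the build should stop at the `static_assert`.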
Project Properties/Configuration Properties/General/Character Set only sets the UNICODE/MBCS macros, not the source character set or the execution character set.
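To see that the IDE setting only toggles preprocessor macros, a small probe such as the one below can be compiled under each setting. The function name `character_set_macros` and the returned labels are invented for this sketch. It relies on the "Use Unicode Character Set" setting defining UNICODE and _UNICODE and the "Use Multi-Byte Character Set" setting defining _MBCS.

```cpp
#include <string>

// Reports which of the macros controlled by the "Character Set" project
// property are defined. It says nothing about how literals are encoded.
std::string character_set_macros() {
#if defined(_UNICODE) || defined(UNICODE)
    return "Unicode";   // "Use Unicode Character Set"
#elif defined(_MBCS)
    return "MBCS";      // "Use Multi-Byte Character Set"
#else
    return "not set";   // e.g. "Not Set", or a non-MSVC toolchain
#endif
}
```

Whatever this returns, the bytes of a narrow literal such as "\u00E9" stay the same, because the macros mainly select between the A and W variants of the Windows API and the TCHAR mappings.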
It should be noted that the execution_character_set pragma seems to apply only to character string literals ("Hello World") and not to wide character string literals (L"Hello World").
I did some experiments to find out how source and execution character sets are implemented in MSVC. I did the experiments with Visual Studio 2015 on a Windows system where CP_ACP is 1252 and summarize the results as follows:
Character literals
If MSVC determines the source file to be a Unicode file, that is, it is encoded in UTF-8 or UTF-16, it converts characters to CP_ACP. If a Unicode character is not within the range of CP_ACP, MSVC issues a C4566 warning ("character represented by universal-character-name '\U0001D575' cannot be represented in the current code page (1252)"). MSVC assumes the execution character set of the compiled software is the CP_ACP of the compiler. That implies that you should compile the software under the CP_ACP of the target environment, i.e. if you want to execute the software on a Windows system with code page 1252, you should compile it under code page 1252 and not execute it on a system with any other code page. In practice it might work if your literals are ASCII encoded (C0 Control and Basic Latin Unicode block), since most common SBCS code pages extend this encoding. However, there are some which do not, especially DBCS code pages.
If MSVC determines that the source file is not a Unicode file, it interprets the source file according to CP_ACP and assumes that the execution character set is CP_ACP. As with Unicode files, you should compile the software under the CP_ACP of the target environment, and you have the same problems.
All "ANSI" Windows API functions (e.g. CreateFileA) interpret strings of type LPSTR according to CP_ACP or CP_THREAD_ACP (which defaults to CP_ACP). It's not easy to find out which functions use CP_ACP and which use CP_THREAD_ACP, so it's best to never change CP_THREAD_ACP.
Wide character literals
The execution character set for wide character literals is always Unicode and the encoding is UTF-16LE. All wide character Windows API functions (e.g. CreateFileW) interpret strings of type LPWSTR as UTF-16LE strings. That also implies that wcslen does not return the number of Unicode characters but the number of wchar_t code units in a wide character string. UTF-16 is also different from UCS-2 in some cases.
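The difference between code units and characters can be illustrated with U+1D575, the character from the C4566 warning quoted earlier. The sketch below deliberately uses char16_t instead of wchar_t so that it behaves the same on platforms where wchar_t is 32 bits wide; the helper names are made up.

```cpp
#include <cstddef>
#include <string>

// Number of UTF-16 code units. This corresponds to what wcslen reports for
// a wchar_t string on Windows, where wchar_t holds one UTF-16 code unit.
std::size_t count_code_units(const std::u16string& s) {
    return s.size();
}

// Number of Unicode code points: a surrogate pair counts once because the
// low (trailing) surrogate 0xDC00..0xDFFF is skipped.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t unit : s) {
        if (unit < 0xDC00 || unit > 0xDFFF) {
            ++n;
        }
    }
    return n;
}
```

For `u"\U0001D575"` the first function returns 2 and the second returns 1, because the character lies outside the Basic Multilingual Plane and needs a surrogate pair. This is also a case where UTF-16 and UCS-2 differ.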
If MSVC determines that the source file is not a Unicode file, it interprets the source file according to CP_ACP and extends the characters to two bytes without interpreting them. That is, if a character is encoded as 0xFF in CP_ACP, it will be written as 0x00 0xFF regardless of whether the CP_ACP character 0xFF is the Unicode character U+00FF.
I haven't had the chance to repeat my experiments on a DBCS Windows system because I don't speak the languages that usually use such code pages. Perhaps somebody can repeat the experiments on such a system.
For me the conclusion of the experiment is that you should avoid character literals, even if you use the execution_character_set pragma.
The pragma just changes how character string literals are encoded in the binary but does not change the execution character set of the libraries you use or of the kernel. If you wanted to use the execution_character_set pragma, you would have to recompile Windows and all other libraries you use with the pragma, which is of course impossible. So I would recommend against using it. It might work on some systems, since UTF-8 works with most character string functions in the CRT and CP_ACP usually includes ASCII, but you should check whether these assumptions really hold in your target environment and whether the effort this misuse requires is really worth it. Moreover, the pragma seems to be undocumented and it might not work in future releases.
Otherwise you have to compile separate binaries for all code pages that are in use on your target systems. The only way to avoid multiple binaries is to externalize all strings to resources, which are UTF-16LE encoded, and convert the strings to CP_ACP if required. In this case you have to save the resource scripts (.rc files) as UTF-8, invoke rc with /c65001 (UTF-16LE does not work) and include the strings for all code pages that are in use on your target systems.
I would advise you to encode your files in a Unicode encoding, such as UTF-8 or UTF-16LE, and, if you can't externalize the strings to resources, to use wide character literals and compile with UNICODE and _UNICODE defined. It's not advisable to use string and character literals anyhow; prefer resources. Use WideCharToMultiByte and MultiByteToWideChar for functions which expect strings that are encoded according to CP_ACP or some other code page.
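For the database case in the question, the "other code page" would be UTF-8. A minimal sketch of such a conversion follows; the function name `wide_to_utf8` is invented, and error handling is reduced to returning an empty string. On Windows it calls WideCharToMultiByte with CP_UTF8. The `#else` branch is not part of the advice above: it is a hand-written fallback added only so the sketch also compiles and behaves the same on other platforms.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

std::string wide_to_utf8(const std::wstring& wide) {
    if (wide.empty()) return std::string();
#ifdef _WIN32
    const int len = static_cast<int>(wide.size());
    // First call: ask how many bytes the UTF-8 result needs.
    const int size = WideCharToMultiByte(CP_UTF8, 0, wide.data(), len,
                                         nullptr, 0, nullptr, nullptr);
    if (size <= 0) return std::string();
    std::string utf8(static_cast<std::size_t>(size), '\0');
    // Second call: perform the conversion into the buffer.
    WideCharToMultiByte(CP_UTF8, 0, wide.data(), len,
                        &utf8[0], size, nullptr, nullptr);
    return utf8;
#else
    // Illustrative fallback: accepts UTF-16 or UTF-32 wchar_t input.
    std::string utf8;
    for (std::size_t i = 0; i < wide.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(wide[i]);
        // Combine a UTF-16 surrogate pair into one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < wide.size()) {
            const std::uint32_t low = static_cast<std::uint32_t>(wide[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }
        if (cp < 0x80) {
            utf8 += static_cast<char>(cp);
        } else if (cp < 0x800) {
            utf8 += static_cast<char>(0xC0 | (cp >> 6));
            utf8 += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            utf8 += static_cast<char>(0xE0 | (cp >> 12));
            utf8 += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            utf8 += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            utf8 += static_cast<char>(0xF0 | (cp >> 18));
            utf8 += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            utf8 += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            utf8 += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return utf8;
#endif
}
```

A wide literal such as `L"\u00E9"` can then be passed through `wide_to_utf8` before it is handed to the database client, which sidesteps the narrow execution character set entirely. The reverse direction would use MultiByteToWideChar in the same two-call pattern.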
The source encoding detection heuristic of MSVC works best with BOM enabled (even in UTF-8).
I'm not an expert on Asian languages, but I read that Han unification in Unicode is controversial. So using Unicode might not be the solution to all problems, and there might be cases where it doesn't meet the requirements, but I would say that for the majority of languages Unicode is what works best under Windows.
It's a mistake on Microsoft's part not to be explicit about this and not to document the behaviour of their compilers and operating system.