Does VC have a compile option like '-fexec-charset' in GCC to set the execution character set?

GCC has three compile options, -finput-charset, -fexec-charset and -fwide-exec-charset, to specify the particular encodings involved in a "compile chain", like the following:

+--------+   -finput-charset     +----------+    -fexec-charset (or)    +-----+
| source | ------------------->  | compiler |  -----------------------> | exe |
+--------+                       +----------+    -fwide-exec-charset    +-----+

Reference: GCC compiler options
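An illustrative invocation (the file name and charsets are just examples):

    gcc -finput-charset=UTF-8 -fexec-charset=UTF-8 -fwide-exec-charset=UTF-32LE main.c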

I found a question about -finput-charset here: Specification of source charset encoding in MSVC++, like gcc “-finput-charset=CharSet”. But I want to know whether VC has a compiler option like -fexec-charset in GCC to specify the execution character set.

I found a seemingly related option in Visual Studio: Project Properties/Configuration Properties/General/Character Set, whose value is Use Unicode Character Set. Does it do the same thing as -fexec-charset in GCC? If so, how can I set the execution character set to UTF-8?

Why do I want to set the execution encoding?

I'm writing an application in C++ which needs to communicate with a db server, and the charset of the tables is utf8. When I run my tests, they catch exceptions thrown around insertion operations on the db tables; the exceptions report that incorrect string values were encountered. I suppose that's caused by a wrong encoding, right? BTW, are there any other ways to handle this issue?

asked May 12 '14 by Ggicci

3 Answers

AFAIK, VC++ doesn't have a command-line flag to let you specify a UTF-8 execution character set. However, it does (sporadically) support the undocumented

#pragma execution_character_set("utf-8")

referred to here.

To get the effect of a command-line flag with this pragma, you can write the pragma in a header file, say preinclude.h, and pre-include this header in every compilation by passing the flag /FI preinclude.h. See this documentation for how to set this flag from the IDE.
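A minimal sketch of this setup (the header name is just an example):

    // preinclude.h -- force UTF-8 encoded narrow string literals
    #pragma execution_character_set("utf-8")

Then compile each translation unit with the header force-included:

    cl /FI preinclude.h /c main.cpp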

The pragma was supported in VC++ 2010, then forgotten in VC++ 2012, and is supported again in VC++ 2013.

answered Oct 16 '22 by Mike Kinghan


Visual Studio 2015 Update 2 and later support setting the execution character set:

You can use the option /utf-8, which combines the options /source-charset:utf-8 and /execution-charset:utf-8. From the link above:

In those cases where BOM-less UTF-8 files already exist or where changing to a BOM is a problem, use the /source-charset:utf-8 option to correctly read these files.

Use of /execution-charset or /utf-8 can help when targeting code between Linux and Windows as Linux commonly uses BOM-less UTF-8 files and a UTF-8 execution character set.
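An illustrative command line, together with its effect on a narrow literal:

    cl /utf-8 main.cpp

    // With /utf-8 the source file is read as UTF-8 and this narrow
    // literal is stored as UTF-8 bytes in the compiled binary.
    const char* s = "héllo";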

Project Properties/Configuration Properties/General/Character Set only sets the Unicode/MBCS preprocessor macros, not the source character set or the execution character set.

answered Oct 16 '22 by Roi Danton


It should be noted that the pragma execution_character_set only seems to apply to character string literals ("Hello World") and not to wide character string literals (L"Hello World").
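A short illustration of that distinction (string contents are just examples):

    #pragma execution_character_set("utf-8")

    const char*    narrow = "héllo";  // affected: stored as UTF-8 bytes
    const wchar_t* wide   = L"héllo"; // unaffected: always UTF-16LE on Windows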

I did some experiments to find out how source and execution character sets are implemented in MSVC. I ran them with Visual Studio 2015 on a Windows system where CP_ACP is 1252, and I summarize the results as follows:

Character literals

  • If MSVC determines the source file to be a Unicode file, that is, one encoded in UTF-8 or UTF-16, it converts characters to CP_ACP. If a Unicode character is not within the range of CP_ACP, MSVC issues warning C4566 ("character represented by universal-character-name '\U0001D575' cannot be represented in the current code page (1252)"); see the snippet after this list. MSVC assumes the execution character set of the compiled software is the compiler's CP_ACP. That implies you should compile the software under the CP_ACP of the target environment, i.e. if you want to execute the software on a Windows system with code page 1252, you should compile it under code page 1252 and not execute it on a system with any other code page. In practice it might work if your literals are ASCII encoded (C0 Controls and Basic Latin Unicode block), since most common SBCS code pages extend this encoding. However, there are some which do not, especially DBCS code pages.

  • If MSVC determines that the source file is not a Unicode file, it interprets the source file according to CP_ACP and assumes that the execution character set is CP_ACP. As with Unicode files, you should compile the software under the CP_ACP of the target environment, and you have the same problems.
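A small snippet illustrating the conversion behaviour described above, assuming the source file is saved as UTF-8 with BOM and the compiler runs under code page 1252:

    const char* ok  = "é";          // representable: stored as byte 0xE9 (CP 1252)
    const char* bad = "\U0001D575"; // warning C4566: character cannot be
                                    // represented in the current code page (1252)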

All "ANSI" Windows API functions (e.g. CreateFileA) interpret strings of type LPSTR according to CP_ACP or CP_THREAD_ACP (which defaults to CP_ACP). It's not easy to find out which functions use CP_ACP or CP_THREAD_ACP so it's best to never change CP_THREAD_ACP.

Wide character literals

The execution character set for wide character literals is always Unicode, and the encoding is UTF-16LE. All wide character Windows API functions (e.g. CreateFile) interpret strings of type LPWSTR as UTF-16LE strings. That also implies that wcslen does not return the number of Unicode characters but the number of wchar_t units of a wide character string (see the sketch after the list below). UTF-16 is also different from UCS-2 in some cases.

  • If MSVC determines the source file to be a Unicode file, it converts the characters to UTF-16LE.
  • If MSVC determines that the source file is not a Unicode file, it reads the file according to CP_ACP and extends the characters to two bytes without interpreting them. That is, if a character is encoded as 0xFF in CP_ACP, it will be written as 0x00 0xFF regardless of whether the CP_ACP character 0xFF is the Unicode character U+00FF.
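A sketch of the wcslen point mentioned above (wchar_t is 2 bytes on Windows):

    #include <wchar.h>

    // U+1D575 lies outside the Basic Multilingual Plane, so UTF-16
    // encodes it as a surrogate pair occupying two wchar_t units.
    const wchar_t* s = L"\U0001D575";
    // wcslen(s) == 2: two wchar_t units, but only one Unicode character.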

I haven't had the chance to repeat my experiments on a DBCS Windows system because I don't speak the languages that usually use such code pages. Perhaps somebody can repeat the experiments on such a system.

For me the conclusion of the experiment is that you should avoid character literals, even if you use the execution_character_set pragma.

The pragma just changes how character string literals are encoded in the binary but does not change the execution character set of the libraries you use or the kernel. If you wanted to use the execution_character_set pragma consistently, you would have to recompile Windows and all other libraries you use with the pragma, which is of course impossible. So I would recommend against using it. It might work on some systems, since UTF-8 works with most character string functions in the CRT and CP_ACP usually includes ASCII, but you should check whether these assumptions really hold in your target environment and whether the required effort of this misuse is really worth it. Moreover, the pragma seems to be undocumented and might not work in future releases.

Otherwise you have to compile separate binaries for all code pages that are in use on your target systems. The only way to avoid multiple binaries is to externalize all strings to resources, which are UTF-16LE encoded, and convert the strings to CP_ACP if required. In this case you have to save the resource scripts (.rc files) as UTF-8, invoke rc with /c65001 (UTF-16LE does not work), and include the strings for all code pages in use on your target systems; a sketch follows.
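A sketch of that workflow; the string ID, symbol names and file names are illustrative:

    // strings.rc -- saved as UTF-8, compiled with: rc /c65001 strings.rc
    #include "resource.h"   // e.g. #define IDS_GREETING 101
    STRINGTABLE
    BEGIN
        IDS_GREETING "héllo"
    END

    // C++ side: load the UTF-16LE string from the resources at runtime.
    #include <windows.h>
    #include "resource.h"

    wchar_t buf[256];
    LoadStringW(GetModuleHandleW(nullptr), IDS_GREETING, buf, 256);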

If you can't externalize the strings to resources, I would advise encoding your files in a Unicode encoding, such as UTF-8 or UTF-16LE, using wide character literals, and compiling with UNICODE and _UNICODE defined. It's not advisable to use string and character literals anyhow; prefer resources. Use WideCharToMultiByte and MultiByteToWideChar for functions which expect strings that are encoded according to CP_ACP or some other code page.
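A minimal conversion helper using those APIs (error handling omitted; the function name is illustrative):

    #include <windows.h>
    #include <string>

    // Convert a UTF-8 encoded string to UTF-16LE for the wide Windows APIs.
    std::wstring Utf8ToWide(const std::string& utf8)
    {
        // First call computes the required buffer size; the result
        // includes the terminating null because cbMultiByte is -1.
        int len = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, nullptr, 0);
        std::wstring wide(len, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &wide[0], len);
        wide.resize(len - 1);  // drop the trailing null the API wrote
        return wide;
    }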

MSVC's source encoding detection heuristic works best when a BOM is present (even in UTF-8).

I'm not an expert on Asian languages, but I have read that Han unification in Unicode is controversial. So using Unicode might not be the solution to all problems, and there might be cases where it doesn't meet the requirements, but I would say that for the majority of languages Unicode is what works best under Windows.

It's a mistake on Microsoft's part not to be explicit about this and not to document the behaviour of their compilers and operating system.

answered Oct 16 '22 by user3998276