For the code below in C:
char s[] = "这个问题";
printf("%s", s);
I know from the file command that the source file is "UTF-8 Unicode C program text".
How is the string encoded after compilation? Is it also UTF-8 in the .out file?
When the binary is executed in bash, how is the string encoded in memory? Is it also UTF-8?
Then, how does bash know the encoding scheme and show the right characters?
Last, now that bash knows what to show, how are the bytes translated to pixels on the screen? Is there some mapping from bytes to pixels?
In all these processes, is there any encoding or decoding of UTF-8?
UTF-8 is an encoding system for Unicode. It can translate any Unicode character to a matching unique binary string, and can also translate the binary string back to a Unicode character. This is the meaning of “UTF”, or “Unicode Transformation Format.”
Most C string library routines still work with UTF-8, since they only scan for the terminating NUL byte, and UTF-8 never places a zero byte inside a multi-byte sequence.
UTF-8 actually works quite well in std::string. Most operations work out of the box because the UTF-8 encoding is self-synchronizing and backward compatible with ASCII.
Assuming GCC, its manual says that the preprocessor first translates the incoming files to the so-called source character set, which for GCC is UTF-8. So for a UTF-8 file, nothing happens. The execution character set is then used for string constants, and that is (again, for GCC) UTF-8 by default.
So your UTF-8 string "survives" and exists in the executable as a bunch of bytes in UTF-8 encoding.
The terminal also has a character set, and it has to match: the C program does nothing further to translate strings when they are printed; they are written out byte for byte. If the terminal isn't set to UTF-8, you will just get garbage.
As I noted in a comment, bash has nothing to do with this.