I have been confused about this for quite a while:
Confusion on Unicode and Multibyte Articles
After reading the comments from all the contributors, plus:
Looking at an old article (from 2001): http://www.hastingsresearch.com/net/04-unicode-limitations.shtml, which describes Unicode as:
being a 16-bit character definition allowing a theoretical total of over 65,000 characters. However, the complete character sets of the world add up to over 170,000 characters.
and looking at a current "modern" article: http://en.wikipedia.org/wiki/Unicode
The most commonly used encodings are UTF-8 (which uses 1 byte for all ASCII characters, which have the same code values as in the standard ASCII encoding, and up to 4 bytes for other characters), the now-obsolete UCS-2 (which uses 2 bytes for all characters, but does not include every character in the Unicode standard), and UTF-16 (which extends UCS-2, using 4 bytes to encode characters missing from UCS-2).
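To make the UCS-2 versus UTF-16 distinction in that quote concrete, here is a minimal sketch that encodes a single code point as UTF-16 code units. The name `to_utf16` is my own and not a library API, and the sketch assumes `unsigned short` is 16 bits wide, which holds on the usual desktop compilers.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not a library function): encode one Unicode code
// point as UTF-16 code units.
std::vector<unsigned short> to_utf16(unsigned long cp)
{
    // Valid input: up to U+10FFFF, excluding the surrogate range itself.
    assert(cp <= 0x10FFFFul && !(cp >= 0xD800ul && cp <= 0xDFFFul));

    std::vector<unsigned short> units;
    if (cp < 0x10000ul) {
        // Basic Multilingual Plane: one 16-bit unit, same value as in UCS-2.
        units.push_back(static_cast<unsigned short>(cp));
    } else {
        // Supplementary planes: a surrogate pair (two units, 4 bytes),
        // which UCS-2 has no way to express.
        unsigned long v = cp - 0x10000ul;
        units.push_back(static_cast<unsigned short>(0xD800ul + (v >> 10)));
        units.push_back(static_cast<unsigned short>(0xDC00ul + (v & 0x3FFul)));
    }
    return units;
}
```

For example, `to_utf16(0x6211)` (我) gives a single unit, while `to_utf16(0x1F600)` gives the pair 0xD83D, 0xDE00.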
It seems that in the VC2008 project settings, the "Unicode" option under Character Set really means "Unicode encoded in UCS-2" (or UTF-16? I am not sure).
I tried to verify this by running the following code under VC2008:
#include <cstdio>    // getchar
#include <iostream>

int main()
{
    // Use Unicode encoded in UCS-2?
    // 3 characters + terminating null = 4 wchar_t, 2 bytes each under VC2008.
    std::cout << sizeof(L"我爱你") << std::endl;
    // Use Unicode encoded in UCS-2?
    std::cout << sizeof(L"abc") << std::endl;
    getchar();  // keep the console window open
    // Compiled with Character Set: Use Unicode Character Set.
    // Prints 8, 8
    // Compiled with Character Set: Use Multi-Byte Character Set.
    // Prints 8, 8
    return 0;
}
It seems that when compiling with the Unicode Character Set option, the outcome matched my assumption.
But what about Multi-byte Character Set? What does Multi-byte Character Set mean in the current "modern" world? :)
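A likely reason both builds print 8, 8: as far as I know, the Character Set option does not change what an `L"..."` literal means. It defines `UNICODE` and `_UNICODE` (or `_MBCS`), and those macros switch `TCHAR`, `_T()` and the A/W variants of the Win32 functions. An `L"..."` literal is always `wchar_t`, which is 2 bytes under Visual C++. The sketch below mimics that switch with the stand-in names `tchar_t` and `T_LIT` (hypothetical, standing in for `TCHAR` and `_T` from `<tchar.h>`) so that it compiles without Windows headers.

```cpp
#include <cassert>
#include <cstddef>

// Stand-ins for TCHAR and _T() from <tchar.h>, so the sketch builds anywhere.
#if defined(_UNICODE) || defined(UNICODE)
typedef wchar_t tchar_t;
#define T_LIT(x) L##x
#else
typedef char tchar_t;
#define T_LIT(x) x
#endif

// Bytes per character of a "generic text" literal: follows the build setting.
std::size_t generic_char_size()
{
    std::size_t n = sizeof(T_LIT("abc")) / 4;  // 3 characters + terminator
    assert(n == sizeof(tchar_t));              // sanity check on the macro
    return n;
}

// Bytes per character of an L"..." literal: always sizeof(wchar_t),
// whatever the Character Set option says.
std::size_t wide_char_size()
{
    return sizeof(L"abc") / 4;
}
```

Under Visual C++ I would expect `generic_char_size()` to be 2 with "Use Unicode Character Set" and 1 with "Use Multi-Byte Character Set", while `wide_char_size()` stays 2 in both builds. So testing with `_T("abc")` instead of `L"abc"` should show the difference.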
UTF-8. UTF-8 is a multibyte encoding able to encode the whole Unicode character set. An encoded character takes between 1 and 4 bytes. The original UTF-8 design allowed longer byte sequences, up to 6 bytes, but the largest Unicode code point (U+10FFFF) needs only 4 bytes, and RFC 3629 limits UTF-8 to that.
UTF-8 is backward-compatible with ASCII and can represent any standard Unicode character. The first 128 UTF-8 characters precisely match the first 128 ASCII characters (numbered 0-127), meaning that existing ASCII text is already valid UTF-8. All other characters use two to four bytes.
The Difference Between Unicode and UTF-8
Unicode is a character set. UTF-8 is an encoding. Unicode is a list of characters with unique numbers (code points).
UTF stands for "UCS (Unicode) Transformation Format". The UTF-8 encoding can be used to represent any Unicode character. Depending on a Unicode character's numeric value, the corresponding UTF-8 character is a 1, 2, or 3 byte sequence for code points up to U+FFFF, or a 4 byte sequence for code points above that. Table 1 shows the mapping between Unicode and UTF-8.
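The mapping from code point to bytes can also be sketched in code. This is a minimal encoder that assumes the modern definition of UTF-8 (at most 4 bytes, code points up to U+10FFFF); the name `to_utf8` is my own and not a library API.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a library function): encode one Unicode code
// point as a UTF-8 byte sequence.
std::string to_utf8(unsigned long cp)
{
    // Valid input: up to U+10FFFF, excluding the UTF-16 surrogate range.
    assert(cp <= 0x10FFFFul && !(cp >= 0xD800ul && cp <= 0xDFFFul));

    std::string out;
    if (cp < 0x80ul) {                 // 0xxxxxxx (same byte as ASCII)
        out += static_cast<char>(cp);
    } else if (cp < 0x800ul) {         // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000ul) {       // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                           // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, `to_utf8(0x41)` is the single byte `A`, and `to_utf8(0x6211)` (我) is the three bytes E6 88 91.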
http://en.wikipedia.org/wiki/Multi-byte_character_set
MBCS is a term used to denote a class of character encodings with characters that cannot be represented with a single byte, hence multi-byte character set. In order to properly decode a string in this format, you need a code page that tells you how the various byte combinations map to characters. ISO/IEC 8859 defines a family of older code-page style standards (strictly speaking these are single-byte, 8-bit sets rather than multi-byte ones), and according to Wikipedia, ISO stopped maintaining them in 2004, presumably to focus on Unicode.
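To illustrate why decoding needs the code page, here is a rough sketch that counts characters in a Shift-JIS (code page 932) byte string. The lead-byte ranges are hard-coded for that one code page as an assumption of the sketch, and both function names are hypothetical. On Windows, `IsDBCSLeadByteEx` answers the same question for a given code page.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Lead-byte test hard-coded for Shift-JIS / code page 932 (an assumption of
// this sketch; other code pages use different ranges).
bool is_sjis_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Count characters, not bytes, in a Shift-JIS encoded string.
std::size_t sjis_char_count(const std::string& bytes)
{
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < bytes.size()) {
        bool lead = is_sjis_lead_byte(static_cast<unsigned char>(bytes[i]));
        // The input must not end in the middle of a two-byte character.
        assert(!lead || i + 1 < bytes.size());
        i += lead ? 2 : 1;   // a lead byte consumes the following byte too
        ++count;
    }
    return count;
}
```

"日本語" in Shift-JIS is the six bytes 93 FA 96 7B 8C EA, which this counts as 3 characters. Note that the fourth byte, 0x7B, is `{` in ASCII, so code that scans byte by byte without knowing the code page would misread it.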
So I guess the modern term for MBCS is "deprecated in favor of Unicode".