How do I get the byte size of a multibyte-character string in Visual C? Is there a function or do I have to count the characters myself?
Or, more general, how do I get the right byte size of a TCHAR string?
Solution:
_tcslen(_T("TCHAR string")) * sizeof(TCHAR)
EDIT:
I was talking about null-terminated strings only.
So a string size is 18 + (2 * number of characters) bytes. (In reality, another 2 bytes is sometimes used for packing to ensure 32-bit alignment, but I'll ignore that). 2 bytes is needed for each character, since .
A null-terminated multibyte string (NTMBS), or "multibyte string", is a sequence of nonzero bytes followed by a byte with value zero (the terminating null character). Each character stored in the string may occupy more than one byte.
A byte is a bit string of length 8. There are 256 different bytes. (Why?) A byte can represent an unsigned integer between 0 and 255, a signed integer between -128 and +127, or a keyboard character such as a letter, digit, or punctuation mark.
A multibyte character is a character composed of sequences of one or more bytes. Each byte sequence represents a single character in the extended character set. Multibyte characters are used in character sets such as Kanji. Wide characters are multilingual character codes that are always 16 bits wide.
Let's see if I can clear this up:
"Multi-byte character string" is a vague term to begin with, but in the world of Microsoft, it typically meants "not ASCII, and not UTF-16". Thus, you could be using some character encoding which might use 1 byte per character, or 2 bytes, or possibly more. As soon as you do, the number of characters in the string != the number of bytes in the string.
Let's take UTF-8 as an example, even though it isn't used on MS platforms. The character é is encoded as "c3 a9" in memory -- thus, two bytes, but 1 character. If I have the string "thé", it's:
text: t  h  é     \0
mem:  74 68 c3 a9 00
This is a "null terminated" string, in that it ends with a null. If we wanted to allow our string to have nulls in it, we'd need to store the size in some other fashion, such as:
struct my_string
{
    size_t length;
    char *data;
};
... and a slew of functions to help deal with that. (This is sort of how std::string works, quite roughly.)
For null-terminated strings, however, strlen() will compute their size in bytes, not characters. (There are other functions for counting characters) strlen just counts the number of bytes before it sees a 0 byte -- nothing fancy.
Now, "wide" or "unicode" strings in the world of MS refer to UTF-16 strings. They have similar problems in that the number of bytes != the number of characters. (Also: the number of bytes / 2 != the number of characters) Let look at thé again:
text:   t      h      é      \0
shorts: 0x0074 0x0068 0x00e9 0x0000
mem:    74 00  68 00  e9 00  00 00
That's "thé" in UTF-16, stored in little endian (which is what your typical desktop is). Notice all the 00 bytes -- these trip up strlen. Thus, we call wcslen, which looks at it as 2-byte shorts, not single bytes.
Lastly, you have TCHARs, which are one of the above two cases, depending on if UNICODE is defined. _tcslen will be the appropriate function (either strlen or wcslen), and TCHAR will be either char or wchar_t. TCHAR was created to ease the move to UTF-16 in the Windows world.
According to MSDN, _tcslen corresponds to strlen when _MBCS is defined. strlen will return the number of bytes in the string. If you use _tcsclen that corresponds to _mbslen which returns the number of multibyte characters.
Also, multibyte strings do not (AFAIK) contain embedded nulls, no.
I would question the use of a multibyte encoding in the first place, though... unless you're supporting a legacy app, there's no reason to choose multibyte over Unicode.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With