How do I get the byte size of a multibyte-character string in Visual C? Is there a function or do I have to count the characters myself?
Or, more general, how do I get the right byte size of a TCHAR string?
Solution:
_tcslen(_T("TCHAR string")) * sizeof(TCHAR)
EDIT:
I was talking about null-terminated strings only.
So a string size is 18 + (2 * number of characters) bytes. (In reality, another 2 bytes is sometimes used for packing to ensure 32-bit alignment, but I'll ignore that). 2 bytes is needed for each character, since .
A null-terminated multibyte string (NTMBS), or "multibyte string", is a sequence of nonzero bytes followed by a byte with value zero (the terminating null character). Each character stored in the string may occupy more than one byte.
A byte is a bit string of length 8. There are 256 different bytes. (Why?) A byte can represent an unsigned integer between 0 and 255, a signed integer between -128 and +127, or a keyboard character such as a letter, digit, or punctuation mark.
A multibyte character is a character composed of sequences of one or more bytes. Each byte sequence represents a single character in the extended character set. Multibyte characters are used in character sets such as Kanji. Wide characters are multilingual character codes that are always 16 bits wide.
Let's see if I can clear this up:
"Multi-byte character string" is a vague term to begin with, but in the world of Microsoft, it typically meants "not ASCII, and not UTF-16". Thus, you could be using some character encoding which might use 1 byte per character, or 2 bytes, or possibly more. As soon as you do, the number of characters in the string != the number of bytes in the string.
Let's take UTF-8 as an example, even though it isn't used on MS platforms. The character é is encoded as "c3 a9" in memory -- thus, two bytes, but 1 character. If I have the string "thé", it's:
text: t h é \0
mem: 74 68 c3 a9 00
This is a "null terminated" string, in that it ends with a null. If we wanted to allow our string to have nulls in it, we'd need to store the size in some other fashion, such as:
struct my_string
{
size_t length;
char *data;
};
... and a slew of functions to help deal with that. (This is sort of how std::string
works, quite roughly.)
For null-terminated strings, however, strlen()
will compute their size in bytes, not characters. (There are other functions for counting characters) strlen
just counts the number of bytes before it sees a 0 byte -- nothing fancy.
Now, "wide" or "unicode" strings in the world of MS refer to UTF-16 strings. They have similar problems in that the number of bytes != the number of characters. (Also: the number of bytes / 2 != the number of characters) Let look at thé again:
text: t h é \0
shorts: 0x0074 0x0068 0x00e9 0x0000
mem: 74 00 68 00 e9 00 00 00
That's "thé" in UTF-16, stored in little endian (which is what your typical desktop is). Notice all the 00 bytes -- these trip up strlen. Thus, we call wcslen
, which looks at it as 2-byte short
s, not single bytes.
Lastly, you have TCHAR
s, which are one of the above two cases, depending on if UNICODE
is defined. _tcslen
will be the appropriate function (either strlen
or wcslen
), and TCHAR
will be either char
or wchar_t
. TCHAR
was created to ease the move to UTF-16 in the Windows world.
According to MSDN, _tcslen
corresponds to strlen
when _MBCS
is defined. strlen
will return the number of bytes in the string. If you use _tcsclen
that corresponds to _mbslen
which returns the number of multibyte characters.
Also, multibyte strings do not (AFAIK) contain embedded nulls, no.
I would question the use of a multibyte encoding in the first place, though... unless you're supporting a legacy app, there's no reason to choose multibyte over Unicode.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With