
How to get byte size of multibyte string

How do I get the byte size of a multibyte-character string in Visual C? Is there a function or do I have to count the characters myself?

Or, more generally, how do I get the right byte size of a TCHAR string?

Solution:

_tcslen(_T("TCHAR string")) * sizeof(TCHAR)

EDIT:
I was talking about null-terminated strings only.

asked Jul 28 '10 23:07 by flacs



2 Answers

Let's see if I can clear this up:

"Multi-byte character string" is a vague term to begin with, but in the world of Microsoft, it typically means "not ASCII, and not UTF-16". Thus, you could be using some character encoding which might use 1 byte per character, or 2 bytes, or possibly more. As soon as any character takes more than one byte, the number of characters in the string != the number of bytes in the string.

Let's take UTF-8 as an example, even though it has not traditionally been the default on MS platforms. The character é is encoded as "c3 a9" in memory -- thus, two bytes, but 1 character. If I have the string "thé", it's:

text: t  h  é     \0
mem:  74 68 c3 a9 00

This is a "null terminated" string, in that it ends with a null. If we wanted to allow our string to have nulls in it, we'd need to store the size in some other fashion, such as:

struct my_string
{
    size_t length;
    char *data;
};

... and a slew of functions to help deal with that. (This is sort of how std::string works, quite roughly.)
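As a rough sketch of what one of those helper functions might look like (my_string_from is a hypothetical name, not part of any library), here is a length-carrying string that can hold an embedded null:

```c
#include <assert.h>
#include <stddef.h>

struct my_string
{
    size_t length;  /* number of bytes in data, embedded nulls included */
    char *data;
};

/* Hypothetical helper: wrap an existing buffer of known size, without copying. */
struct my_string my_string_from(char *data, size_t length)
{
    struct my_string s;
    assert(data != NULL);
    s.length = length;
    s.data = data;
    return s;
}
```

For example, with char raw[] = "ab\0cd", calling my_string_from(raw, sizeof raw - 1) records a length of 5 bytes, whereas strlen(raw) would stop at the embedded null and report 2.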

For null-terminated strings, however, strlen() will compute their size in bytes, not characters. (There are other functions for counting characters.) strlen just counts the number of bytes before it sees a 0 byte -- nothing fancy.
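To make that concrete, here is a minimal sketch using the UTF-8 "thé" from above. The function names byte_size and byte_size_with_nul are made up for illustration; the only library call is strlen.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bytes occupied by a null-terminated char string, excluding the terminator. */
size_t byte_size(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

/* Same, but including the terminating 0 byte (the size a copy would need). */
size_t byte_size_with_nul(const char *s)
{
    assert(s != NULL);
    return strlen(s) + 1;
}
```

With the bytes 74 68 c3 a9 00, written in C as "th\xc3\xa9", byte_size gives 4 and byte_size_with_nul gives 5, even though the string has only 3 characters.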

Now, "wide" or "unicode" strings in the world of MS refer to UTF-16 strings. They have similar problems in that the number of bytes != the number of characters. (Also: the number of bytes / 2 != the number of characters.) Let's look at thé again:

text:   t      h      é      \0
shorts: 0x0074 0x0068 0x00e9 0x0000
mem:    74 00  68 00  e9 00  00 00

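That's "thé" in UTF-16, stored in little endian (which is what your typical desktop is). Notice all the 00 bytes -- these trip up strlen, which would stop at the first one and report a length of 1. Thus, we call wcslen, which looks at it as 2-byte shorts, not single bytes, and returns the number of those shorts before the terminating 0x0000.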
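A small sketch of the wide-string case follows. wide_byte_size is a hypothetical name. It multiplies by sizeof(wchar_t) rather than a hard-coded 2, because wchar_t is 2 bytes with Visual C but commonly 4 bytes on other platforms.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Bytes occupied by a null-terminated wide string, excluding the terminator. */
size_t wide_byte_size(const wchar_t *s)
{
    assert(s != NULL);
    /* wcslen counts wchar_t units, so scale by the unit size to get bytes. */
    return wcslen(s) * sizeof(wchar_t);
}
```

For L"th\x00e9" this gives 3 * sizeof(wchar_t), which is 6 bytes with Visual C, matching the memory dump above minus the two terminator bytes.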

Lastly, you have TCHARs, which are one of the above two cases, depending on whether UNICODE (for the Windows headers) and _UNICODE (for the C runtime's tchar.h) are defined. _tcslen will be the appropriate function (either strlen or wcslen), and TCHAR will be either char or wchar_t. TCHAR was created to ease the move to UTF-16 in the Windows world.
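The real TCHAR, _T and _tcslen come from Microsoft's tchar.h, so they only build with a Windows toolchain. The sketch below imitates the same switch with made-up stand-ins (demo_tchar, DEMO_T, demo_tcslen, and the DEMO_UNICODE macro are all hypothetical) just to show why multiplying the length by the element size gives the byte size in both builds:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Stand-ins for TCHAR, _T() and _tcslen; define DEMO_UNICODE for the wide build. */
#ifdef DEMO_UNICODE
typedef wchar_t demo_tchar;
#define DEMO_T(s) L##s
#define demo_tcslen wcslen
#else
typedef char demo_tchar;
#define DEMO_T(s) s
#define demo_tcslen strlen
#endif

/* Byte size of a null-terminated demo_tchar string, excluding the terminator. */
size_t demo_tstr_bytes(const demo_tchar *s)
{
    assert(s != NULL);
    return demo_tcslen(s) * sizeof(demo_tchar);
}
```

demo_tstr_bytes(DEMO_T("TCHAR string")) is 12 * sizeof(demo_tchar) in either build. On Windows the equivalent expression is the one from the question: _tcslen(str) * sizeof(TCHAR), plus one more sizeof(TCHAR) if the terminator should be included.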

answered Sep 29 '22 15:09 by Thanatos


According to MSDN, _tcslen corresponds to strlen when _MBCS is defined. strlen will return the number of bytes in the string. If you use _tcsclen instead, that maps to _mbslen, which returns the number of multibyte characters.
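_mbslen is Microsoft-specific and works on the current multibyte code page, so it is not reproduced here. As a portable illustration of the same bytes-versus-characters distinction, the sketch below counts characters in a string that is assumed to be valid UTF-8 (utf8_char_count is a hypothetical helper, not a library function):

```c
#include <assert.h>
#include <stddef.h>

/* Count characters (code points) in a null-terminated string assumed to be valid UTF-8. */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        /* Continuation bytes look like 10xxxxxx; every other byte starts a character. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For "th\xc3\xa9" this returns 3, while strlen on the same string returns 4.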

Also, multibyte strings do not (AFAIK) contain embedded nulls, no.

I would question the use of a multibyte encoding in the first place, though... unless you're supporting a legacy app, there's no reason to choose multibyte over Unicode.

answered Sep 29 '22 16:09 by Dean Harding