I'm having some trouble figuring out the exact semantics of std::string::length(). The documentation explicitly points out that length() returns the number of characters in the string, not the number of bytes. I was wondering in which cases this actually makes a difference.
In particular, is this only relevant to non-char instantiations of std::basic_string<>, or can I also get into trouble when storing UTF-8 strings with multi-byte characters? Does the standard allow length() to be UTF-8-aware?
When dealing with non-char instantiations of std::basic_string<>, sure, the length may not equal the number of bytes. This is particularly evident with std::wstring:

std::wstring ws = L"hi";
std::cout << ws.length(); // prints 2 (characters), not 2 * sizeof(wchar_t) (bytes)

But std::string is about char characters; there is no such thing as a multi-byte character as far as std::string is concerned, whether you crammed one in at a high level or not. So std::string::length() is always the number of bytes represented by the string. Note that if you're cramming multi-byte "characters" into a std::string, then your definition of "character" is suddenly at odds with that of the container and of the standard.
If we are talking specifically about std::string, then length() does return the number of bytes.
This is because a std::string is a basic_string of chars, and the C++ Standard defines the size of one char to be exactly one byte.
Note that the Standard doesn't say how many bits are in a byte, but that's another story entirely and you probably don't care.
EDIT: The Standard does say that an implementation shall provide a definition for CHAR_BIT, which says how many bits are in a byte.
By the way, if you go down a road where you do care how many bits are in a byte, you might consider reading this.
A std::string is std::basic_string<char>, so s.length() * sizeof(char) gives the byte length (and sizeof(char) is always 1). Also, std::string knows nothing of UTF-8, so you're going to get the byte size even if that's not really what you're after.
If you have UTF-8 data in a std::string, you'll need to use something else, such as ICU, to get the "real" length.