Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What are the problems of a zero-terminated string that length-prefixed strings overcome?

Tags:

c++

c

string

People also ask

What does zero terminated mean?

It's a reserved value to indicate the end of a sequence of (for example) characters in a string. More correctly known as null (or NUL) terminated. This is because the value used is zero, rather than being the character code for '0'.

What is termination of a string?

All character strings are terminated with a null character. The null character indicates the end of the string. Such strings are called null-terminated strings. The null terminator of a multibyte string consists of one byte whose value is 0.

What is a zero terminated array?

In computer programming, a null-terminated string is a character string stored as an array containing the characters and terminated with a null character (a character with a value of zero, called NUL in this article).

What is the reason C string terminates with the \0 character?

C strings are null-terminated. That is, they are terminated by the null character, NUL . They are not terminated by the null pointer NULL , which is a completely different kind of value with a completely different purpose. NUL is guaranteed to have the integer value zero.


One problem is that with zero-terminated strings you have to keep finding the end of the string repeatedly. The classic example where this is inefficient is concatenating into a buffer:

char buf[1024] = "first";
strcat(buf, "second");
strcat(buf, "third");
strcat(buf, "fourth");

On every call to strcat the program has to start from the beginning of the string and find the terminator to know where to start appending. This means the function spends more and more time finding the place to append as the string grows longer.

With a length-prefixed string the equivalent of the strcat function would know where the end is immediately, and would just update the length after appending to it.

There are pros and cons to each way of representing strings and whether they cause problems for you depend on what you are doing with strings, and which operations need to be efficient. The problem described above can be overcome by manually keeping track of the end of the string as it grows, so by changing the code you can avoid the performance cost.


One problem is that you can not store null characters (value zero) in a zero terminated string. This makes it impossible to store some character encodings as well as encrypted data.

Length-prefixed strings do not suffer that limitation.


First a clarification: C++ strings (i.e. std::string) aren't weren't required to end with zero until C++11. They always provided access to a zero-terminated C string though.

C-style strings end with a 0 character for historical reasons.

The problems you're referring to are mainly bound to security issues: zero ended strings need to have a zero terminator. If they lack it (for whatever reason), the string's length becomes unreliable and they can lead to buffer overrun problems (which a malicious attacker can exploit by writing arbitrary data in places where it shouldn't be.. DEP helps in mitigating these issues but it's off-topic here).


It is best summarized in The Most Expensive One-byte Mistake by Poul-Henning Kamp.

  1. Performance Costs: It is cheaper to manipulate memory in chunks, which cannot be done if you're always having to look for the NULL character. In other words if you know before hand you have a 129 character string, it would likely be more efficient to manipulate it in sections of 64, 64, and 1 bytes, instead of character by character.
  2. Security: Marco A. already hit this pretty hard. Over and under-running string buffers is still a major route for attacks by hackers.

  3. Compiler Development Costs: Big costs are associated with optimizing compilers for null terminating strings that would have been easier with the address and length format.

  4. Hardware Development Costs: Hardware development costs are also large for string specific instructions associated with null terminating strings.


A few more bonus features that can be implemented with length-prefixed strings:

  1. It's possible to have multiple styles of length prefix, identifiable through one or more bits of the first byte identified by the string pointer/reference. In exchange for a little extra time determining string length, one could e.g. use a single-byte prefix for short strings and longer prefixes for longer strings. If one uses a lot of 1-3 byte strings that could save more than 50% on overall memory consumption for such strings compared with using a fixed four-byte prefix; such a format could also accommodate strings whose length exceeded the range of 32-bit integers.

  2. One may store variable-length strings within bounds-checked buffers at a cost of only one or two bits in the length prefix. The number N combined with the other bits would indicate one of three things:

    1. An N-byte string

    2. (Optional) An N-byte buffer holding a zero-length string

    3. An N-byte buffer which, if its last byte B is less than 248, holds a string of length N-B-1; if the 248 or more, the preceding B-247 bytes would store the difference between the buffer size and the string length. Note that if the length of the string is precisely N-1, the string will be followed by a NUL byte, and if it's less than that the byte following the string will be unused and could be set to NUL.

    Using such an approach, one would need to initialize strong buffers before use (to indicate their length), but would then no longer need to pass the length of a string buffer to a routine that was going to store data there.

  3. One may use certain prefix values to indicate various special things. For example, one may have a prefix that indicates that it is not followed by a string, but rather by a string-data pointer and two integers giving buffer size and current length. If methods that operate on strings call a method to get the data pointer, buffer size, and length, one may pass such a method a reference to a portion of a string cheaply provided that the string itself will outlive the method call.

  4. One may extend the above feature with a bit to indicate that the string data is in a region that was generated by malloc and may be resized if needed; additionally, one could safely have methods that sometimes return a dynamically-generated string allocated on the heap, and sometimes return an immutable static string, and have the recipient perform a "free this string if it isn't static".

I don't know if any prefixed-string implementations implement all those bonus features, but they can all be accommodated for very little cost in storage space, relatively little cost in code, and less cost in time than would be required to use NUL-terminated strings whose length was neither known nor short.