What's the rationale for null terminated strings?

in light of the raging squall below:

I just want to point out that I'm not trying to say this is a stupid or bad question, or that the C way of representing strings is the best choice. I'm trying to clarify that the question would be more succinctly put if you take into account the fact that C has no mechanism for differentiating a string as a datatype from a byte array. Is this the best choice in light of the processing and memory power of todays computers? Probably not. But hindsight is always 20/20 and all that :)

The question is asked as a Length Prefixed Strings (LPS) vs zero terminated strings (SZ) thing, but mostly expose benefits of length prefixed strings. That may seem overwhelming, but to be honest we should also consider drawbacks of LPS and advantages of SZ.

As I understand it, the question may even be understood as a biased way to ask "what are the advantages of Zero Terminated Strings ?".

Advantages (I see) of Zero Terminated Strings:

very simple, no need to introduce new concepts in language, char arrays/char pointers can do.
the core language just include minimal syntaxic sugar to convert something between double quotes to a bunch of chars (really a bunch of bytes). In some cases it can be used to initialize things completely unrelated with text. For instance xpm image file format is a valid C source that contains image data encoded as a string.
by the way, you can put a zero in a string literal, the compiler will just also add another one at the end of the literal: "this\0is\0valid\0C". Is it a string ? or four strings ? Or a bunch of bytes...
flat implementation, no hidden indirection, no hidden integer.
no hidden memory allocation involved (well, some infamous non standard functions like strdup perform allocation, but that's mostly a source of problem).
no specific issue for small or large hardware (imagine the burden to manage 32 bits prefix length on 8 bits microcontrollers, or the restrictions of limiting string size to less than 256 bytes, that was a problem I actually had with Turbo Pascal eons ago).
implementation of string manipulation is just a handful of very simple library function
efficient for the main use of strings : constant text read sequentially from a known start (mostly messages to the user).
the terminating zero is not even mandatory, all necessary tools to manipulate chars like a bunch of bytes are available. When performing array initialisation in C, you can even avoid the NUL terminator. Just set the right size. char a[3] = "foo"; is valid C (not C++) and won't put a final zero in a.
coherent with the unix point of view "everything is file", including "files" that have no intrinsic length like stdin, stdout. You should remember that open read and write primitives are implemented at a very low level. They are not library calls, but system calls. And the same API is used for binary or text files. File reading primitives get a buffer address and a size and return the new size. And you can use strings as the buffer to write. Using another kind of string representation would imply you can't easily use a literal string as the buffer to output, or you would have to make it have a very strange behavior when casting it to char*. Namely not to return the address of the string, but instead to return the actual data.
very easy to manipulate text data read from a file in-place, without useless copy of buffer, just insert zeroes at the right places (well, not really with modern C as double quoted strings are const char arrays nowaday usually kept in non modifiable data segment).
prepending some int values of whatever size would implies alignment issues. The initial length should be aligned, but there is no reason to do that for the characters datas (and again, forcing alignment of strings would imply problems when treating them as a bunch of bytes).
length is known at compile time for constant literal strings (sizeof). So why would anyone want to store it in memory prepending it to actual data ?
in a way C is doing as (nearly) everyone else, strings are viewed as arrays of char. As array length is not managed by C, it is logical length is not managed either for strings. The only surprising thing is that 0 item added at the end, but that's just at core language level when typing a string between double quotes. Users can perfectly call string manipulation functions passing length, or even use plain memcopy instead. SZ are just a facility. In most other languages array length is managed, it's logical that is the same for strings.
in modern times anyway 1 byte character sets are not enough and you often have to deal with encoded unicode strings where the number of characters is very different of the number of bytes. It implies that users will probably want more than "just the size", but also other informations. Keeping length give use nothing (particularly no natural place to store them) regarding these other useful pieces of information.

That said, no need to complain in the rare case where standard C strings are indeed inefficient. Libs are available. If I followed that trend, I should complain that standard C does not include any regex support functions... but really everybody knows it's not a real problem as there is libraries available for that purpose. So when string manipulation efficiency is wanted, why not use a library like bstring ? Or even C++ strings ?

EDIT: I recently had a look to D strings. It is interesting enough to see that the solution choosed is neither a size prefix, nor zero termination. As in C, literal strings enclosed in double quotes are just short hand for immutable char arrays, and the language also has a string keyword meaning that (immutable char array).

But D arrays are much richer than C arrays. In the case of static arrays length is known at run-time so there is no need to store the length. Compiler has it at compile time. In the case of dynamic arrays, length is available but D documentation does not state where it is kept. For all we know, compiler could choose to keep it in some register, or in some variable stored far away from the characters data.

On normal char arrays or non literal strings there is no final zero, hence programmer has to put it itself if he wants to call some C function from D. In the particular case of literal strings, however the D compiler still put a zero at the end of each strings (to allow easy cast to C strings to make easier calling C function ?), but this zero is not part of the string (D does not count it in string size).

The only thing that disappointed me somewhat is that strings are supposed to be utf-8, but length apparently still returns a number of bytes (at least it's true on my compiler gdc) even when using multi-byte chars. It is unclear to me if it's a compiler bug or by purpose. (OK, I probably have found out what happened. To say to D compiler your source use utf-8 you have to put some stupid byte order mark at beginning. I write stupid because I know of not editor doing that, especially for UTF-8 that is supposed to be ASCII compatible).

I think, it has historical reasons and found this in wikipedia:

At the time C (and the languages that it was derived from) were developed, memory was extremely limited, so using only one byte of overhead to store the length of a string was attractive. The only popular alternative at that time, usually called a "Pascal string" (though also used by early versions of BASIC), used a leading byte to store the length of the string. This allows the string to contain NUL and made finding the length need only one memory access (O(1) (constant) time). But one byte limits the length to 255. This length limitation was far more restrictive than the problems with the C string, so the C string in general won out.

Related questions
                            
                                How much faster is C++ than C#?
                            
                                Why the switch statement cannot be applied on strings?
                            
                                Is there any use for unique_ptr with array?
                            
                                How to convert a char array to a string?
                            
                                Write applications in C or C++ for Android? [closed]
                            
                                Why doesn't C++ have a garbage collector?
                            
                                Meaning of = delete after function declaration
                            
                                How can I pad an int with leading zeros when using cout << operator? [duplicate]
                            
                                Forward declaring an enum in C++
                            
                                Determine if map contains a value for a key?
                            
                                Initialization of all elements of an array to one default value in C++?
                            
                                What is the difference between new/delete and malloc/free?
                            
                                Differences between unique_ptr and shared_ptr [duplicate]
                            
                                Sorting a vector of custom objects
                            
                                How exactly is std::string_view faster than const std::string&?
                            
                                Are == and != mutually dependent?
                            
                                What is uintptr_t data type
                            
                                How to retrieve all keys (or values) from a std::map and put them into a vector?
                            
                                Purpose of Unions in C and C++
                            
                                How to sum up elements of a C++ vector?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

What's the rationale for null terminated strings?

Tags:

c++

c

string

null-terminated

People also ask

in light of the raging squall below:

Recent Activity

Donate For Us