Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why null-terminated strings? Or: null-terminated vs. characters + length storage

I'm writing a language interpreter in C, and my string type contains a length attribute, like so:

struct String {     char* characters;     size_t length; }; 

Because of this, I have to spend a lot of time in my interpreter handling this kind of string manually since C doesn't include built-in support for it. I've considered switching to simple null-terminated strings just to comply with the underlying C, but there seem to be a lot of reasons not to:

Bounds-checking is built-in if you use "length" instead of looking for a null.

You have to traverse the entire string to find its length.

You have to do extra stuff to handle a null character in the middle of a null-terminated string.

Null-terminated strings deal poorly with Unicode.

Non-null-terminated strings can intern more, i.e. the characters for "Hello, world" and "Hello" can be stored in the same place, just with different lengths. This can't be done with null-terminated strings.

String slice (note: strings are immutable in my language). Obviously the second is slower (and more error-prone: think about adding error-checking of begin and end to both functions).

struct String slice(struct String in, size_t begin, size_t end) {     struct String out;     out.characters = in.characters + begin;     out.length = end - begin;      return out; }  char* slice(char* in, size_t begin, size_t end) {     char* out = malloc(end - begin + 1);      for(int i = 0; i < end - begin; i++)         out[i] = in[i + begin];      out[end - begin] = '\0';      return out; } 

After all this, my thinking is no longer about whether I should use null-terminated strings: I'm thinking about why C uses them!

So my question is: are there any benefits to null-termination that I'm missing?

like image 416
Imagist Avatar asked Aug 10 '09 05:08

Imagist


People also ask

Why do strings need to be null-terminated?

Character encodings Null-terminated strings require that the encoding does not use a zero byte (0x00) anywhere; therefore it is not possible to store every possible ASCII or UTF-8 string. However, it is common to store the subset of ASCII or UTF-8 – every character except NUL – in null-terminated strings.

What is the use of terminating NULL character?

The null character indicates the end of the string. Such strings are called null-terminated strings. The null terminator of a multibyte string consists of one byte whose value is 0. The null terminator of a wide-character string consists of one gl_wchar_t character whose value is 0.

What happens if you don't null terminate a string?

Many library functions accept a string or wide string argument with the constraint that the string they receive is properly null-terminated. Passing a character sequence or wide character sequence that is not null-terminated to such a function can result in accessing memory that is outside the bounds of the object.

How is a null-terminated string arranged in memory?

A null-terminated string is a sequence of ASCII characters, one to a byte, followed by a zero byte (a null byte). null-terminated strings are common in C and C++.


1 Answers

From Joel's Back to Basics:

Why do C strings work this way? It's because the PDP-7 microprocessor, on which UNIX and the C programming language were invented, had an ASCIZ string type. ASCIZ meant "ASCII with a Z (zero) at the end."

Is this the only way to store strings? No, in fact, it's one of the worst ways to store strings. For non-trivial programs, APIs, operating systems, class libraries, you should avoid ASCIZ strings like the plague.

like image 149
weiqure Avatar answered Sep 28 '22 18:09

weiqure