I'm writing a language interpreter in C, and my string type contains a length attribute, like so:
struct String { char* characters; size_t length; };
Because of this, I have to spend a lot of time in my interpreter handling this kind of string manually since C doesn't include built-in support for it. I've considered switching to simple null-terminated strings just to comply with the underlying C, but there seem to be a lot of reasons not to:
Bounds-checking is built-in if you use "length" instead of looking for a null.
With null-terminated strings, you have to traverse the entire string to find its length.
You have to do extra stuff to handle a null character in the middle of a null-terminated string.
Null-terminated strings deal poorly with Unicode.
Non-null-terminated strings can intern more, i.e. the characters for "Hello, world" and "Hello" can be stored in the same place, just with different lengths. This can't be done with null-terminated strings.
String slice (note: strings are immutable in my language). Obviously the second is slower (and more error-prone: think about adding error-checking of begin and end to both functions).
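    #include <stddef.h>
    #include <stdlib.h>

    /* Length-based version: O(1), no allocation; the result shares storage with `in`. */
    struct String slice(struct String in, size_t begin, size_t end) {
        struct String out;
        out.characters = in.characters + begin;
        out.length = end - begin;
        return out;
    }

    /* Null-terminated version: must allocate and copy so the result gets its own terminator.
       Renamed to slice_cstr here only because C does not allow two functions named slice. */
    char* slice_cstr(char* in, size_t begin, size_t end) {
        char* out = malloc(end - begin + 1);
        if (out == NULL)
            return NULL;
        for (size_t i = 0; i < end - begin; i++)
            out[i] = in[i + begin];
        out[end - begin] = '\0';
        return out;
    }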
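To make the error-checking point concrete, here is a minimal sketch of a bounds-checked slice for the length-based representation. The name `slice_checked` and its out-parameter signature are assumptions made for illustration, not part of the interpreter described above. It also shows the storage-sharing point: a slice for "Hello" can point into the same characters as "Hello, world".

```c
#include <stdbool.h>
#include <stddef.h>

struct String { char* characters; size_t length; };

/* Hypothetical helper: writes the slice [begin, end) of `in` to `*out`.
   Returns false and leaves `*out` untouched if the range is invalid. */
bool slice_checked(struct String in, size_t begin, size_t end, struct String* out) {
    if (begin > end || end > in.length)
        return false;
    out->characters = in.characters + begin;  /* shares storage with `in`, no copy */
    out->length = end - begin;
    return true;
}
```

Both checks are plain comparisons against the stored length. A null-terminated equivalent would first have to walk the string to learn where it ends before it could validate `end`.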
After all this, my thinking is no longer about whether I should use null-terminated strings: I'm thinking about why C uses them!
So my question is: are there any benefits to null-termination that I'm missing?
Character encodings
Null-terminated strings require that the encoding does not use a zero byte (0x00) anywhere; therefore it is not possible to store every possible ASCII or UTF-8 string. However, it is common to store the subset of ASCII or UTF-8 – every character except NUL – in null-terminated strings.
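The zero-byte restriction can be illustrated with a short sketch. It assumes the `struct String` layout from the question; `has_embedded_nul` is a hypothetical helper name. A length-based string can hold a zero byte as ordinary data, whereas `strlen` on the same bytes stops at the first zero.

```c
#include <stddef.h>
#include <string.h>

struct String { char* characters; size_t length; };

/* Hypothetical helper: returns nonzero if `s` contains a zero byte, meaning it
   cannot be represented faithfully as a null-terminated string. */
int has_embedded_nul(struct String s) {
    if (s.length == 0)
        return 0;  /* nothing to scan; also avoids passing a possibly NULL pointer */
    return memchr(s.characters, '\0', s.length) != NULL;
}
```

For the five bytes `'a', 'b', '\0', 'c', 'd'`, the length-based string reports a length of 5, while `strlen` on the same buffer reports 2.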
The null character indicates the end of the string. Such strings are called null-terminated strings. The null terminator of a multibyte string consists of one byte whose value is 0. The null terminator of a wide-character string consists of one gl_wchar_t character whose value is 0.
Many library functions accept a string or wide string argument with the constraint that the string they receive is properly null-terminated. Passing a character sequence or wide character sequence that is not null-terminated to such a function can result in accessing memory that is outside the bounds of the object.
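One consequence for a length-based string type is that it needs a conversion step before being handed to such functions. The following is a minimal sketch, assuming the `struct String` layout from the question; `to_cstr` is a hypothetical helper name.

```c
#include <stdlib.h>
#include <string.h>

struct String { char* characters; size_t length; };

/* Hypothetical helper: returns a freshly allocated, null-terminated copy of `s`,
   or NULL if allocation fails. The caller must free() the result. */
char* to_cstr(struct String s) {
    char* out = malloc(s.length + 1);
    if (out == NULL)
        return NULL;
    if (s.length > 0)
        memcpy(out, s.characters, s.length);
    out[s.length] = '\0';  /* add the terminator that library functions expect */
    return out;
}
```

If `s` itself contains a zero byte, library functions will still treat the copy as ending at that byte. For printing only, `printf("%.*s", (int)s.length, s.characters)` avoids the copy, provided the length fits in an `int`.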
A null-terminated string is a sequence of ASCII characters, one to a byte, followed by a zero byte (a null byte). Null-terminated strings are common in C and C++.
From Joel's Back to Basics:
Why do C strings work this way? It's because the PDP-7 microprocessor, on which UNIX and the C programming language were invented, had an ASCIZ string type. ASCIZ meant "ASCII with a Z (zero) at the end."
Is this the only way to store strings? No, in fact, it's one of the worst ways to store strings. For non-trivial programs, APIs, operating systems, class libraries, you should avoid ASCIZ strings like the plague.