Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How would you implement your string type?

Tags:

string

Assume you are designing and implementing a new language from scratch, though you may freely borrow ideas from existing languages/implementations.

Question: If a programmer declares a string variable (assume strongly typed), how would you choose to store this variable in memory?

There are many use cases, but do you have a particular model that is superior in certain areas? Is your string mutable? Is it mutable, but only to a certain length that isn't the end of memory? Can I dynamically set the length, or can it only be done at compile time? Is it easy to access the 'nth' element? Does the string require a contiguous sector of memory? Could it be broken up into smaller strings?

Certain things to consider that programmers might like to do with your string: Calculating the length. Adding to the string. Extracting parts of the string (substrings). Applying Regex. Converting to a different value (number, boolean, etc)

EDIT: Clarifying what I mean.

If a user declares the following:

var Name : string

How would you choose, as the language designer, how to store this in RAM? What are the advantages and disadvantages of your method, etc.

like image 1000
Coltin Avatar asked Feb 19 '09 17:02

Coltin


2 Answers

If I were writing a language from the ground up, I would want both mutable and immutable string types defined. Immutability makes string-handling operations much faster, but creates serious limitations, particularly when it comes to concatenation and the like.

The immutable string, I would store as a null-terminated array of unicode values. The mutable string, I would store as a linked list of unicode chars for easier reshuffling, slicing, etc.

like image 163
Yes - that Jake. Avatar answered Jan 01 '23 08:01

Yes - that Jake.


I'd avoid C strings. Computing the length is O(n). Sharing substrings is nearly impossible. The contiguous memory requirement leads to fragmentation. Any problem with the terminator leads to bugs and security holes. If you store it as UCS-4, you waste a lot of space for ASCII strings (and lose C compatibility, the one benefit of C strings); if you store it as UTF-8, indexing is O(n). PDP-11's ASCIZ type only really made a lot of sense when you're writing a library for ASCII on a bare PDP-11.

Languages younger than the PDP-11 often use a different structure:

  • Pascal uses a length field instead of a terminator -- their strlen() is O(1).
  • Forth uses (address, length) doubles -- their strlen() is O(1), plus they can easily share substrings.
  • Many modern "managed" languages like Java store the length separately, as well.
  • In other languages (like Common Lisp), strings are simply a subtype of vector (whose elements are characters).
  • The Excel team used C, but implemented their own Pascal strings for performance.

I'd use something like ropes. Concatenation is constant-time. They don't require contiguous memory. Substring sharing is easy. All operations can be performed without locking in a multithreaded environment. Maybe allow both UCS-4 and ASCII nodes to make storage more compact in the common case, and/or have it automatically use a simpler structure internally for really short strings.

ASCIZ is great if you have little memory, short strings, 7-bit chars, trusted input, and your CPU is so slow that it's worth your programmer-time to be really careful. In the modern world of Unicode, multithreading, efficient GC, fast CPUs, and big (possibly untrusted) inputs, it's no longer a great choice.

like image 44
Ken Avatar answered Jan 01 '23 09:01

Ken