Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pros and cons of immutable strings

Some languages (C# or Java) have immutable strings while others (e.g. Ruby) have mutable ones. What are the reasons behind those desgin choices?

like image 287
synapse Avatar asked Aug 07 '12 21:08

synapse


People also ask

What is the advantages and disadvantages of string immutability?

Immutable strings are cheap to copy, because you don't need to copy all the data - just copy a reference or pointer to the data. Immutable classes of any kind are easier to work with in multiple threads, the only synchronization needed is for destruction. Save this answer.

What are the disadvantages of immutability?

The only real disadvantage of immutable classes is that they require a separate object for each distinct value. Creating these objects can be costly, especially if they are large.

Should strings be immutable?

If the String doesn't remain immutable, any hacker can cause a security issue in the application by changing the reference value. The String is safe for multithreading because of its immutableness. Different threads can access a single “String instance”.


3 Answers

One reason why immutable strings are good is that it makes Unicode support easier. Modern Unicode can no longer fit efficiently into a fixed-size data cell, which kills the one-to-one correspondence between string index and memory address which gives mutable strings their advantage.


In the past, most Western applications used single-byte characters (various ASCII-based codings, or EBCDIC...), so you could usually handle them efficiently by treating strings as byte buffers (as in traditional C applications).

When Unicode was fairly new, there wasn't much requirement for anything outside the first 16 bits, so Java used double-byte characters for its Strings (and StringBuffers). This used twice the memory, and ignored any problems that might occur from Unicode extensions beyond 16 bits, but it was convenient at the time.

Now Unicode is not so new, and while the most-used characters still fit in 16 bits, you can't really get away with pretending the Basic Multilingual Plane is all that exists. If you want to honestly claim Unicode support, you need either variable-length characters or even larger (32-bit?) character cells.

With variable-length characters, you can no longer index into an arbitrary-length string in O(1) time -- barring additional information, you need to count from the beginning to figure out what the N'th character is. This also kills the main advantage of mutable string buffers: the ability to seamlessly modify substrings in place.

Fortunately, most string manipulation doesn't actually need this modify-in-place ability. Lexing, parsing, and search all proceed on a sequential, iterative basis, from beginning to end. General search-and-replace was never in-place to begin with, since the replacement string doesn't have to be the same length as the original.


Concatenating large numbers of substrings doesn't actually need modify-in-place to be efficient, either. You do need to be more careful about it, though, since (as others have pointed out) a naive concatenation loop can easily be O(N^2) by allocating a new string for each of N partial substrings...

One way to avoid naive concatenation is to provide a mutable StringBuffer or ConcatBuffer object designed to do concatenation efficiently. Another way would be to include an immutable string constructor that takes an iterator into a sequence of strings to be (efficiently) concatenated.

But, more generally, it is possible to write an immutable string library that efficiently concatenates by reference. This kind of string is often called a "rope" or "cord" to suggest that it is at least a bit more heavyweight than the basic strings it's composed of, but for concatenation purposes it is much more efficient, since it doesn't need to recopy the data at all!

The above Wikipedia link says that "rope" datastructures are O(log N) to concatenate, but the seminal paper "Purely Functional Data Structures" by Okasaki shows how to do concatenation in O(1) time.

like image 164
comingstorm Avatar answered Sep 28 '22 07:09

comingstorm


At least in the case of Java, part of the reason for making strings immutable was for security and thread-safety. Java places a premium on runtime safety (it was originally designed to allow set-top boxes and web browsers to download and execute remote content without compromising the host system). To help increase security, strings are immutable and cannot be subclassed. This means that the Java runtime can pass around and receive strings from the user while guaranteeing that the string's value will remain constant (that is, an attacker can't subclass the string, pass in what looks like a valid string into a function, but then change the value later on to gain access to the wrong data, or alternatively use multiple threads so that a string looks correct at one point, but then is mutated later on).

Additionally, immutability carries efficiency advantages in multithreaded systems, since no locking has to be done on the string. It also makes it possible to easily implement substring operations, since many strings can share the same underlying array of characters, though with different start and end points.

like image 42
templatetypedef Avatar answered Sep 28 '22 05:09

templatetypedef


If you think about it, all the fundamental data types are immutable. You don't change integer 10 into 11, you replace 10 with 11. Making strings fundamental, and immutable, allows pooling and other optimizations that would not be possible otherwise.

like image 45
ddyer Avatar answered Sep 28 '22 07:09

ddyer