Is string interning really useful?

Tags:

I was having a conversation about strings and various languages a while back, and the topic of string interning came up. Apparently Java and the .NET framework do this automatically with all strings, as well as several scripting languages. Theoretically, it saves memory because you don't end up with multiple copies of the same string, and it saves time because string equality comparisons are a simple pointer comparison instead of an O(N) run through each character of the string.

But the more I think about it, the more skeptical I grow of the concept's benefits. It seems to me that the advantages are mostly theoretical:

First off, to use automatic string interning, all strings must be immutable, which makes a lot of string processing tasks harder than they need to be. (And yes, I've heard all the arguments for immutability in general. That's not the point.)
Every time a new string is created, it has to be checked against the string interning table, which is at least a O(N) operation. (EDIT: Where N is the size of the string, not the size of the table, since this was confusing people.) So unless the ratio of string equality comparisons to new string creation is pretty high, it's unlikely that the net time saved is a positive value.
If the string equality table uses strong references, the strings will never get garbage collected when they're no longer needed, thus wasting memory. On the other hand, if the table uses weak references, then the string class requires some sort of finalizer to remove the string from the table, thus slowing down the GC process. (Which could be pretty significant, depending on how the string intern table is implemented. Worst case, deleting an item from a hash table can require an O(N) rebuild of the entire table under certain circumstances.)

This is just the result of me thinking about implementation details. Is there something I've missed? Does string interning actually provide any significant benefits in the general case?

EDIT 2: All right, apparently I was operating from a mistaken premise. The person I was talking to never pointed out that string interning was optional for newly-created strings, and in fact gave the strong impression that the opposite was true. Thanks to Jon for setting the matter straight. Another accepted answer for him.

212

asked Jul 11 '11 17:07

Mason Wheeler

4 Answers

No, Java and .NET don't do it "automatically with all strings". They (well, Java and C#) do it with constant string expressions expressed in bytecode/IL, and on demand via the String.intern and String.Intern (.NET) methods. The exact situation in .NET is interesting, but basically the C# compiler will guarantee that every reference to an equal string constant within an assembly ends up referring to the same string object. That can be done efficiently at type initialization time, and can save a bunch of memory.

It doesn't happen every time a new string is created.

(On the string immutability front, I for one am extremely glad that strings are immutable. I don't want to have to take a copy every time I receive a parameter etc, thank you very much. I haven't seen it make string processing tasks harder, either...)

And as others have pointed out, looking up a string in a hash table isn't generally an O(n) operation, unless you're incredibly unlucky with hash collisions...

Personally I don't use string interning in user-land code; if I want some sort of cache of strings I'll create a HashSet<string> or something similar. That can be useful in various situations where you expect to come across the same strings several times (e.g. XML element names) but with a simple collection you don't pollute a system-wide cache.

172

answered Oct 22 '22 04:10

Jon Skeet

First off, to use automatic string interning, all strings must be immutable, which makes a lot of string processing tasks harder than they need to be. (And yes, I've heard all the arguments for immutability in general. That's not the point.)

This is true and string are immutable in Java. I am not sure if this a bad thing. Without going in to immutable vs mutable, I like to think this is a great design because of caching and so much more simplicity that I won't get in to.

Every time a new string is created, it has to be checked against the string interning table, which is at least a O(N) operation. So unless the ratio of string equality comparisons to new string creation is pretty high, it's unlikely that the net time saved is a positive value.

Not exactly O(n). You can do hashmaps and/or other data structures that will bring this to near constant look up.

If the string equality table uses strong references, the strings will never get garbage collected when they're no longer needed, thus wasting memory. On the other hand, if the table uses weak references, then the string class requires some sort of finalizer to remove the string from the table, thus slowing down the GC process. (Which could be pretty significant, depending on how the string intern table is implemented. Worst case, deleting an item from a hash table can require an O(N) rebuild of the entire table under certain circumstances.)

You are right about this and I would agree with you. Except I feel that the GC processing and negligible. The benefits in the long run are much more useful than having a garbage collector doing an extra check. I am not sure what you mean about O(n) for deleting from hashtable. Most operations on hashtables are O(1)

So in summary, I think your assumption that most operation are linear. But looking up strings is closer to a constant time. Therefore, this approach will have negligible performance loss but a huge memory gain. Which I'd argue is worth it.

Here is a nice quote on what is actually happening and how it saves memory.

To save memory (and speed up testing for equality), Java supports “interning” of Strings. When the intern() method is invoked on a String, a lookup is performed on a table of interned Strings. If a String object with the same content is already in the table, a reference to the String in the table is returned. Otherwise, the String is added to the table and a reference to it is returned.

answered Oct 22 '22 05:10

Amir Raminfar

Here's the python documentation's take on it:

sys.intern(string)

Enter string in the table of “interned” strings and return the interned string – which is string itself or a copy. Interning strings is useful to gain a little performance on dictionary lookup – if the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare. Normally, the names used in Python programs are automatically interned, and the dictionaries used to hold module, class or instance attributes have interned keys.

Interned strings are not immortal; you must keep a reference to the return value of intern() around to benefit from it.

answered Oct 22 '22 06:10

Steven Rumbalski

The a.equals(b) is very fast for random strings. Its is only slow for Strings which are long and the same (or almost the same)

Click to copy

Random rand = new Random(1);
String[] list = new String[2000];
for(int i=0;i<list.length;i++)
    list[i] = "1234567"+Long.toString(rand.nextInt(36*37), 36); // semi random
int count = 0;
long start = System.nanoTime();
for(int i=0;i<list.length;i++)
    for(int j=0;j<list.length;j++)
        if (list[i].equals(list[j]))
            count++;
long time = System.nanoTime() - start;
System.out.printf("The average time for equals() was %,d ns.%n", time/list.length/list.length);

on a 2.3 GHz laptop prints

Click to copy

The average time for equals() was 19 ns.

If you intern() the first value and have to intern() one value to do the comparison

Click to copy

       if (list[i] == list[j].intern())

prints

Click to copy

The average time for equals() was 258 ns.

This is a common case as you often have one value you know is interned and a second which is input and is not intern'ed.

if you only use intern'ed Strings and == it, and don't count the cost, prints

Click to copy

The average time for equals() was 4 ns.

Which is many times faster if you are doing millions of comparison. However for a small number of comparisons, you are saving 8 ns but could be costing 250 ns more.

It may just be simpler to avoid intern() and use equals().

answered Oct 22 '22 04:10

Peter Lawrey

Related questions
                            
                                What is character 0x1f?
                            
                                PostgreSQL - GROUP BY timestamp values?
                            
                                When to use a key-value store for web development?
                            
                                When does it make sense to use a public field?
                            
                                else & elif statements not working in Python
                            
                                php function how to set default value as object?
                            
                                Java: How to call super method from inner in-place class
                            
                                How to scroll a select list with JavaScript or jQuery?
                            
                                UITableview cells height is not working in a empty table?
                            
                                Git branches & commits history after merge
                            
                                How to test for blank line with Java Scanner?
                            
                                What exception to throw when a property setter is not allowed?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Is string interning really useful?

Tags:

java

python

.net

ruby

string-interning

Mason Wheeler

People also ask

4 Answers

Jon Skeet

Amir Raminfar

Steven Rumbalski

Peter Lawrey

Recent Activity

Donate For Us