
Java 8 String deduplication vs. String.intern()

I am reading about the String deduplication feature introduced in Java 8 update 20, but I am not sure whether it makes String.intern() obsolete.

I know that this JVM feature needs the G1 garbage collector, which might not be an option for many. But assuming one is using G1GC, is there any difference, advantage, or disadvantage between the automatic deduplication done by the JVM and manually interning your strings? (One obvious advantage is not having to pollute your code with calls to intern().)

This is especially interesting considering that Oracle might make G1GC the default GC in Java 9.

Hilikus, asked Sep 29 '15 22:09



2 Answers

With this feature, if you have 1000 distinct String objects, all with the same content "abc", the JVM can make them share the same char[] internally. However, you still have 1000 distinct String objects.

With intern(), you will have just one String object. So if memory saving is your concern, intern() would be better. It'll save space, as well as GC time.
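The difference in object count can be checked directly. The following is a minimal sketch, not a benchmark; the class and method names (InternIdentityDemo, makeCopies, internAll, distinctIdentities) are made up for illustration. It counts object identities rather than contents, so it shows what intern() changes and what char[] sharing alone does not.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

// Hypothetical demo class: counts how many distinct String *objects* exist.
final class InternIdentityDemo {

    /** Creates n separate String objects that all contain "abc". */
    static String[] makeCopies(int n) {
        String[] copies = new String[n];
        for (int i = 0; i < n; i++) {
            copies[i] = new String("abc"); // new String(...) always yields a new object
        }
        return copies;
    }

    /** Returns an array holding the interned version of each input string. */
    static String[] internAll(String[] input) {
        String[] result = new String[input.length];
        for (int i = 0; i < input.length; i++) {
            result[i] = input[i].intern();
        }
        return result;
    }

    /** Counts distinct object identities (==), ignoring equals(). */
    static int distinctIdentities(String[] strings) {
        Set<String> seen =
                Collections.newSetFromMap(new IdentityHashMap<String, Boolean>());
        Collections.addAll(seen, strings);
        return seen.size();
    }

    public static void main(String[] args) {
        String[] copies = makeCopies(1000);
        System.out.println(distinctIdentities(copies));            // prints 1000
        System.out.println(distinctIdentities(internAll(copies))); // prints 1
    }
}
```

G1 deduplication would leave the first count at 1000, because it only merges the backing arrays; the String headers and fields are still allocated once per object.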

However, the performance of intern() isn't that great, last time I heard. You might be better off by having your own string cache, even using a ConcurrentHashMap ... but you need to benchmark it to make sure.
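A string cache of the kind suggested above might look roughly like this. It is a sketch under assumptions: the class name StringCache and the method canonical are invented here, and the map holds strong references, so entries are never released unless you add your own eviction.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical application-level cache that plays the role of String.intern().
final class StringCache {
    private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();

    /** Returns one shared instance for all strings with equal content. */
    String canonical(String s) {
        if (s == null) {
            return null; // ConcurrentHashMap does not accept null keys
        }
        // putIfAbsent returns the previously stored instance, or null if s was added.
        String existing = cache.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    /** Number of distinct strings currently retained by the cache. */
    int size() {
        return cache.size();
    }
}
```

Because the cache keeps every string alive, it suits a bounded vocabulary (status codes, field names, country names) better than unbounded input. Whether it beats intern() on a given JVM is something only a benchmark on realistic data can tell.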

ZhongYu, answered Oct 18 '22 13:10


As a comment mentions, do see: http://java-performance.info/string-intern-in-java-6-7-8/. It is a very insightful reference and I learned a lot; however, I'm not sure its conclusions are necessarily "one size fits all". Each aspect depends on the needs of your own application, so taking measurements with realistic input data is highly recommended!

The main factor is probably what you have control over:

  • Do you have full control over the choice of GC? In a GUI application, for example, there is still a strong case for the Serial GC: a far lower total memory footprint for the process (think 400 MB vs. ~1 GB for a moderately complex app) and much more willingness to release memory, e.g. after a transient spike in usage. So you might pick that, or give your users the option. (If the heap remains small, the pauses should not be a big deal.)

  • Do you have full control over the code? The G1GC option is great for third-party libraries (and applications!) which you can't edit.

The second consideration (as per @ZhongYu's answer) is that String.intern can de-duplicate the String objects themselves, whereas G1GC can only de-duplicate their private char[] field.

A third consideration may be CPU usage, say if the impact on laptop battery life is a concern for your users. G1GC runs an extra thread dedicated to de-duplicating the heap. For example, I tried this with Eclipse and found it caused an initial period of increased CPU activity after starting up (think 1 - 2 minutes), but it then settled on a smaller "in-use" heap with no obvious CPU overhead or slow-down thereafter (judging only by eye-balling the task manager). So I imagine a certain percentage of a CPU core will be taken up by de-duplication during (or after?) periods of high memory churn. (Of course there may be a comparable overhead if you call String.intern everywhere, which would also run serially, but then...)
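To observe this on your own machine rather than rely on an anecdote, a small workload like the one below can be run with and without deduplication. It is only a sketch: the class name DedupWorkload and the "customer-" strings are invented, and the command line in the comment uses the Java 8 flags (later JDKs replaced the statistics flag with unified logging, so check it against your JVM). Whether any strings are actually deduplicated during a short run depends on GC activity and object age.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical workload: retains many strings with only 100 distinct contents.
// Assumed Java 8 command line:
//   java -XX:+UseG1GC -XX:+UseStringDeduplication \
//        -XX:+PrintStringDeduplicationStatistics DedupWorkload
final class DedupWorkload {

    /** Builds count strings at run time; contents repeat every 100 entries. */
    static List<String> buildDuplicates(int count) {
        List<String> retained = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            // Built at run time, so each entry is a separate String object
            // with its own backing array until the collector merges them.
            retained.add(new StringBuilder("customer-").append(i % 100).toString());
        }
        return retained;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> retained = buildDuplicates(1_000_000);
        System.gc();          // only a hint; the JVM may ignore it
        Thread.sleep(5_000);  // leave time for any background work to show up
        System.out.println("retained strings: " + retained.size());
    }
}
```

Comparing CPU time and heap usage across the two runs gives a rough idea of the trade-off for this particular shape of data; a real application will behave differently.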

You probably don't need string de-duplication everywhere. There are probably only certain areas of code that:

  • really impact long-term heap usage, and
  • create a high proportion of duplicate strings

By using String.intern selectively, other parts of the code (which may create temporary or semi-temporary strings) don't pay the price.
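As an illustration of selective interning, consider a record parsed from a log line. The class ParsedEvent and the "LEVEL message" line format are assumptions made up for this sketch: the level field has few distinct values and is retained, so it is interned, while the mostly unique message is left alone.

```java
// Hypothetical long-lived record type; only the repetitive field is interned.
final class ParsedEvent {
    final String level;   // few distinct values, kept long-term: worth canonicalising
    final String message; // mostly unique: interning would only add overhead

    ParsedEvent(String level, String message) {
        this.level = level.intern();
        this.message = message;
    }

    /** Parses a line of the assumed form "LEVEL message text". */
    static ParsedEvent parse(String line) {
        int space = line.indexOf(' ');
        if (space < 0) {
            throw new IllegalArgumentException("expected 'LEVEL message': " + line);
        }
        // substring creates fresh String objects; the constructor interns the level.
        return new ParsedEvent(line.substring(0, space), line.substring(space + 1));
    }
}
```

With a million retained events, this keeps a handful of level strings instead of a million, and the parsing of messages pays nothing extra.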

And finally, a quick plug for the Guava utility: Interner, which:

Provides equivalent behavior to String.intern() for other immutable types

You can also use that for Strings. Memory probably is (and should be) your top performance concern, so this probably doesn't apply often. However, when you need to squeeze every drop of speed out of some hot-spot area, my experience is that Java-based weak-reference HashMap solutions run slightly but consistently faster than the JVM's C++ implementation of String.intern(), even after tuning the JVM options. (And as a bonus, you don't need to tune the JVM options to scale to different input.)
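For readers who want to see the shape of such a weak-reference solution without adding a dependency, here is a standard-library sketch. It is not Guava's implementation (with Guava you would obtain an Interner via Interners.newWeakInterner()); the class name WeakStringInterner is invented, and the simple synchronized method is an assumption that trades scalability for brevity.

```java
import java.lang.ref.WeakReference;
import java.util.WeakHashMap;

// Hypothetical weak interner: canonical strings can be collected once unused.
final class WeakStringInterner {
    // The value is a WeakReference so that it does not keep its own key alive.
    private final WeakHashMap<String, WeakReference<String>> pool = new WeakHashMap<>();

    /** Returns the canonical instance for s, registering s if none exists. */
    synchronized String intern(String s) {
        WeakReference<String> ref = pool.get(s);
        String canonical = (ref != null) ? ref.get() : null;
        if (canonical == null) {
            // No live canonical instance: s becomes the canonical one.
            pool.put(s, new WeakReference<>(s));
            canonical = s;
        }
        return canonical;
    }
}
```

Unlike a plain ConcurrentHashMap cache, entries here disappear once nothing else references the string, so the pool does not grow without bound on varied input. Any speed comparison against String.intern() still needs measuring on your own workload.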

Luke Usherwood, answered Oct 18 '22 13:10