I am reading about the String deduplication feature in Java 8 update 20, but I am not sure whether this basically makes String.intern() obsolete.
I know that this JVM feature needs the G1 garbage collector, which might not be an option for many, but assuming one is using G1GC, is there any difference/advantage/disadvantage of the automatic deduplication done by the JVM vs manually having to intern your strings (one obvious advantage being that you don't have to pollute your code with calls to intern())?
This is especially interesting considering that Oracle might make G1GC the default GC in Java 9.
The intern() method returns the canonical instance of a String from the String constant pool. If the pool already contains a String with the same contents, no new entry is created and the reference to the pooled String is returned; otherwise the String is added to the pool and a reference to it is returned. Takeaway: intern() is used to put strings that are in the heap into the String constant pool if an equal string is not already present.
String Deduplication is a Java feature that helps you to save memory occupied by duplicate String objects in Java applications. It reduces the memory footprint of String objects in Java heap memory by making the duplicate or identical String values share the same character array.
The single copy of each string is called its intern and is typically looked up by a method of the string class, for example String.intern() in Java. All compile-time constant strings in Java are automatically interned using this method.
With this feature, if you have 1000 distinct String objects, all with the same content "abc", the JVM could make them share the same char[] internally. However, you still have 1000 distinct String objects.
With intern(), you will have just one String object. So if memory saving is your concern, intern() would be better. It'll save space, as well as GC time.
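To make the object-identity point concrete, here is a small sketch (the class name InternIdentityDemo and the helper sameObject are made up for illustration). It only shows what intern() does to references. G1 deduplication happens inside the JVM and is not observable through the public String API, so it is not demonstrated here.

```java
final class InternIdentityDemo {
    /** Returns true when both arguments are the very same object (reference equality). */
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        // new String(...) forces two distinct String objects with equal content.
        String a = new String("abc");
        String b = new String("abc");

        System.out.println(a.equals(b));                         // true: same content
        System.out.println(sameObject(a, b));                    // false: two objects
        System.out.println(sameObject(a.intern(), b.intern()));  // true: one canonical object
    }
}
```

After interning, the two original objects become garbage as soon as nothing else references them, which is where the saving in object count comes from.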
However, the performance of intern() isn't that great, last time I heard. You might be better off having your own string cache, even one using a ConcurrentHashMap... but you need to benchmark it to make sure.
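A minimal sketch of such a home-grown cache, assuming a plain ConcurrentHashMap is acceptable (the class name StringCache and the method canonicalize are hypothetical, not from any library). Note the assumption baked in: this map holds strong references, so entries are never released unless you clear them yourself.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class StringCache {
    // Maps each distinct content to the first String instance seen with that content.
    private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();

    /** Returns the cached instance equal to s, caching s itself if none exists yet. */
    String canonicalize(String s) {
        String existing = cache.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    /** Number of distinct strings currently held (strongly) by the cache. */
    int size() {
        return cache.size();
    }

    public static void main(String[] args) {
        StringCache cache = new StringCache();
        String first = cache.canonicalize(new String("abc"));
        String second = cache.canonicalize(new String("abc"));
        System.out.println(first == second); // true: both callers get the same instance
    }
}
```

Whether this beats intern() for your workload is exactly the thing to benchmark.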
As a comment references, do see: http://java-performance.info/string-intern-in-java-6-7-8/. It is a very insightful reference and I learned a lot; however, I'm not sure its conclusions are necessarily "one size fits all". Each aspect depends on the needs of your own application, so taking measurements with realistic input data is highly recommended!
The main factor probably depends on what you are in control over:
Do you have full control over the choice of GC? In a GUI application, for example, there is still a strong case to be made for using Serial GC (far lower total memory footprint for the process, think 400 MB vs ~1 GB for a moderately complex app, and being much more willing to release memory, e.g. after a transient spike in usage). So you might pick that or give your users the option. (If the heap remains small, the pauses should not be a big deal.)
Do you have full control over the code? The G1GC option is great for 3rd party libraries (and applications!) which you can't edit.
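For reference, the JVM feature is switched on with the HotSpot options -XX:+UseG1GC -XX:+UseStringDeduplication. If you ship an application and want to log at startup whether the flag was passed, something like the following sketch could work (DedupFlagCheck and dedupRequested are hypothetical names). It only inspects the arguments the JVM reports and says nothing about whether deduplication is actually taking effect.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

final class DedupFlagCheck {
    /** True if the given JVM arguments contain the string deduplication flag. */
    static boolean dedupRequested(List<String> jvmArgs) {
        return jvmArgs.contains("-XX:+UseStringDeduplication");
    }

    public static void main(String[] args) {
        // Launch with: java -XX:+UseG1GC -XX:+UseStringDeduplication DedupFlagCheck
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("String deduplication requested: " + dedupRequested(jvmArgs));
    }
}
```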
The second consideration (as per @ZhongYu's answer) is that String.intern can de-duplicate the String objects themselves, whereas G1GC necessarily can only de-duplicate their private char[] field.
A third consideration may be CPU usage, say if the impact on laptop battery life might be of concern to your users. G1GC will run an extra thread dedicated to de-duplicating the heap. For example, I played with this to run Eclipse and found it caused an initial period of increased CPU activity after starting up (think 1 - 2 minutes), but it settled on a smaller heap "in-use" and no obvious (just eye-balling the task manager) CPU overhead or slow-down thereafter. So I imagine a certain % of a CPU core will be taken up by de-duplication (during? after?) periods of high memory churn. (Of course there may be a comparable overhead if you call String.intern everywhere, which would also run in serial, but then...)
You probably don't need string de-duplication everywhere. There are probably only certain areas of code that:
By using String.intern selectively, other parts of the code (which may create temporary or semi-temporary strings) don't pay the price.
And finally, a quick plug for the Guava utility Interner, which:
"Provides equivalent behavior to String.intern() for other immutable types"
You can also use that for Strings. Memory probably is (and should be) your top performance concern, so this probably doesn't apply often. However, when you need to squeeze every drop of speed out of some hot-spot area, my experience is that Java-based weak-reference HashMap solutions do run slightly but consistently faster than the JVM's C++ implementation of String.intern(), even after tuning the JVM options. (And bonus: you don't need to tune the JVM options to scale to different input.)
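For the curious, a weak-reference HashMap interner can be sketched with just the JDK. This is an illustration under my own assumptions (the class WeakStringInterner is hypothetical and uses coarse synchronization), not the Guava implementation; with Guava on the classpath you would more likely reach for Interners.newWeakInterner().

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

final class WeakStringInterner {
    // Key and value refer to the same String, both weakly, so an entry can be
    // garbage collected once no caller holds the canonical instance any more.
    private final Map<String, WeakReference<String>> pool = new WeakHashMap<>();

    /** Returns the canonical instance equal to s, registering s if none is live. */
    synchronized String intern(String s) {
        WeakReference<String> ref = pool.get(s);
        String canonical = (ref == null) ? null : ref.get();
        if (canonical == null) {
            pool.put(s, new WeakReference<>(s));
            canonical = s;
        }
        return canonical;
    }

    public static void main(String[] args) {
        WeakStringInterner interner = new WeakStringInterner();
        String a = interner.intern(new String("abc"));
        String b = interner.intern(new String("abc"));
        System.out.println(a == b); // true while the canonical instance is still referenced
    }
}
```

Any speed comparison against String.intern() depends on the JVM version and your data, so measure before committing to either.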