
When will the new String object in memory get cleared after invoking the intern() method?

List<String> list = new ArrayList<>();
for (int i = 0; i < 1000; i++)
{
    StringBuilder sb = new StringBuilder();
    String string = sb.toString();   // new String object on the heap
    string = string.intern();        // replace it with the canonical instance
    list.add(string);
}

In the above sample, after invoking string.intern() method, when will the 1000 objects created in heap (sb.toString) be cleared?


Edit 1: If there is no guarantee that these objects will be cleared, and assuming that the GC hasn't run, is it pointless to use string.intern() at all (in terms of memory usage)?

Is there any way to reduce memory usage / object creation while using the intern() method?

Gokul Raj Kumar asked Jan 05 '18 12:01



1 Answer

Your example is a bit odd, as it creates 1000 empty strings. If you want to get such a list while consuming minimal memory, you should use

List<String> list = Collections.nCopies(1000, "");

instead.

If we assume that there is something more sophisticated going on, not creating the same string in every iteration, well, then there is no benefit in calling intern(). What will happen is implementation-dependent. But when calling intern() on a string that is not in the pool, in the best case it will just be added to the pool, and in the worst case another copy will be made and added to the pool.

At this point, we have no savings yet, but potentially created additional garbage.
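To make the point concrete, here is a minimal sketch (the class name InternDemo and the helper build are made up for this illustration). It shows that the temporary string has to exist before intern() can look up the canonical instance, so the lookup itself never avoids an allocation:

```java
class InternDemo {
    /** Builds a String at run time, so it is a fresh heap object rather than a compile-time constant. */
    static String build(String a, String b) {
        return new StringBuilder().append(a).append(b).toString();
    }

    public static void main(String[] args) {
        String first = build("inter", "ned");
        String second = build("inter", "ned");

        // Two distinct objects with equal contents already exist at this point.
        System.out.println(first == second);       // false
        System.out.println(first.equals(second));  // true

        // intern() maps both to one canonical instance; the temporaries
        // become garbage only once nothing references them anymore.
        System.out.println(first.intern() == second.intern()); // true
    }
}
```

Any saving only materializes if the canonical reference is kept and the temporary objects become unreachable.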

Interning at this point can only save you some memory, if there are duplicates somewhere. This implies that you construct duplicate strings first, to look up their canonical instance via intern() afterwards, so having the duplicate string in memory until garbage collected, is unavoidable. But that’s not the real problem with interning:

  • in older JVMs, there was special treatment of interned strings that could result in worse garbage collection performance or even running out of resources (i.e. the fixed size “PermGen” space).
  • in HotSpot, the string pool holding the interned strings is a fixed size hash table, yielding hash collisions, and hence poor performance, when referencing significantly more strings than the table size.
    Before Java 7, update 40, the default size was about 1,000, not even sufficient to hold all string constants for any nontrivial application without hash collisions, not to speak of manually added strings. Later versions use a default size of about 60,000, which is better, but still a fixed size that should discourage you from adding an arbitrary number of strings.
  • the string pool has to obey the inter-thread semantics mandated by the language specification (as it is used for string literals), and hence needs to perform thread safe updates that can degrade the performance.

Keep in mind that you pay the price of the disadvantages named above, even in the cases that there are no duplicates, i.e. there is no space saving. Also, the acquired reference to the canonical string has to have a much longer lifetime than the temporary object used to look it up, to have any positive effect on the memory consumption.

The latter touches your literal question. The temporary instances are reclaimed when the garbage collector runs the next time, which will be when the memory is actually needed. There is no need to worry about when this will happen, but well, yes, up to that point, acquiring a canonical reference had no positive effect, not only because the memory hasn’t been reused up to that point, but also, because the memory was not actually needed until then.

This is the place to mention the String Deduplication feature (introduced for the G1 collector in Java 8 update 20 and enabled with -XX:+UseStringDeduplication). This does not change string instances, i.e. the identity of these objects, as that would change the semantics of the program, but changes identical strings to use the same char[] array (a byte[] since Java 9). Since these arrays are the biggest payload, this still may achieve great memory savings, without the performance disadvantages of using intern(). Since this deduplication is done by the garbage collector, it will only be applied to strings that survived long enough to make a difference. Also, this implies that it will not waste CPU cycles when there still is plenty of free memory.


However, there might be cases where manual canonicalization is justified. Imagine we’re parsing a source code file or XML file, or importing strings from an external source (Reader or database) where such canonicalization will not happen by default, but duplicates may occur with a certain likelihood. If we plan to keep the data for further processing for a longer time, we might want to get rid of duplicate string instances.

In this case, one of the best approaches is to use a local map that is not subject to thread synchronization and to drop it after the process, to avoid keeping references longer than necessary, without having to use special interaction with the garbage collector. This implies that occurrences of the same strings within different data sources are not canonicalized (though they are still subject to the JVM’s String Deduplication), but it’s a reasonable trade-off. By using an ordinary resizable HashMap, we also do not have the issues of the fixed intern table.

E.g.

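// Needs java.nio.CharBuffer, java.util.* and java.util.regex.Matcher.
// TOKEN_PATTERN is a precompiled java.util.regex.Pattern defined elsewhere.
static List<String> parse(CharSequence input) {
    List<String> result = new ArrayList<>();

    Matcher m = TOKEN_PATTERN.matcher(input);
    CharBuffer cb = CharBuffer.wrap(input);
    // local cache, dropped when the method returns
    HashMap<CharSequence,String> cache = new HashMap<>();
    while(m.find()) {
        // the key is a lightweight view; toString() runs only for unseen tokens
        result.add(
            cache.computeIfAbsent(cb.subSequence(m.start(), m.end()), Object::toString));
    }
    return result;
}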

Note the use of the CharBuffer here: it wraps the input sequence and its subSequence method returns another wrapper with a different start and end index, implementing the right equals and hashCode methods for our HashMap, and computeIfAbsent will only invoke the toString method if the key was not present in the map before. So, unlike using intern(), no String instance will be created for already encountered strings, saving the most expensive aspect of it, the copying of the character arrays.
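The property this relies on can be checked in isolation. The following is a small sketch (the class name CharBufferKeyDemo and the helper view are hypothetical, introduced only for illustration) showing that two CharBuffer views over different regions with equal content are equal keys:

```java
import java.nio.CharBuffer;
import java.util.HashMap;
import java.util.Map;

class CharBufferKeyDemo {
    /** Returns a view of input[start, end) without copying any characters. */
    static CharSequence view(CharSequence input, int start, int end) {
        return CharBuffer.wrap(input).subSequence(start, end);
    }

    public static void main(String[] args) {
        String input = "foo bar foo";
        CharSequence a = view(input, 0, 3);   // first "foo"
        CharSequence b = view(input, 8, 11);  // second "foo"

        // CharBuffer compares and hashes its remaining characters.
        System.out.println(a.equals(b));                   // true
        System.out.println(a.hashCode() == b.hashCode());  // true

        Map<CharSequence, String> cache = new HashMap<>();
        String s1 = cache.computeIfAbsent(a, Object::toString);
        String s2 = cache.computeIfAbsent(b, Object::toString); // key found, no new String
        System.out.println(s1 == s2);                      // true
    }
}
```

Note that a String and a CharBuffer with the same characters are not equal to each other, so all keys of such a map have to be CharBuffer instances.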

If we have a really high likelihood of duplicates, we may even save the creation of wrapper instances:

static List<String> parse(CharSequence input) {
    List<String> result = new ArrayList<>();

    Matcher m = TOKEN_PATTERN.matcher(input);
    CharBuffer cb = CharBuffer.wrap(input);
    HashMap<CharSequence,String> cache = new HashMap<>();
    while(m.find()) {
        // reposition the single shared buffer instead of creating a new view
        cb.limit(m.end()).position(m.start());
        String s = cache.get(cb);
        if(s == null) {
            s = cb.toString();
            // cb is mutable, so store an independent wrapper as the key
            cache.put(CharBuffer.wrap(s), s);
        }
        result.add(s);
    }
    return result;
}

This creates only one wrapper per unique string, but also has to perform one additional hash lookup for each unique string when putting. Since the creation of a wrapper is quite cheap, you really need a significantly large number of duplicate strings, i.e. a small number of unique strings compared to the total number, to benefit from this trade-off.

As said, these approaches are very efficient, because they use a purely local cache that is just dropped afterwards. With this, we don’t have to deal with thread safety nor interact with the JVM or garbage collector in a special way.
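For trying this out, here is a self-contained sketch of the first variant. The class name LocalCacheDemo and the token pattern \w+ are assumptions made for this illustration only; the original TOKEN_PATTERN is not shown in the answer.

```java
import java.nio.CharBuffer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LocalCacheDemo {
    // Hypothetical token definition, chosen only for this illustration.
    static final Pattern TOKEN_PATTERN = Pattern.compile("\\w+");

    static List<String> parse(CharSequence input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN_PATTERN.matcher(input);
        CharBuffer cb = CharBuffer.wrap(input);
        // Purely local cache: unreachable as soon as this method returns.
        Map<CharSequence, String> cache = new HashMap<>();
        while (m.find()) {
            result.add(cache.computeIfAbsent(
                cb.subSequence(m.start(), m.end()), Object::toString));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = parse("to be or not to be");
        System.out.println(tokens);                          // [to, be, or, not, to, be]
        // Repeated tokens are the very same String instance.
        System.out.println(tokens.get(0) == tokens.get(4));  // true
        System.out.println(tokens.get(1) == tokens.get(5));  // true
    }
}
```

Strings parsed by two separate calls are not shared with each other, which is the trade-off described above.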

Holger answered Nov 14 '22 21:11