Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Java 7 String - substring complexity

Tags:

java

java-7

Until Java 6, we had a constant time substring on String. In Java 7, why did they decide to go with copying char array - and degrading to linear time complexity - when something like StringBuilder was exactly meant for that?

like image 544
anoopelias Avatar asked Apr 20 '13 17:04

anoopelias


People also ask

What is the complexity of substring in Java?

You can generate substrings of a String using StringBuilder/StringBuffer class in java. The time complexity will be O(n)2.

What is the time complexity of toCharArray?

Next let's talk about complexity. In toCharArray(), it takes O(n) time to return the result, where n is the length of the string. It also takes O(n) space as it creates a defensive copy of the original string. In charAt(), it takes O(1) time to return the result as it's a random access.

What is substring in selenium?

substring( ) method is used to extract a portion from the actual string using the following two formats: "StringText". substring(startIndex) -> Example: "SunShine". substring(3) -> It will extract the portion of string starting from Index 3 to the end of the string.


3 Answers

Why they decided is discussed in Oracle bug #4513622 : (str) keeping a substring of a field prevents GC for object:

When you call String.substring as in the example, a new character array for storage is not allocated. It uses the character array of the original String. Thus, the character array backing the the original String can not be GC'd until the substring's references can also be GC'd. This is an intentional optimization to prevent excessive allocations when using substring in common scenarios. Unfortunately, the problematic code hits a case where the overhead of the original array is noticeable. It is difficult to optimize for both edges cases. Any optimization for space/size trade-offs are generally complex and can often be platform-specific.

There's also this note, noting that what once was an optimization had become a pessimization according to tests:

For a long time preparations and planing have been underway to remove the offset and count fields from java.lang.String. These two fields enable multiple String instances to share the same backing character buffer. Shared character buffers were an important optimization for old benchmarks but with current real world code and benchmarks it's actually better to not share backing buffers. Shared char array backing buffers only "win" with very heavy use of String.substring. The negatively impacted situations can include parsers and compilers however current testing shows that overall this change is beneficial.

like image 98
Andy Thomas Avatar answered Oct 19 '22 08:10

Andy Thomas


If you have a long lived small substring of a short lived, large parent string, the large char[] backing the parent string will not be eligible for garbage collection until the small substring moves out of scope. This means a substring can take up much more memory than people expect.

The only time the Java 6 way performed significantly better was when someone took a large substring from a large parent string, which is a very rare case.

Clearly they decided that the tiny performance cost of this change was outweighed by the hidden memory problems caused by the old way. The determining factor is that the problem was hidden, not that there is a workaround.

like image 22
ILMTitan Avatar answered Oct 19 '22 10:10

ILMTitan


This will impact the complexity of data structures like Suffix Arrays by a fair margin. Java should provide some alternate method for getting a part of the original string.

like image 5
Heisenberg Avatar answered Oct 19 '22 08:10

Heisenberg