String class has some methods that i cannot understand why they were implemented like this... replace is one of them.
public String replace(CharSequence target, CharSequence replacement) {
return Pattern.compile(target.toString(), Pattern.LITERAL).matcher(
this).replaceAll(Matcher.quoteReplacement(replacement.toString()));
}
Are there some significant advantages over a simpler and more efficient (fast!) method?
public static String replace(String string, String searchFor, String replaceWith) {
StringBuilder result=new StringBuilder();
int index=0;
int beginIndex=0;
while((index=string.indexOf(searchFor, index))!=-1){
result.append(string.substring(beginIndex, index)+replaceWith);
index+=searchFor.length();
beginIndex=index;
}
result.append(string.substring(beginIndex, string.length()));
return result.toString();
}
Stats with Java 7:
1,000,000 iterations
replace "b" with "x" in "a.b.c"
result: "a.x.c"
Times:
string.replace: 485ms
string.replaceAll: 490ms
optimized replace = 180ms
Code like the Java 7 split method is heavily optimized to avoid pattern compile / regex processing when possible:
public String[] split(String regex, int limit) {
/* fastpath if the regex is a
(1)one-char String and this character is not one of the
RegEx's meta characters ".$|()[{^?*+\\", or
(2)two-char String and the first char is the backslash and
the second is not the ascii digit or ascii letter.
*/
char ch = 0;
if (((regex.value.length == 1 &&
".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
(regex.length() == 2 &&
regex.charAt(0) == '\\' &&
(((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
((ch-'a')|('z'-ch)) < 0 &&
((ch-'A')|('Z'-ch)) < 0)) &&
(ch < Character.MIN_HIGH_SURROGATE ||
ch > Character.MAX_LOW_SURROGATE))
{
int off = 0;
int next = 0;
boolean limited = limit > 0;
ArrayList<String> list = new ArrayList<>();
while ((next = indexOf(ch, off)) != -1) {
if (!limited || list.size() < limit - 1) {
list.add(substring(off, next));
off = next + 1;
} else { // last one
//assert (list.size() == limit - 1);
list.add(substring(off, value.length));
off = value.length;
break;
}
}
// If no match was found, return this
if (off == 0)
return new String[]{this};
// Add remaining segment
if (!limited || list.size() < limit)
list.add(substring(off, value.length));
// Construct result
int resultSize = list.size();
if (limit == 0)
while (resultSize > 0 && list.get(resultSize - 1).length() == 0)
resultSize--;
String[] result = new String[resultSize];
return list.subList(0, resultSize).toArray(result);
}
return Pattern.compile(regex).split(this, limit);
}
Following the logic of the replace method:
public String replaceAll(String regex, String replacement) {
return Pattern.compile(regex).matcher(this).replaceAll(replacement);
}
The split implementation should be:
public String[] split(String regex, int limit) {
return Pattern.compile(regex).split(this, limit);
}
The performance losses are not far from the ones found on the replace methods. For some reason Oracle gives a fastpath approach on some methods and not others.
Are you sure your proposed method is indeed faster than the regex-based one used by the String
class - not just for your own test input, but for every possible input that a program might throw at it? It relies on String.indexOf
to do substring matching, which is itself a naive implementation that is subject to bad worst-case performance. It's entirely possible that Pattern
implements a more sophisticated matching algorithm such as KMP to avoid redundant comparisons.
In general, the Java team takes performance of the core libraries very seriously, and maintains lots of internal benchmarks using a wide range of real-world data. I've never encountered a situation where regex processing was a bottleneck. My standing advice is to start by writing the simplest possible code that works correctly, and don't even begin to think about rewriting the Java built-ins until profiling proves that it's a bottleneck, and you've exhausted all other avenues of optimization.
Regarding your latest edit - first, I would not describe the split
method as heavily optimized. It handles one special case that happens to be extremely common and is guaranteed not to suffer from the poor worst-case complexity described above for the naive string matching algorithm - that of splitting on a single-character, literal token.
It may very well be that the same special case could be optimized for replace
, and would provide some measurable improvement. But look what it took to achieve that simple optimization - about 50 lines of code. Those lines of code come at a cost, especially when they're a part of what's probably the most widely-used class in the Java library. Cost comes in many forms:
Your question now boils down to "why was this one method optimized to handle a special case, but not the other one?" or even more generally "why was this particular feature not implemented?" Nobody but the original author can answer that definitively, but the answer is almost always that either there is not sufficient demand for that feature, or that the benefit derived from having the feature is deemed not worth the cost of adding it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With