Why is indexOf
significantly faster than contains
if the latter is merely a wrapper of the first one?
Code from the Java API:
public boolean contains(CharSequence s) {
return indexOf(s.toString()) > -1;
}
The chosen answer in this thread shows a short test, which shows the difference.
The chosen answer in this thread states that the overhead of the additional method call does not matter. So, why the difference?
Please read my edit: Almost everyone is saying the micro-benchmark is flawed. Strange thing is, it reflects exactly my use-case.
Actually, I was not doubting that indexOf
is faster than contains
(for my use-case) in the first place, I just wanted to know why.
My intention was never to write a benchmark! I was just searching for the most-efficient way to test if a string contains another one (for my application which has nothing to do with a benchmark but a 'real-life situation').
IndexOf(string) has no options and Contains() uses an Ordinal compare (a byte-by-byte comparison rather than trying to perform a smart compare, for example, e with é). So IndexOf will be marginally faster (in theory) as IndexOf goes straight to a string search using FindNLSString from kernel32.
Actually, the default string. Contains() or IndexOf() are horribly slow. But it comes to a simple option to make it far much faster.
The indexOf() method returns the position of the first occurrence of specified character(s) in a string. Tip: Use the lastIndexOf method to return the position of the last occurrence of specified character(s) in a string.
Contains() ought to be around 50 nanoseconds. So the overall operation takes 50.05 microseconds. A faster Contains might take half the time, the overall operation takes 50.025 microseconds.
The contains
method is implemented as:
public boolean contains(CharSequence s) {
return indexOf(s.toString()) > -1;
}
Which means that it may be slower if CharSequence s
is not a java.lang.String
because calling s.toString()
will result in the allocation and initialization of a new string instance. If s
is a string - then there should not be any measurable difference.
PS: The test from here is flawed: https://stackoverflow.com/a/18340277/2588800 Java initially execute in "interpreted" mode, which is quite slow, and when it detects that a piece of codes gets executed a lot of times it compiles it to native code in order to speed it up (read about JIT compilation).
As you can see contains
internally calls indexOf
, which means that indexOf
eventually will get compiled to native. So when he tests the indexOf
(notice that he tests it after contains
) it might have already been compiled to native code. And that's the reason for the time difference. Try to reverse the order of that tests - first test indexOf
and then contains
and I bet you'll see just the opposite results.
Benchmark Mode Cnt Score Error Units
StringSearchBenchmark.testContains thrpt 500 22,071 ± 0,269 ops/us
StringSearchBenchmark.testIndexOf thrpt 500 22,654 ± 0,233 ops/us
As you can see the difference is negligible and might be casued by the additional method calls (indexOf()
+ toString()
) and load on the system.
Source-Code:
@Fork(1)
@State(Scope.Benchmark)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Measurement(iterations = 500, time = 50, timeUnit = TimeUnit.MILLISECONDS)
@Warmup(iterations = 10)
@BenchmarkMode(Mode.Throughput)
public class StringSearchBenchmark {
private static final String STATE = "absdefghijklmnopqrstuvwxyzabsdefghijklmnopqrstuvwxyzabsdefghijklmnopqrstuvwxyzabsdefghijklmnopqrstuvwxyz";
private static final String SEARCH_TERM = "abcd";
@Benchmark
public void testContains(Blackhole sink) {
sink.consume(STATE.contains(SEARCH_TERM));
}
@Benchmark
public void testIndexOf(Blackhole sink) {
sink.consume(STATE.indexOf(SEARCH_TERM));
}
}
As others have said, the benchmark was heavily flawed - performance testing of Java code does not work like that - you must warm it up to ensure that all classes have been loaded and parsed, that all objects have been loaded into memory, and that any compiling down to native code, e.g. via HotSpot, has been done. A naïve benchmark where you just run the code once in the main method is not really going to fly. A much better choice is to use something like JMH. Given the following test:
package com.stackoverflow.example;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Measurement(time = 250, timeUnit = TimeUnit.MILLISECONDS)
public class MyBenchmark {
private static final String[] names = new String[]{"jack", "jackson", "jason", "jadifu"};
public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(MyBenchmark.class.getSimpleName())
.forks(1)
.build();
new Runner(opt).run();
}
@Benchmark
public void contains() {
names[0].contains("ja");
}
@Benchmark
public void containsExplicit() {
names[0].indexOf("ja".toString());
}
@Benchmark
public void indexOf() {
names[0].indexOf("ja");
}
@Benchmark
public void matches() {
names[0].matches(".*ja.*");
}
}
I get the following results:
Benchmark Mode Cnt Score Error Units
MyBenchmark.contains thrpt 20 219.770 ± 2.032 ops/us
MyBenchmark.containsExplicit thrpt 20 1820.024 ± 20.583 ops/us
MyBenchmark.indexOf thrpt 20 1828.234 ± 18.744 ops/us
MyBenchmark.matches thrpt 20 3.933 ± 0.052 ops/us
Now, that's fairly interesting, as it still suggests that contains
is significantly slower than indexOf
. However, if I change the test up, very slightly, to the following:
@Benchmark
public void contains() {
assert names[0].contains("ja");
}
@Benchmark
public void containsExplicit() {
assert names[0].indexOf("ja".toString()) == 0;
}
@Benchmark
public void indexOf() {
assert names[0].indexOf("ja") == 0;
}
@Benchmark
public void matches() {
assert names[0].matches(".*ja.*");
}
I get the following results:
Benchmark Mode Cnt Score Error Units
MyBenchmark.contains thrpt 20 220.480 ± 1.266 ops/us
MyBenchmark.containsExplicit thrpt 20 219.962 ± 2.329 ops/us
MyBenchmark.indexOf thrpt 20 219.706 ± 2.401 ops/us
MyBenchmark.matches thrpt 20 3.766 ± 0.026 ops/us
In this, we're getting the same result for contains, but indexOf
has slowed down to match contains
. That's a very interesting result. Why is this happening?
Probably due to HotSpot recognising that the result of the indexOf
call is never inspected, and since it's taking a final
class (String
), HotSpot is likely able to guarantee that there are no side effects to the call. So if we're not looking at the result and there are no side effects to the call, why are we making it? HotSpot is able to realise that a method call is pointless, and remove it altogether, which could be what's happening here. It would certainly explain the order of magnitude difference.
Why doesn't this work for contains
, though? I can only assume that it's because contains
accepts a CharSequence
, not a String
, which is an abstract class, and that's just enough to prevent HotSpot from optimising the method call away.
This also indicates that micro-benchmarks are hard in Java - there is a lot going on beneath the surface to optimise your running code, and a few shortcuts can result in extremely inaccurate benchmarks.
indexOf is sample of intrinsic method in Hotspot JVM. It means that java code from java.lang.String is not used at all for this method. There is special native version of this method. You can find list of such methods here: do_intrinsic(_indexOf
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With