Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

String: Why is indexOf significantely faster than contains?

Why is indexOf significantly faster than contains if the latter is merely a wrapper of the first one?

Code from the Java API:

public boolean contains(CharSequence s) {
     return indexOf(s.toString()) > -1;
}

The chosen answer in this thread shows a short test, which shows the difference.

The chosen answer in this thread states that the overhead of the additional method call does not matter. So, why the difference?

Please read my edit: Almost everyone is saying the micro-benchmark is flawed. Strange thing is, it reflects exactly my use-case.

Actually, I was not doubting that indexOf is faster than contains (for my use-case) in the first place, I just wanted to know why.

My intention was never to write a benchmark! I was just searching for the most-efficient way to test if a string contains another one (for my application which has nothing to do with a benchmark but a 'real-life situation').

like image 296
Willi Mentzel Avatar asked Aug 19 '16 09:08

Willi Mentzel


People also ask

Which is faster IndexOf or contains?

IndexOf(string) has no options and Contains() uses an Ordinal compare (a byte-by-byte comparison rather than trying to perform a smart compare, for example, e with é). So IndexOf will be marginally faster (in theory) as IndexOf goes straight to a string search using FindNLSString from kernel32.

Does string contain slow Java?

Actually, the default string. Contains() or IndexOf() are horribly slow. But it comes to a simple option to make it far much faster.

How does IndexOf string work?

The indexOf() method returns the position of the first occurrence of specified character(s) in a string. Tip: Use the lastIndexOf method to return the position of the last occurrence of specified character(s) in a string.

How fast is contain?

Contains() ought to be around 50 nanoseconds. So the overall operation takes 50.05 microseconds. A faster Contains might take half the time, the overall operation takes 50.025 microseconds.


3 Answers

The contains method is implemented as:

public boolean contains(CharSequence s) {
    return indexOf(s.toString()) > -1;
}

Which means that it may be slower if CharSequence s is not a java.lang.String because calling s.toString() will result in the allocation and initialization of a new string instance. If s is a string - then there should not be any measurable difference.

PS: The test from here is flawed: https://stackoverflow.com/a/18340277/2588800 Java initially execute in "interpreted" mode, which is quite slow, and when it detects that a piece of codes gets executed a lot of times it compiles it to native code in order to speed it up (read about JIT compilation).

As you can see contains internally calls indexOf, which means that indexOf eventually will get compiled to native. So when he tests the indexOf (notice that he tests it after contains) it might have already been compiled to native code. And that's the reason for the time difference. Try to reverse the order of that tests - first test indexOf and then contains and I bet you'll see just the opposite results.

JMH to the rescue

Benchmark                            Mode  Cnt   Score   Error   Units
StringSearchBenchmark.testContains  thrpt  500  22,071 ± 0,269  ops/us
StringSearchBenchmark.testIndexOf   thrpt  500  22,654 ± 0,233  ops/us

As you can see the difference is negligible and might be casued by the additional method calls (indexOf() + toString()) and load on the system.

Source-Code:

@Fork(1)
@State(Scope.Benchmark)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Measurement(iterations = 500, time = 50, timeUnit = TimeUnit.MILLISECONDS)
@Warmup(iterations = 10)
@BenchmarkMode(Mode.Throughput)
public class StringSearchBenchmark {
    private static final String STATE = "absdefghijklmnopqrstuvwxyzabsdefghijklmnopqrstuvwxyzabsdefghijklmnopqrstuvwxyzabsdefghijklmnopqrstuvwxyz";
    private static final String SEARCH_TERM = "abcd";

    @Benchmark
    public void testContains(Blackhole sink) {
        sink.consume(STATE.contains(SEARCH_TERM));
    }

    @Benchmark
    public void testIndexOf(Blackhole sink) {
        sink.consume(STATE.indexOf(SEARCH_TERM));
    }
}
like image 132
Svetlin Zarev Avatar answered Oct 12 '22 23:10

Svetlin Zarev


As others have said, the benchmark was heavily flawed - performance testing of Java code does not work like that - you must warm it up to ensure that all classes have been loaded and parsed, that all objects have been loaded into memory, and that any compiling down to native code, e.g. via HotSpot, has been done. A naïve benchmark where you just run the code once in the main method is not really going to fly. A much better choice is to use something like JMH. Given the following test:

package com.stackoverflow.example;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Measurement(time = 250, timeUnit = TimeUnit.MILLISECONDS)    
public class MyBenchmark {

    private static final String[] names = new String[]{"jack", "jackson", "jason", "jadifu"};

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(MyBenchmark.class.getSimpleName())
                .forks(1)
                .build();

        new Runner(opt).run();
    }

    @Benchmark
    public void contains() {
        names[0].contains("ja");
    }

    @Benchmark
    public void containsExplicit() {
        names[0].indexOf("ja".toString());
    }

    @Benchmark
    public void indexOf() {
        names[0].indexOf("ja");
    }

    @Benchmark
    public void matches() {
        names[0].matches(".*ja.*");
    }
}

I get the following results:

Benchmark                      Mode  Cnt     Score    Error   Units
MyBenchmark.contains          thrpt   20   219.770 ±  2.032  ops/us
MyBenchmark.containsExplicit  thrpt   20  1820.024 ± 20.583  ops/us
MyBenchmark.indexOf           thrpt   20  1828.234 ± 18.744  ops/us
MyBenchmark.matches           thrpt   20     3.933 ±  0.052  ops/us

Now, that's fairly interesting, as it still suggests that contains is significantly slower than indexOf. However, if I change the test up, very slightly, to the following:

@Benchmark
public void contains() {
    assert names[0].contains("ja");
}

@Benchmark
public void containsExplicit() {
    assert names[0].indexOf("ja".toString()) == 0;
}

@Benchmark
public void indexOf() {
    assert names[0].indexOf("ja") == 0;
}

@Benchmark
public void matches() {
   assert names[0].matches(".*ja.*");
}

I get the following results:

Benchmark                      Mode  Cnt    Score   Error   Units
MyBenchmark.contains          thrpt   20  220.480 ± 1.266  ops/us
MyBenchmark.containsExplicit  thrpt   20  219.962 ± 2.329  ops/us
MyBenchmark.indexOf           thrpt   20  219.706 ± 2.401  ops/us
MyBenchmark.matches           thrpt   20    3.766 ± 0.026  ops/us

In this, we're getting the same result for contains, but indexOf has slowed down to match contains. That's a very interesting result. Why is this happening?

Probably due to HotSpot recognising that the result of the indexOf call is never inspected, and since it's taking a final class (String), HotSpot is likely able to guarantee that there are no side effects to the call. So if we're not looking at the result and there are no side effects to the call, why are we making it? HotSpot is able to realise that a method call is pointless, and remove it altogether, which could be what's happening here. It would certainly explain the order of magnitude difference.

Why doesn't this work for contains, though? I can only assume that it's because contains accepts a CharSequence, not a String, which is an abstract class, and that's just enough to prevent HotSpot from optimising the method call away.

This also indicates that micro-benchmarks are hard in Java - there is a lot going on beneath the surface to optimise your running code, and a few shortcuts can result in extremely inaccurate benchmarks.

like image 39
ipsi Avatar answered Oct 12 '22 23:10

ipsi


indexOf is sample of intrinsic method in Hotspot JVM. It means that java code from java.lang.String is not used at all for this method. There is special native version of this method. You can find list of such methods here: do_intrinsic(_indexOf

like image 25
sibnick Avatar answered Oct 13 '22 01:10

sibnick