Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does java.util.Collections.contains() perform faster than a linear search?

I've been fooling around with a bunch of different ways of searching collections, collections of collections, etc. Doing lots of stupid little tests to verify my understanding. Here is one which boggles me (source code further below).

In short, I am generating N random integers and adding them to a list. The list is NOT sorted. I then use Collections.contains() to look for a value in the list. I intentionally look for a value that I know won't be there, because I want to ensure that the entire list space is probed. I time this search.

I then do another linear search manually, iterating through each element of the list and checking if it matches my target. I also time this search.

On average, the second search takes 33% longer than the first one. By my logic, the first search must also be linear, because the list is unsorted. The only possibility I could think of (which I immediately discard) is that Java is making a sorted copy of my list just for the search, but (1) I did not authorize that usage of memory space and (2) I would think that would result in MUCH more significant time savings with such a large N.

So if both searches are linear, they should both take the same amount of time. Somehow the Collections class has optimized this search, but I can't figure out how. So... what am I missing?

import java.util.*;

public class ListSearch {

    public static void main(String[] args) {

        int N = 10000000; // number of ints to add to the list
        int high = 100; // upper limit for random int generation

        List<Integer> ints;
        int target = -1; // target will not be found, forces search of entire list space

        long start;
        long end;

        ints = new ArrayList<Integer>();
        start = System.currentTimeMillis();
        System.out.print("Generating new list... ");
        for (int i = 0; i < N; i++) {
            ints.add(((int) (Math.random() * high)) + 1);
        }
        end = System.currentTimeMillis();
        System.out.println("took "  + (end-start) + "ms.");
        start = System.currentTimeMillis();
        System.out.print("Searching list for target (method 1)... ");
        if (ints.contains(target)) {
            // nothing
        }
        end = System.currentTimeMillis();
        System.out.println(" Took "  + (end-start) + "ms.");

        System.out.println();

        ints = new ArrayList<Integer>();
        start = System.currentTimeMillis();
        System.out.print("Generating new list... ");
        for (int i = 0; i < N; i++) {
            ints.add(((int) (Math.random() * high)) + 1);
        }
        end = System.currentTimeMillis();
        System.out.println("took "  + (end-start) + "ms.");
        start = System.currentTimeMillis();
        System.out.print("Searching list for target (method 2)... ");
        for (Integer i : ints) {
            // nothing
        }
        end = System.currentTimeMillis();
        System.out.println(" Took "  + (end-start) + "ms.");
    }
}

EDIT: Below is a new version of this code. The interesting thing is that now my manual linear loop performs 16% faster than the contains method (NOTE: both are designed to intentionally search the entire list space, so I know they are equal in number of iterations). I can't account for this 16% gain... more confusion.

import java.util.*;

public class ListSearch {

    public static void main(String[] args) {

        int N = 10000000; // number of ints to add to the list
        int high = 100; // upper limit for random int generation

        List<Integer> ints;
        int target = -1; // target will not be found, forces search of entire list space

        long start;
        long end;

        ints = new ArrayList<Integer>();
        start = System.currentTimeMillis();
        System.out.print("Generating new list... ");
        for (int i = 0; i < N; i++) {
            ints.add(((int) (Math.random() * high)) + 1);
        }
        end = System.currentTimeMillis();
        System.out.println("took "  + (end-start) + "ms.");
        start = System.currentTimeMillis();
        System.out.print("Searching list for target (method 1)... ");
        if (ints.contains(target)) {
            System.out.println("hit");
        }
        end = System.currentTimeMillis();
        System.out.println(" Took "  + (end-start) + "ms.");

        System.out.println();

        ints = new ArrayList<Integer>();
        start = System.currentTimeMillis();
        System.out.print("Generating new list... ");
        for (int i = 0; i < N; i++) {
            ints.add(((int) (Math.random() * high)) + 1);
        }
        end = System.currentTimeMillis();
        System.out.println("took "  + (end-start) + "ms.");
        start = System.currentTimeMillis();
        System.out.print("Searching list for target (method 2)... ");
        for (int i = 0; i < N; i++) {
            if (ints.get(i) == target) {
                System.out.println("hit");
            }
        }
        end = System.currentTimeMillis();
        System.out.println(" Took "  + (end-start) + "ms.");
    }
}
like image 874
The111 Avatar asked Oct 20 '12 04:10

The111


People also ask

Which collection is faster for searching in Java?

If you need fast access to elements using index, ArrayList should be choice. If you need fast access to elements using a key, use HashMap. If you need fast add and removal of elements, use LinkedList (but it has a very poor seeking performance).

What is the complexity of Contains method in Java?

contains() method requires O(n) time. So the time we spend to find a specific object here depends on the number of items we have in the array.

Which collection is best for search operation?

1. HashSet and LinkedHashSet have the best performance on search operations.

Which is faster set or List in Java?

As Will has noted, because of the dictionary structure HashSet is probably a bit slower than an ArrayList (unless you want to insert "between" existing elements). It also is a bit larger.


3 Answers

Your comparison code is buggy, and this is distorting your results.

This does search for the target

    if (ints.contains(target)) {
        // nothing
    }

But this does not!

    for (Integer i : ints) {
        // nothing
    }

You are actually just iterating over the list elements without testing them.

Having said that, the second version is slower than the first version for one or more of the following reasons:

  • The first version will be iterating the backing array using a simple for loop and an index. The second version is equivalent to the following:

    Iterator<Integer> it = ints.iterator();
    while (it.hasNext()) {
        Integer i = (Integer) it.next();
    }
    

    In other words, each time around the loop involves 2 method calls and a typecast1.

  • The first version will return true as soon as it gets a match. Because of the bug in your implementation, the second version will iterate the entire list every time. In fact, given the choice of N and high, this effect is the most likely the main cause of the difference in performance.

1 - Actually, it is not entirely clear what the JIT compiler will do with all of this. It could in theory inline the method calls, deduce that the typcaset is unnecessary, or even optimize away the entire loop. On the other hand, there are factors that may inhibit these optimizations. For instance, ints is declared as List<Integer>which could inhibit inlining ... unless the JIT is able to deduce that the actual type is always the same.


Your results are possibly also distorted for another reason. Your code does not take account of JVM warmup. Read this Question for more details: How do I write a correct micro-benchmark in Java?

like image 96
Stephen C Avatar answered Oct 07 '22 23:10

Stephen C


Here is the difference:

When you use contains, it use the internal array of the object and does a search like this:

    for (int i = 0; i < size; i++)
        if (searchObject.equals(listObject[i]))
            return true;
    return false;

Here when it tries to get the ith element, it directly gets the ith element object form the internal array.

When you write your own, you are writing like:

    for (Integer i : ints) {
        // nothing
    }

its equivalent to :

   for(Iterator<Integer> iter= ints.iterator(); iter.hasNext(); ) {
         Integer i = iter.next();
   }

Which is performing much more steps than the contains.

like image 33
Yogendra Singh Avatar answered Oct 07 '22 22:10

Yogendra Singh


So I'm not entirely sure you're testing anything. Javac (compiler) is smart enough to realize that you don't have any code inside your for loop and your if statement. In this case, Java will remove that code from it's compilation. The reason you might be getting time back is because you're actually counting the time it takes to print a string. System output time can vary drastically depending on what your system is doing. When writing timing tests, any I/O can create invalid tests.

First I would remove the string prints from inside your timing.

Second of all, ArrayList.contains is linear. It doesn't use the special for loop like you're doing. Your loop has some overhead of getting the iterator from the collection and then iterating over it. This is how the special for loop works behind the scenes.

Hope this helps.

like image 1
Kurtymckurt Avatar answered Oct 07 '22 23:10

Kurtymckurt