Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Small sets in Java: which datastructure?

Tags:

java

set

Is there any good reference on, or can someone tell me more about, the performance of various Java set implementations on small sets (say 1-100 elements)? The O(1) vs O(log n) story is pretty much irrelevant for these sizes, but since I need to handle millions of these small sets, performance certainly does matter. Most references I find don't mention much about this.

I would need to do the following with these sets (only a few times per set usually):

  • initializing a new set and/or hard-copy an old one
  • adding/removing elements
  • iterating over the set
  • computing the hashCode() of the entire set

I think these are the viable options to compare (assumed comparing/hashing T is almost free):

  • HashSet<T>: seems to be bad at iterating (and hence at hashCode())
  • TreeSet<T>: seems to have ridiculously high overhead
  • LinkedHashSet<T>: no experience with this at all, does it have high overhead?
  • ArrayList<T>: fast in itself but not a set, so ugly tricks like Collections.sort() needed...

Which of the above is generally preferred? Or should I write my own SmallSet<T> class?

like image 897
user1111929 Avatar asked Sep 10 '12 06:09

user1111929


People also ask

What data structure does Set use in Java?

The set interface is present in java.util package and extends the Collection interface is an unordered collection of objects in which duplicate values cannot be stored. It is an interface that implements the mathematical set.

Is ArrayList a data structure?

An ArrayList is a built-in data structure that uses a dynamic array to store its elements. In order to use this data structure, you must import java. util. ArrayList at the top of your program.

Which is faster list or Set in Java?

Core Java bootcamp program with Hands on practice The array is faster in case of access to an element while List is faster in case of adding/deleting an element from the collection.

What is the difference between Set and HashSet?

A Set is a generic set of values with no duplicate elements. A TreeSet is a set where the elements are sorted. A HashSet is a set where the elements are not sorted or ordered.


2 Answers

This is a small Set implementation as arrays:

It is so easy to adopt to your needs :)

Source: https://highlyscalable.wordpress.com/2011/12/29/ultimate-sets-and-maps-for-java-p1/

public class ArraySet {
    private int[] array;
    private int size = 0;

    public ArraySet(int capacity) {
        array = new int[capacity];
        Arrays.fill(array, -1);
    }

    public boolean add(int key) {
        int index = Arrays.binarySearch(array, 0, size, key);
        if (index < 0) {
            int insertIndex = -index-1;

            if(size < array.length - 1) {
                if(insertIndex < size) {
                    System.arraycopy(array, insertIndex, array, insertIndex + 1, size - insertIndex);
                }
                array[insertIndex] = key;
            } else {
                int[] newArray = new int[array.length + 1];
                System.arraycopy(array, 0, newArray, 0, insertIndex);
                System.arraycopy(array, insertIndex, newArray, insertIndex + 1, array.length - insertIndex);
                newArray[insertIndex] = key;
                array = newArray;
            }

            size++;
            return true;
        }
        return false;
    }

    public int get(int position) {
        return array[position];
    }

    public int size() {
        return size;
    }

    public boolean contains(int key) {
        return Arrays.binarySearch(array, key) >= 0;
    }
}
like image 180
Yan King Yin Avatar answered Sep 22 '22 05:09

Yan King Yin


If you are really looking for performance, then there's nothing short of testing for yourself that will help you here:

  • Are you constantly allocating them new? if so, garbage collection might be more relevant than in other cases
  • Are you allocating them once and need quick access? Hashcollisions will have an effect on that
  • Are you constantly changing them?

You'll need to setup a testcase that is similar to your actual use - test long enough that GC kicks in and you see the effects there.

And if you detect critical differences between them, rerun the tests after each and every update of the JVM as the implementations might change.

Until you have done such performance tests, I will give my standard advice: Choose the best readable option and only change that when there are obvious gains from using a less readable. The code maintainers (might be a future you) will thank you for that.

like image 27
Olaf Kock Avatar answered Sep 18 '22 05:09

Olaf Kock