
Out of memory: Multithreading using HashSet

I have implemented a Java program. It is basically a multi-threaded service with a fixed number of threads. Each thread takes one task at a time and creates a HashSet, whose size can vary from 10 to 20,000+ items in a single HashSet. At the end of each task, the result is added to a shared List, guarded with synchronized.

The problem is that at some point I start getting an out-of-memory exception. After doing a bit of research, I found that this memory exception occurs when the GC is busy clearing memory, and at that point it stops the whole world from executing anything.

Please give me suggestions on how to deal with such a large amount of data. Is HashSet the correct data structure to use? How do I deal with the memory exception? One way is to call System.gc(), which is again not good, as it will slow down the whole process. Or is it possible to dispose of the HashSet hsN after I add it to the shared List?

Please let me know your thoughts and guide me on where I am going wrong. This service is going to deal with a huge amount of data processing.

Thanks

//business object - to save the result of thread execution

public class Location {

    int taskIndex;
    HashSet<Integer> hsN;

    Location(int taskIndex, HashSet<Integer> hsN) {
        this.taskIndex = taskIndex;
        this.hsN = hsN;
    }
}



//task to be performed by each thread

public class MyTask implements Runnable {

    private final int task;

    MyTask(int task) {
        this.task = task;
    }

    @Override
    public void run() {
        // GiveMeResult returns a collection of Integers whose size
        // varies from 10 to 20,000+
        HashSet<Integer> hsN = GiveMeResult(task);

        synchronized (Main.locations) {
            Main.locations.add(new Location(task, hsN));
        }
    }
}


public class Main {

    private static final int NTHREDS = 8;
    static final List<Location> locations = new ArrayList<>();

    public static void main(String[] args) throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(NTHREDS);
        for (int i = 0; i < 216000; i++) {
            executor.execute(new MyTask(i));
        }
        // Accept no new tasks and let the queued ones finish
        executor.shutdown();
        // Block until all tasks have completed
        executor.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
        System.out.println("Finished all threads");
    }
}

For such an implementation, is Java the best choice, or C# with .NET 4?

asked Mar 08 '26 by user1213831

2 Answers

A couple of issues that I can see:

  • You synchronize on the MyTask object, which is created separately for each execution. You should synchronize on a shared object, preferably the one you are modifying, i.e. the locations list.

  • 216,000 runs, multiplied by, say, 10,000 returned objects each, multiplied by a minimum of 12 bytes per Integer object, is about 24 GB of memory. Do you even have that much physical memory available on your computer, let alone available to the JVM?

    32-bit JVMs have a heap size limit of less than 2 GB. On a 64-bit JVM on the other hand, an Integer object takes about 16 bytes, which raises the memory requirements to over 30 GB.

    With these numbers it's hardly surprising that you get an OutOfMemoryError...

    PS: If you do have that much physical memory available and you still think that you are doing the right thing, you might want to have a look at tuning the JVM heap size.
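To make the arithmetic concrete, the estimate above can be reproduced directly; the per-Integer sizes below are the approximations used in this answer, not exact figures for any particular JVM:

```java
public class MemoryEstimate {
    public static void main(String[] args) {
        long tasks = 216_000L;
        long objectsPerTask = 10_000L;       // the "say 10,000" assumption
        long bytesPerInteger32 = 12L;        // approx. Integer size, 32-bit JVM
        long bytesPerInteger64 = 16L;        // approx. Integer size, 64-bit JVM

        long total32 = tasks * objectsPerTask * bytesPerInteger32;
        long total64 = tasks * objectsPerTask * bytesPerInteger64;

        System.out.println(total32 / (1L << 30) + " GiB"); // 24 GiB
        System.out.println(total64 / (1L << 30) + " GiB"); // 32 GiB
    }
}
```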

EDIT:

Even with 25 GB of memory available to the JVM, it could still be pushing it:

  • Each Integer object requires 16 bytes on modern 64-bit JVMs.

  • You also need an 8-byte reference that will point to it, regardless of which List implementation you are using.

  • If you are using a linked list implementation, each entry will also have an overhead of at least 24 bytes for the list entry object.

At best you could hope to store about 1,000,000,000 Integer objects in 25GB - half that if you are using a linked list. That means that each task could not produce more than 5,000 (2,500 respectively) objects on average without causing an error.

I am unsure of your exact requirements, but have you considered returning a more compact object? For example, an int[] array produced from each HashSet would keep memory usage to the minimum of 4 bytes per result, without the object container overhead.
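A rough sketch of that idea (the method name is hypothetical): each task unboxes its HashSet into a sorted primitive array before publishing it to the shared list:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class CompactResult {

    // Convert a boxed result set into a sorted primitive array:
    // 4 bytes per value instead of a boxed Integer plus entry overhead.
    static int[] toSortedArray(Set<Integer> result) {
        int[] packed = new int[result.size()];
        int i = 0;
        for (int value : result) {
            packed[i++] = value;
        }
        Arrays.sort(packed);
        return packed;
    }

    public static void main(String[] args) {
        Set<Integer> hsN = new HashSet<>(Arrays.asList(42, 7, 19));
        System.out.println(Arrays.toString(toSortedArray(hsN))); // [7, 19, 42]
    }
}
```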

EDIT 2:

I just realized that you are storing the HashSet objects themselves in the list. A HashSet uses a HashMap internally, which in turn uses a HashMap.Entry object for each entry. On a 64-bit JVM the entry object consumes about 40 bytes of memory in addition to the stored object:

  • The key reference which points to the Integer object - 8 bytes.

  • The value reference (always null in a HashSet) - 8 bytes.

  • The next entry reference - 8 bytes.

  • The hash value - 4 bytes.

  • The object overhead - 8 bytes.

  • Object padding - 4 bytes.

I.e. for each Integer object you need 56 bytes of storage in a HashSet. With the typical HashMap load factor of 0.75, you should add another 10 or so bytes for the HashMap array references. With 66 bytes per Integer you can only store about 400,000,000 such objects in 25 GB, without taking into account the rest of your application and any other overhead. That's less than 2,000 objects per task...

EDIT 3:

You would be better off storing a sorted int[] array instead of a HashSet. That array is searchable in logarithmic time for any arbitrary integer and minimizes memory consumption to 4 bytes per number. Considering the memory I/O, it would also be as fast as (or faster than) the HashSet implementation.
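As a sketch of the lookup side, assuming the results are kept as sorted int[] arrays, Arrays.binarySearch gives the logarithmic-time membership test mentioned above:

```java
import java.util.Arrays;

public class SortedLookup {

    // Membership test on a sorted int[]: O(log n), no boxing.
    static boolean contains(int[] sorted, int key) {
        return Arrays.binarySearch(sorted, key) >= 0;
    }

    public static void main(String[] args) {
        int[] sorted = {3, 8, 15, 15000, 19999};
        System.out.println(contains(sorted, 15)); // true
        System.out.println(contains(sorted, 16)); // false
    }
}
```

Note that the array must already be sorted; Arrays.binarySearch returns an undefined result otherwise.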

answered Mar 09 '26 by thkala

If you want a more memory-efficient solution, I would use TIntHashSet or a sorted int[]. Note that a Full GC is performed before an OutOfMemoryError is thrown; the GC pauses are not the cause of the problem, but a symptom. The cause is that you are using more memory than the amount you are allowing as your maximum heap.

Another solution is to create tasks as you go instead of creating all your tasks in advance. You can do this by breaking your work into NTHREDS tasks instead. It appears that you are trying to retain every solution; if so, this won't help much, and instead you need to find a way to reduce consumption.

Depending on your distribution of numbers, a BitSet may be more efficient. This uses one bit per integer in a range. E.g. if your range is 0 - 20,000, this will use only 2.5 KB.
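A minimal BitSet sketch for that range, matching the 2.5 KB figure (20,000 bits = 2,500 bytes regardless of how many values are set):

```java
import java.util.BitSet;

public class BitSetResult {

    public static void main(String[] args) {
        // One bit per possible value in the range 0..19,999.
        BitSet present = new BitSet(20_000);
        present.set(42);
        present.set(19_999);

        System.out.println(present.get(42));       // true
        System.out.println(present.get(43));       // false
        System.out.println(present.cardinality()); // 2
    }
}
```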

answered Mar 09 '26 by Peter Lawrey