Choosing optimal number of Threads for parallel processing of data

Let's say I have a task that involves processing 1 million sentences.

For each sentence, I need to do something with it, and it doesn't matter in what particular order they are processed.

In my Java program, the work is partitioned into a set of futures, each with a callable that defines the unit of work to be done on a chunk of sentences. I'm looking for a way to choose the number of threads I allocate to work through the big block of sentences, and to later recombine the results from each thread.

What would be the maximum number of threads I could use that would give me optimal performance in terms of speed before I saw diminishing returns?

Also, why is the logic that allocating more threads means more can be done at once incorrect?

Asked Jun 10 '14 by Adam Bronfin


2 Answers

In practice, it can be difficult to find the optimal number of threads, and that number will likely vary each time you run the program. Theoretically, though, the optimal number of threads is the number of cores you have on your machine. If your cores are "hyper-threaded" (as Intel calls it), each core can run 2 threads, and in that case the optimal number of threads is double the number of cores on your machine.
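For reference, the JVM reports the number of logical processors (which already accounts for hyper-threading) at runtime:

  // Logical processors visible to the JVM; on a hyper-threaded machine
  // this is typically twice the number of physical cores.
  int logicalCores = Runtime.getRuntime().availableProcessors();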

Also, why is the logic that allocating more threads means more can be
done at once incorrect?

The assumption that allocating more threads leads to more work being done concurrently is false because only 1 thread (or 2 threads, if the cores are hyper-threaded) can run at a single time on each core.

So assume I have a quad-core machine that is not hyper-threaded. In that case, I can run up to 4 threads concurrently, so my maximum throughput should be achieved with 4 threads. Say I try to run 8 threads on the same setup. Then the kernel will schedule these threads back and forth (by way of a context switch), blocking one thread in order to let another thread run. So, at most, the work of 4 threads can run at any single time.
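To see the diminishing returns yourself, you can time a purely CPU-bound job at different thread counts. A self-contained sketch; busyWork is just a stand-in for your per-sentence processing and all the numbers are arbitrary:

  import java.util.*;
  import java.util.concurrent.*;

  public class ThreadScaling {
    // Stand-in CPU-bound work; substitute your real processing here.
    static long busyWork() {
      long x = 0;
      for (int i = 0; i < 10_000_000; i++) x += i * 31;
      return x;
    }

    public static void main(String[] args) throws Exception {
      for (int threads = 1; threads <= 16; threads *= 2) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Callable<Long> task = ThreadScaling::busyWork;
        long start = System.nanoTime();
        List<Future<Long>> futures = new ArrayList<>();
        for (int i = 0; i < 64; i++) futures.add(pool.submit(task));
        for (Future<Long> f : futures) f.get();  // wait for all 64 jobs
        pool.shutdown();
        System.out.printf("%2d threads: %d ms%n",
            threads, (System.nanoTime() - start) / 1_000_000);
      }
    }
  }

On a quad-core machine without hyper-threading, you would expect the times to stop improving somewhere around 4 threads.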

For more information on this, it is extremely helpful to look up "context switch" in the Linux kernel. That will provide you with all the information you ever wanted on this subject.

Also, note that there is a difference between "user-level threads" and "kernel-level threads". This is an important distinction if you research this topic further, but it is outside the scope of this question.

Answered Oct 19 '22 by Rich E


Is your load I/O bound? I/O bound means the CPU waits most of the time for I/O operations to complete. Adding more threads means sending more requests to the I/O subsystem or a remote server, etc. This may have positive effects, because requests to storage can be reordered and combined (scatter/gather), but only until you reach the maximum possible I/O bandwidth. Adding more threads may also have adverse effects, e.g. when more random I/O requests are executed on a conventional hard disk.

If your load is I/O bound, there are various approaches to optimizing the I/O operations. My first choice is to load data in larger chunks and in a more streaming manner, if possible. The next is to use external index structures or databases if you have lots of point accesses, or to add more disks if raw bandwidth is what's missing. In any case, optimizing I/O is another broad topic...
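As an illustration of the chunked, streaming style of loading (a sketch only; the file name and the process method are placeholders for your own code):

  import java.io.*;
  import java.nio.file.*;
  import java.util.*;

  // Read sentences sequentially in chunks of 10,000 instead of making
  // many small point accesses. "sentences.txt" and process(...) are
  // hypothetical placeholders.
  try (BufferedReader reader = Files.newBufferedReader(Paths.get("sentences.txt"))) {
    List<String> chunk = new ArrayList<>(10_000);
    String line;
    while ((line = reader.readLine()) != null) {
      chunk.add(line);
      if (chunk.size() == 10_000) {
        process(chunk);              // hand the whole chunk to a worker
        chunk = new ArrayList<>(10_000);
      }
    }
    if (!chunk.isEmpty()) process(chunk);  // trailing partial chunk
  }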

Is your load CPU bound? This means CPU power is the limiting factor for processing, not I/O bandwidth. Optimizing your I/O subsystem makes no sense in this case; you need more or faster CPUs, and you need to distribute the load.

In your particular case, you can load all data into memory, so your load is purely CPU bound. For CPU-bound loads, it is best to use a thread count equal to the number of CPU cores in your machine. Choosing the number of CPUs as the thread count is rather straightforward and obvious. It is also discussed in the question Optimal number of threads per core.

In practice, to execute the tasks in your Callable objects, use an ExecutorService constructed this way:

  import java.util.concurrent.*;

  int maxThreadCount = Runtime.getRuntime().availableProcessors();
  ExecutorService executor =
    new ThreadPoolExecutor(
      0, maxThreadCount - 1,                          // core size 0, max = cores - 1
      1, TimeUnit.SECONDS,                            // idle threads die after 1 second
      new LinkedBlockingDeque<>(maxThreadCount * 2),  // small bounded queue
      Executors.defaultThreadFactory(),
      new ThreadPoolExecutor.CallerRunsPolicy());     // caller runs tasks on overflow

Now do the processing by submitting your tasks and waiting until everything is finished:

  while (moreToDo) {
    Callable<?> c = ...  // your unit of work for one chunk of sentences
    executor.submit(c);
  }
  executor.shutdown();   // no new tasks are accepted after this
  executor.awaitTermination(Long.MAX_VALUE, TimeUnit.MILLISECONDS);  // wait for all tasks

The thread pool parameters are a little tricky. Here is a detailed explanation:

By using new ThreadPoolExecutor.CallerRunsPolicy(), the task-generating thread stalls when all threads in the pool are in use. More precisely, the calling thread executes a task itself when the queue limit is reached.

maxThreadCount - 1: Since the caller thread also works on tasks, the thread pool size is reduced by one.

new LinkedBlockingDeque<>(maxThreadCount * 2): A small value is chosen for the capacity of the blocking queue. The idea is that, with some tasks waiting in the queue, the pool threads get new jobs while the caller thread is processing a job itself. If tasks vary a lot in running time, this is not totally perfect; the ThreadPoolExecutor should have a cleaner approach for this use case. The better approach would be to use a SynchronousQueue and make submit wait until a thread is available. However, the ThreadPoolExecutor has no "always queue" operation mode; instead, it tries to queue and invokes the rejection handler if queueing is not possible right now.
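One workaround I know of (a sketch, not part of the code above) is a custom rejection handler that blocks the submitting thread until the queue has room:

  // Blocks the submitter until a queue slot frees up, instead of
  // running the task in the caller or throwing an exception.
  RejectedExecutionHandler blockOnFullQueue = (task, pool) -> {
    try {
      pool.getQueue().put(task);  // waits for space in the queue
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RejectedExecutionException("interrupted while queueing", e);
    }
  };

Note that after a shutdown this can enqueue tasks that will never be executed, so it is only safe while the pool is running.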

This should do it in your scenario.
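If you also need to recombine the results of each task, as the question asks, keep the Future returned by each submit and read the futures back after submission. A sketch of that variant; the Result type and the combine method are placeholders for your own code:

  List<Future<Result>> futures = new ArrayList<>();
  while (moreToDo) {
    Callable<Result> c = ...        // unit of work returning a partial result
    futures.add(executor.submit(c));
  }
  executor.shutdown();
  for (Future<Result> f : futures) {
    combine(f.get());               // blocks until that task has finished
  }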

There may be loads where you don't know in advance whether they are CPU bound or I/O bound, and, to complicate things, the load may change its behavior during processing. My idea for tackling this is an adaptive algorithm similar to the TCP congestion avoidance algorithm. Congestion avoidance in TCP is exactly the same sort of problem: "I want maximum throughput, but I don't know my resources." Has anybody worked on this?
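To my knowledge there is nothing standard for this in java.util.concurrent. A very rough sketch of the AIMD idea (additive increase, multiplicative decrease, as in TCP) applied to the pool size might look like this; all names and thresholds are made up for illustration, and measureTasksPerSecond is a hypothetical metric of your own:

  // Grow the pool while throughput improves, cut it back when it drops.
  ThreadPoolExecutor pool = ...
  double last = 0;
  while (running) {
    Thread.sleep(1_000);                           // sample once per second
    double throughput = measureTasksPerSecond();   // your own measurement
    int size = pool.getMaximumPoolSize();
    if (throughput >= last) {
      pool.setMaximumPoolSize(size + 1);                   // additive increase
    } else {
      pool.setMaximumPoolSize(Math.max(1, size / 2));      // multiplicative decrease
    }
    last = throughput;
  }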

Answered Oct 19 '22 by cruftex