
Parallel code bad scalability

Recently I've been analyzing how my parallel computations actually speed up on a 16-core processor. The general pattern I found - the more threads you have, the less speed you get per core - is puzzling me. Here are the diagrams of my CPU load and processing speed:

[Figure: CPU load and processing speed vs. number of threads]

So, you can see that processor load increases, but speed increases much more slowly. I want to know why this effect takes place and how to track down the cause of the unscalable behaviour. I've made sure to use the server GC mode. I've also made sure that I'm parallelizing appropriate code, since the code does nothing more than:

  • Loads data from RAM (the server has 96 GB of RAM, so the swap file shouldn't be hit)
  • Performs fairly simple calculations
  • Stores data in RAM

I've profiled my application carefully and found no bottlenecks - it looks like each operation becomes slower as the number of threads grows.

I'm stuck, what's wrong with my scenario?

I use the .NET 4 Task Parallel Library.

asked Oct 11 '12 by Konstantin Chernov


4 Answers

You will always get this kind of curve; it's called Amdahl's law. The question is how soon it will level off.

You say you checked your code for bottlenecks; let's assume that's correct. Even then, there are still memory bandwidth limits and other hardware factors.

answered Oct 17 '22 by Henk Holterman


The key to linear scalability - in the sense that going from one core to two doubles the throughput - is to use shared resources as little as possible. This means:

  • don't use hyperthreading (the two hardware threads share the same core's resources)
  • tie every thread to a specific core (otherwise the OS will juggle threads between cores)
  • don't use more threads than there are cores (otherwise the OS will swap them in and out)
  • stay inside the core's own caches - nowadays the L1 & L2 caches
  • don't venture into the L3 cache or RAM unless it is absolutely necessary
  • minimize/economize on critical section/synchronization usage
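For the "tie every thread to a specific core" point, most platforms expose an affinity call. A minimal sketch, Linux-only and in Python purely for illustration (in .NET you'd look at something like `Process.ProcessorAffinity`, or per-thread affinity via P/Invoke):

```python
import os

def pin_to_core(core: int) -> None:
    # Linux-only: restrict the calling thread (0 = "this thread")
    # to a single core, so the OS scheduler stops migrating it
    # and its L1/L2 cache contents stay warm.
    os.sched_setaffinity(0, {core})

pin_to_core(0)
print(os.sched_getaffinity(0))   # now {0}
```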

If you've come this far you've probably profiled and hand-tuned your code too.

Thread pools are a compromise and not suited for uncompromising, high-performance applications. Total thread control is.

Don't worry about the OS scheduler. If your application is CPU-bound, with long computations that mostly do local L1 & L2 memory accesses, it's a better performance bet to tie each thread to its own core. Sure, the OS will come in, but compared to the work being performed by your threads the OS's work is negligible.

Also I should say that my threading experience is mostly from Windows NT-engine machines.

_______EDIT_______

Not all memory accesses have to do with data reads and writes (see the comment above). An often overlooked memory access is fetching the code to be executed. So my statement about staying inside the core's own caches implies making sure that ALL necessary data AND code reside in these caches. Remember also that even quite simple OO code may generate hidden calls to library routines. In this respect (the code-generation department), OO and interpreted code are a lot less WYSIWYG than, say, C (generally WYSIWYG) or, of course, assembly (totally WYSIWYG).

answered Oct 17 '22 by Olof Forshell


A general decrease in returns as more threads are added could indicate some kind of bottleneck.

Are there ANY shared resources, like a collection or queue, or are you using external functions that might depend on some limited resource?

The sharp break at 8 threads is interesting; in my comment I asked whether the CPU is a true 16-core, or an 8-core with hyperthreading where each core appears as 2 cores to the OS.

If it is hyperthreading, then either you have so much work that hyperthreading cannot double the performance of the core, or the memory pipe into the core cannot handle twice the data throughput.

Is the work performed by the threads even, or are some threads doing more than others? That could also indicate resource starvation.

Since you added that the threads query for data very often, that indicates a very large risk of waiting.

Is there any way to let the threads get more data each time? Like reading 10 items instead of one?
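The batching idea can be sketched like this (Python for illustration; with the TPL the analogue would be consuming work in chunks, e.g. via a custom partitioner): block for one item, then opportunistically drain more without waiting, so each synchronized queue access pays for several items instead of one.

```python
import queue

def get_batch(q: "queue.Queue", max_items: int = 10) -> list:
    """Take at least one item (blocking), then grab up to max_items
    total without waiting, amortizing per-item queue overhead."""
    batch = [q.get()]            # wait for at least one item
    while len(batch) < max_items:
        try:
            batch.append(q.get_nowait())   # never blocks
        except queue.Empty:
            break
    return batch

q = queue.Queue()
for i in range(25):
    q.put(i)
print([len(get_batch(q)) for _ in range(3)])   # [10, 10, 5]
```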

answered Oct 17 '22 by David Mårtensson


If you are doing memory-intensive work, you could be hitting cache capacity.

You could test this with a mock algorithm that just processes the same small bit of data over and over, so it all fits in the cache.

If it indeed is the cache, possible solutions could be making the threads work on the same data somehow (like different parts of a small data window), or just tweaking the algorithm to be more local (as in sorting: merge sort is generally slower than quick sort, but it is more cache friendly, which still makes it better in some cases).
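A rough version of that mock test, sketched in Python purely to show the shape of the experiment (in practice you'd write it inside your .NET code, since interpreter overhead in Python largely masks cache effects): time per-access cost over a buffer that fits in cache versus one that doesn't.

```python
import array
import time

def ns_per_access(size_bytes: int, accesses: int = 200_000) -> float:
    """Average nanoseconds to touch elements of a buffer of the
    given size. Small buffers stay cache-resident; large ones
    spill to RAM."""
    buf = array.array('b', bytes(size_bytes))
    n = len(buf)
    total = 0
    t0 = time.perf_counter()
    for i in range(accesses):
        total += buf[(i * 64) % n]   # 64-byte stride ~ one cache line
    dt = time.perf_counter() - t0
    return dt * 1e9 / accesses

# If the large buffer costs noticeably more per access than the small
# one, the workload is cache-capacity bound rather than CPU bound.
small = ns_per_access(32 * 1024)          # roughly L1-sized
large = ns_per_access(64 * 1024 * 1024)   # well past L3 on most parts
```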

answered Oct 17 '22 by hyde