 

Why would parallelization decrease performance so dramatically?

I have an OpenMP program (thousands of lines, impossible to reproduce here) that works as follows:

It consists of worker threads along with a task queue.
A task consists of a convolution; every time a worker thread pops off a task from the work queue, it performs the required convolution and optionally pushes more convolutions onto the queue.
(There is no specific "master" thread; all workers are equal.)
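For concreteness, a minimal sketch of that structure (a hypothetical reconstruction, since the real code can't be posted; the task contents and the termination logic are placeholders):

```cpp
#include <omp.h>
#include <queue>

// Hypothetical sketch of the structure described above, not the actual code:
// a shared queue of convolution tasks, drained by equal worker threads.
struct Task { int params; /* parameters of one convolution */ };

std::queue<Task> work_queue;
omp_lock_t queue_lock;

void run_workers(int num_threads) {
    omp_init_lock(&queue_lock);
    #pragma omp parallel num_threads(num_threads)
    {
        for (;;) {
            Task t{};
            bool have_task = false;
            omp_set_lock(&queue_lock);
            if (!work_queue.empty()) {     // pop the next convolution
                t = work_queue.front();
                work_queue.pop();
                have_task = true;
            }
            omp_unset_lock(&queue_lock);
            if (!have_task) break;         // simplified termination; real code
                                           // must handle in-flight producers
            // ... perform the convolution; may push new tasks (under the lock) ...
        }
    }
    omp_destroy_lock(&queue_lock);
}
```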

When I run this program on my own machine (4-core HT non-NUMA Core i7), the running times I get are:

(#threads: running time)
 1: 5374 ms
 2: 2830 ms
 3: 2147 ms
 4: 1723 ms
 5: 1379 ms
 6: 1281 ms
 7: 1217 ms
 8: 1179 ms

This makes sense.

However, when I run it on a NUMA 48-core AMD Opteron 6168 machine, I get these running times:

 1: 9252 ms
 2: 5101 ms
 3: 3651 ms
 4: 2821 ms
 5: 2364 ms
 6: 2062 ms
 7: 1954 ms
 8: 1725 ms
 9: 1564 ms
10: 1513 ms
11: 1508 ms
12: 1796 ms  <------ why did it get worse?
13: 1718 ms
14: 1765 ms
15: 2799 ms  <------ why did it get *so much* worse?
16: 2189 ms
17: 3661 ms
18: 3967 ms
19: 4415 ms
20: 3089 ms
21: 5102 ms
22: 3761 ms
23: 5795 ms
24: 4202 ms

These results are pretty consistent across runs; it's not an artifact of load on the machine.
So I don't understand:
What could cause the performance to drop so much after 12 cores?

I would understand if the performance saturated at some level (I could blame it on limited memory bandwidth), but I don't understand how it can drop from 1508 ms to 5795 ms by adding more threads.

How is this possible?

asked Feb 12 '14 by user541686




1 Answer

These sorts of situations can be quite hard to figure out. One key is to look at memory locality. Without seeing your code, it's impossible to say EXACTLY what is going wrong, but we can discuss some of the things that make multithreading "less good":

In any NUMA system, when memory is physically attached to processor X but the code accessing it runs on processor Y (where X and Y are different processors), every memory access pays a remote-access penalty. So allocating memory on the right NUMA node will certainly help. (This may require special code, such as setting affinity masks, or at least hinting to the OS/runtime that you want NUMA-aware allocations.) At the very least, make sure you don't simply work on one large array that was allocated by the first thread before all the other threads were started.
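On Linux the default policy is "first touch": a physical page lands on the NUMA node of the thread that first writes it. A minimal sketch of exploiting that (assuming threads are pinned, e.g. with OMP_PROC_BIND=true, so they stay near the pages they touched):

```cpp
#include <omp.h>
#include <cstdlib>

// malloc of a large block typically returns pages that are not yet
// physically placed. Touching them in parallel, with the same static
// schedule the compute loops will use, puts each chunk of the array on
// the NUMA node of the thread that will later work on it.
double* numa_friendly_alloc(std::size_t n) {
    double* a = static_cast<double*>(std::malloc(n * sizeof(double)));
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i)
        a[i] = 0.0;                // first touch happens here, per thread
    return a;
}
```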

Another thing that is even worse is sharing or false sharing of memory: if two or more processors use the same cache line, you get a ping-pong match between them, where each processor in turn says "I want the memory at address A", grabs exclusive ownership of the cache line, updates it, and then the next processor does the same thing.
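A classic illustration of false sharing (hypothetical, not from the question's code): per-thread counters packed into one array land in the same cache line, and padding them apart removes the ping-pong.

```cpp
#include <omp.h>

// False sharing: adjacent 8-byte counters share a 64-byte cache line, so
// every increment by one thread invalidates the line for the others.
struct Bad { long count; };

// Fix: pad each counter out to its own cache line.
struct alignas(64) Good { long count; };

void demo(int nthreads, long iters) {
    Bad  bad[64]  = {};
    Good good[64] = {};
    #pragma omp parallel num_threads(nthreads)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < iters; ++i) bad[id].count++;   // line ping-pongs
        for (long i = 0; i < iters; ++i) good[id].count++;  // stays local
    }
}
```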

The fact that the results get bad right at 12 threads suggests it's to do with sockets: either you are sharing data, or the data is located on the wrong node. An Opteron 6168 has 12 cores, so your 48-core machine is presumably a four-socket box; at more than 12 threads you likely start using the second socket (more), which makes these sorts of problems more apparent.
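A quick way to check where your threads actually land is to print each thread's CPU (a diagnostic sketch; sched_getcpu() is Linux/glibc-specific):

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE        // for sched_getcpu() on glibc
#endif
#include <omp.h>
#include <sched.h>
#include <cstdio>

// Prints which hardware CPU each OpenMP thread runs on, so you can see
// when threads start spilling onto a second socket. Run with e.g.
// OMP_PROC_BIND=close OMP_PLACES=cores to keep the placement stable.
int main() {
    #pragma omp parallel
    {
        #pragma omp critical
        std::printf("thread %d on cpu %d\n",
                    omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}
```

Running the whole program under numactl --cpunodebind=0 --membind=0 is another quick way to gauge how much remote memory traffic is costing you.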

For best performance, you need memory allocated on the local node, no sharing, and no locking. Your first set of results also doesn't look "ideal". I have some (absolutely non-sharing) code that scales exactly n times faster with n processors, until I run out of processors (unfortunately my machine has only 4 cores, so the gain isn't huge, but it's still 4x better than 1 core, and if I ever got my hands on a 48- or 64-core machine, it would produce results 48x or 64x faster when calculating "weird numbers").

Edit:

The "Socket issue" is two things:

  1. Memory locality: memory is physically attached to each socket, so if the memory was allocated from a region belonging to a different socket, every read incurs extra latency.

  2. Cache/sharing: within a processor there are "fast" links for sharing data (and often a shared last-level cache, e.g. the L3 cache), which let the cores within a socket share data much more efficiently than they can with cores in a different socket.

All this amounts to something like servicing cars when you don't have your own toolbox: every time you need a tool, you have to ask the colleague next to you for a screwdriver, a 15mm spanner, or whatever else, and then hand the tools back when your work area gets a bit full. It's not a very efficient way of working. It would be much better to have tools of your own (at least the common ones: one of those special spanners you only use once a month isn't a big deal, but your everyday 10, 12 and 15mm spanners and a few screwdrivers certainly are). And it gets even worse if there are four mechanics all sharing the same toolbox. This is the case where you have all memory allocated on one node in a four-socket system.

Now imagine that you have a "box of spanners" and only one mechanic can use it at a time, so if you need the 12mm spanner, you have to wait for the guy next to you to finish with the 15mm one. This is what happens with false cache-sharing: the processors aren't really using the same value, but because there is more than one "thing" in the cache line, they end up sharing the cache line (the box of spanners).

answered Oct 06 '22 by Mats Petersson