 

In an OpenMP parallel code, would there be any benefit for memset to be run in parallel?

I have blocks of memory that can be quite large (larger than the L2 cache), and sometimes I have to set them all to zero. memset works well in serial code, but what about parallel code? Does anybody have experience whether calling memset from concurrent threads actually speeds things up for large arrays? Or even using simple OpenMP parallel for loops?

asked Jul 20 '12 by nat chouf


2 Answers

People in HPC usually say that a single thread is not enough to saturate a single memory link, and the same is usually true for network links as well. Here is a quick and dirty OpenMP-enabled memsetter I wrote for you that fills 2 GiB of memory with zeros, twice. Below are the results obtained with GCC 4.7 using different numbers of threads on different architectures (maximum values over several runs reported):
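
The memsetter itself was only linked from the original answer; the following is a minimal sketch of what such a benchmark could look like, assuming a simple chunk-per-thread decomposition (file name, chunking scheme and output format are illustrative, not the original code):

/* memsetter.c - zero a 2 GiB buffer twice and report the "1st touch" and
 * "rewrite" bandwidths. Compile with: gcc -O3 -mtune=native -fopenmp
 * and control the thread count with OMP_NUM_THREADS. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIZE (2UL * 1024 * 1024 * 1024)   /* 2 GiB */

/* Zero len bytes of buf in parallel: each thread memsets its own chunk. */
static void parallel_memset(char *buf, size_t len)
{
    #pragma omp parallel
    {
        size_t nth   = (size_t)omp_get_num_threads();
        size_t tid   = (size_t)omp_get_thread_num();
        size_t chunk = (len + nth - 1) / nth;
        size_t start = tid * chunk;
        if (start < len)
            memset(buf + start, 0, (len - start < chunk) ? len - start : chunk);
    }
}

int main(void)
{
    char *buf = malloc(SIZE);             /* physical pages not mapped yet */
    if (!buf) return 1;

    double t0 = omp_get_wtime();
    parallel_memset(buf, SIZE);           /* 1st touch: OS maps the pages */
    double t1 = omp_get_wtime();
    parallel_memset(buf, SIZE);           /* rewrite: pages already mapped */
    double t2 = omp_get_wtime();

    printf("threads=%d  1st touch %.3f MB/s  rewrite %.3f MB/s\n",
           omp_get_max_threads(),
           SIZE / (t1 - t0) / 1e6,
           SIZE / (t2 - t1) / 1e6);

    free(buf);
    return 0;
}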

GCC 4.7, code compiled with -O3 -mtune=native -fopenmp:

Quad-socket Intel Xeon X7350 - pre-Nehalem quad-core CPU with separate memory controller and Front Side Bus

single socket

threads   1st touch      rewrite
1         1452.223 MB/s  3279.745 MB/s
2         1541.130 MB/s  3227.216 MB/s
3         1502.889 MB/s  3215.992 MB/s
4         1468.931 MB/s  3201.481 MB/s

(1st touch is slow as the thread team is being created from scratch and the operating system is mapping physical pages into the virtual address space reserved by malloc(3))

One thread already saturates the memory bandwidth of a single CPU <-> NB link. (NB = North Bridge)

1 thread per socket

threads   1st touch      rewrite
1         1455.603 MB/s  3273.959 MB/s
2         2824.883 MB/s  5346.416 MB/s
3         3979.515 MB/s  5301.140 MB/s
4         4128.784 MB/s  5296.082 MB/s

Two threads are necessary to saturate the full memory bandwidth of the NB <-> memory link.

Octo-socket Intel Xeon X7550 - 8-way NUMA system with octo-core CPUs (CMT disabled)

single socket

threads   1st touch      rewrite
1         1469.897 MB/s  3435.087 MB/s
2         2801.953 MB/s  6527.076 MB/s
3         3805.691 MB/s  9297.412 MB/s
4         4647.067 MB/s  10816.266 MB/s
5         5159.968 MB/s  11220.991 MB/s
6         5330.690 MB/s  11227.760 MB/s

At least 5 threads are necessary in order to saturate the bandwidth of one memory link.

1 thread per socket

threads   1st touch      rewrite
1         1460.012 MB/s  3436.950 MB/s
2         2928.678 MB/s  6866.857 MB/s
3         4408.359 MB/s  10301.129 MB/s
4         5859.548 MB/s  13712.755 MB/s
5         7276.209 MB/s  16940.793 MB/s
6         8760.900 MB/s  20252.937 MB/s

Bandwidth scales almost linearly with the number of threads. Based on the single-socket observations one could say that at least 40 threads distributed as 5 threads per socket would be necessary in order to saturate all of the eight memory links.

The basic problem on NUMA systems is the first-touch memory policy: memory is allocated on the NUMA node where the thread that first touches a virtual address within a given page is executing. Thread pinning (binding threads to specific CPU cores) is essential on such systems, since thread migration leads to remote access, which is slower. Support for pinning is available in most OpenMP runtimes: GCC's libgomp has the GOMP_CPU_AFFINITY environment variable, Intel has the KMP_AFFINITY environment variable, etc. Also, OpenMP 4.0 introduced the vendor-neutral concept of places.
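
As an illustration (not part of the original answer), a NUMA-friendly pattern is to perform the first touch with the same static distribution that the later compute loops will use, and to run the program pinned, e.g. with GOMP_CPU_AFFINITY (libgomp), KMP_AFFINITY (Intel), or the OpenMP 4.0 OMP_PLACES / OMP_PROC_BIND variables:

/* firsttouch.c - sketch of first-touch initialisation matching the compute
 * distribution; array size and names are illustrative.
 * Compile with: gcc -O3 -fopenmp */
#include <stdlib.h>

int main(void)
{
    size_t n = 1UL << 27;                  /* 128 Mi doubles = 1 GiB */
    double *a = malloc(n * sizeof *a);
    if (!a) return 1;

    /* First touch: each thread maps the pages of its own chunk onto the
     * NUMA node it is pinned to. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        a[i] = 0.0;

    /* Compute phase with the same schedule => mostly node-local accesses. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        a[i] = 2.0 * a[i] + 1.0;

    free(a);
    return 0;
}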

Edit: For completeness, here are the results of running the code with a 1 GiB array on a MacBook Air with an Intel Core i5-2557M (dual-core Sandy Bridge CPU with HT and QPI). The compiler is GCC 4.2.1 (Apple LLVM build).

threads   1st touch      rewrite
1         2257.699 MB/s  7659.678 MB/s
2         3282.500 MB/s  8157.528 MB/s
3         4109.371 MB/s  8157.335 MB/s
4         4591.780 MB/s  8141.439 MB/s

Why such high speed even with a single thread? A little exploration with gdb shows that memset(buf, 0, len) gets translated by the OS X compiler to bzero(buf, len) and that an SSE4.2-enabled vectorised version by the name of bzero$VARIANT$sse42 is provided by libc.dylib and used at run-time. It uses the MOVDQA instruction to zero 16 bytes of memory at once. That's why even with one thread the memory bandwidth is almost saturated. A single-threaded AVX-enabled version using VMOVDQA can zero 32 bytes at once and probably saturate the memory link.
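
For illustration (this is not the libc implementation), the core of such a vectorised zeroing routine boils down to aligned stores of a zeroed vector register; the AVX variant below writes 32 bytes per VMOVDQA store, while an SSE variant would use 16-byte MOVDQA stores. It assumes a 32-byte aligned buffer whose length is a multiple of 32, and compiles with gcc -mavx:

/* avx_zero.c - sketch of a vectorised zeroing routine using 32-byte stores. */
#include <immintrin.h>
#include <stdlib.h>

static void avx_zero(void *buf, size_t len)
{
    const __m256i zero = _mm256_setzero_si256();
    char *p = buf;
    for (size_t i = 0; i < len; i += 32)
        _mm256_store_si256((__m256i *)(p + i), zero);  /* VMOVDQA store */
}

int main(void)
{
    size_t len = 1UL << 20;                 /* 1 MiB */
    void *buf = aligned_alloc(32, len);     /* C11 aligned allocation */
    if (buf) { avx_zero(buf, len); free(buf); }
    return 0;
}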

The important message here is that sometimes vectorisation and multithreading are not orthogonal in bringing speed-up to the operation.

answered Oct 08 '22 by Hristo Iliev


Well, there's always the L3 cache...

However, it's very likely that this will be bound by main memory bandwidth already; adding more parallelism is unlikely to improve things.

answered Oct 08 '22 by Oliver Charlesworth