I have blocks of memory that can be quite large (larger than the L2 cache), and sometimes I must set them all to zero. memset is fine in serial code, but what about parallel code? Does anybody have experience with whether calling memset from concurrent threads actually speeds things up for large arrays? Or even with using simple OpenMP parallel for loops?
OpenMP is an API for parallel programming in the SMP (symmetric multi-processing, or shared-memory) model. When programming with OpenMP, all threads share memory and data. OpenMP supports C, C++ and Fortran. The OpenMP functions are declared in a header file called omp.h.
The omp parallel sections directive effectively combines the omp parallel and omp sections directives. This directive lets you define a parallel region containing a single sections directive in one step.
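As an illustration, here is a minimal C sketch of the combined directive; the two task functions are placeholders of my own, not something from the original text:

    #include <omp.h>
    #include <stdio.h>

    /* Hypothetical independent tasks, used only for illustration. */
    static void task_a(void) { printf("task A on thread %d\n", omp_get_thread_num()); }
    static void task_b(void) { printf("task B on thread %d\n", omp_get_thread_num()); }

    int main(void)
    {
        /* One combined directive instead of nesting "omp sections"
           inside a separate "omp parallel" region. */
        #pragma omp parallel sections
        {
            #pragma omp section
            task_a();

            #pragma omp section
            task_b();
        }
        return 0;
    }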
If execution of a thread terminates while inside a parallel region, execution of all threads in all teams terminates. The order of termination of threads is unspecified. All work done by a team prior to any barrier that the team has passed in the program is guaranteed to be complete.
People in HPC usually say that one thread is rarely enough to saturate a single memory link, and the same is usually true for network links as well. Here is a quick and dirty OpenMP-enabled memsetter that fills 2 GiB of memory with zeros twice.
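The original code is not preserved in this copy, so the following is only a minimal sketch of such a benchmark, under the assumption that each thread zeroes its own contiguous chunk with memset(3) and that bandwidth is measured separately for the first touch and the rewrite (all names, the 2 GiB size, and the omp_get_wtime() timing are mine):

    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Zero "len" bytes of "buf" with all threads of the team:
       each thread memsets its own contiguous chunk. */
    static void zero_parallel(char *buf, size_t len)
    {
        #pragma omp parallel
        {
            int nth = omp_get_num_threads();
            int tid = omp_get_thread_num();
            size_t chunk = (len + nth - 1) / nth;
            size_t start = (size_t)tid * chunk;
            if (start < len)
                memset(buf + start, 0,
                       (start + chunk <= len) ? chunk : len - start);
        }
    }

    int main(void)
    {
        size_t len = 2UL << 30;               /* 2 GiB */
        char *buf = malloc(len);
        if (!buf) return 1;

        double t = omp_get_wtime();
        zero_parallel(buf, len);              /* 1st touch: pages get mapped */
        double t1 = omp_get_wtime() - t;

        t = omp_get_wtime();
        zero_parallel(buf, len);              /* rewrite: pages already mapped */
        double t2 = omp_get_wtime() - t;

        printf("1st touch: %10.3f MB/s\n", len / t1 / 1e6);
        printf("rewrite:   %10.3f MB/s\n", len / t2 / 1e6);

        free(buf);
        return 0;
    }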
And here are the results using GCC 4.7 (code compiled with -O3 -mtune=native -fopenmp) with different numbers of threads on different architectures (max values from several runs reported):
Quad-socket Intel Xeon X7350 - pre-Nehalem quad-core CPU with separate memory controller and Front Side Bus
single socket
threads 1st touch rewrite
1 1452.223 MB/s 3279.745 MB/s
2 1541.130 MB/s 3227.216 MB/s
3 1502.889 MB/s 3215.992 MB/s
4 1468.931 MB/s 3201.481 MB/s
(1st touch is slow since the thread team is being created from scratch and the operating system is mapping physical pages into the virtual address space reserved by malloc(3).)
One thread already saturates the memory bandwidth of a single CPU <-> NB link. (NB = North Bridge)
1 thread per socket
threads 1st touch rewrite
1 1455.603 MB/s 3273.959 MB/s
2 2824.883 MB/s 5346.416 MB/s
3 3979.515 MB/s 5301.140 MB/s
4 4128.784 MB/s 5296.082 MB/s
Two threads are necessary to saturate the full memory bandwidth of the NB <-> memory link.
Octo-socket Intel Xeon X7550 - 8-way NUMA system with octo-core CPUs (CMT disabled)
single socket
threads 1st touch rewrite
1 1469.897 MB/s 3435.087 MB/s
2 2801.953 MB/s 6527.076 MB/s
3 3805.691 MB/s 9297.412 MB/s
4 4647.067 MB/s 10816.266 MB/s
5 5159.968 MB/s 11220.991 MB/s
6 5330.690 MB/s 11227.760 MB/s
At least 5 threads are necessary in order to saturate the bandwidth of one memory link.
1 thread per socket
threads 1st touch rewrite
1 1460.012 MB/s 3436.950 MB/s
2 2928.678 MB/s 6866.857 MB/s
3 4408.359 MB/s 10301.129 MB/s
4 5859.548 MB/s 13712.755 MB/s
5 7276.209 MB/s 16940.793 MB/s
6 8760.900 MB/s 20252.937 MB/s
Bandwidth scales almost linearly with the number of threads. Based on the single-socket observations one could say that at least 40 threads distributed as 5 threads per socket would be necessary in order to saturate all of the eight memory links.
The basic problem on NUMA systems is the first-touch memory policy: memory is allocated on the NUMA node where the thread that first touches a virtual address within a given page executes. Thread pinning (binding to specific CPU cores) is essential on such systems, since thread migration leads to remote access, which is slower. Support for pinning is available in most OpenMP runtimes: GCC with its libgomp has the GOMP_CPU_AFFINITY environment variable, Intel has the KMP_AFFINITY environment variable, etc. Also, OpenMP 4.0 introduced the vendor-neutral concept of places.
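As a hedged illustration of first touch plus binding (not the code used for the measurements above), the sketch below faults the pages in with the same static schedule that later processes them and uses the OpenMP 4.0 proc_bind clause; the array size and the schedule are arbitrary choices of mine:

    #include <omp.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t n = 1UL << 27;                      /* 128 Mi doubles (1 GiB), arbitrary */
        double *a = malloc(n * sizeof *a);
        if (!a) return 1;

        /* First touch in parallel: with a static schedule each page is
           faulted in by the thread (and hence the NUMA node) that will
           work on it later. proc_bind(spread) asks the runtime to keep
           the threads on distinct places (OpenMP 4.0). */
        #pragma omp parallel for schedule(static) proc_bind(spread)
        for (size_t i = 0; i < n; i++)
            a[i] = 0.0;

        /* Subsequent parallel work with the same schedule now touches
           mostly node-local memory. */
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < n; i++)
            a[i] += 1.0;

        free(a);
        return 0;
    }

With a 4.0-aware runtime the same binding can be requested externally via OMP_PROC_BIND and OMP_PLACES, or with GOMP_CPU_AFFINITY / KMP_AFFINITY on the GCC / Intel runtimes mentioned above.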
Edit: For completeness, here are the results of running the code with a 1 GiB array on a MacBook Air with an Intel Core i5-2557M (dual-core Sandy Bridge CPU with HT and QPI). The compiler is GCC 4.2.1 (Apple LLVM build):
threads 1st touch rewrite
1 2257.699 MB/s 7659.678 MB/s
2 3282.500 MB/s 8157.528 MB/s
3 4109.371 MB/s 8157.335 MB/s
4 4591.780 MB/s 8141.439 MB/s
Why this high speed with even a single thread? A little exploration with gdb shows that memset(buf, 0, len) gets translated by the OS X compiler to bzero(buf, len), and that an SSE4.2-enabled vectorised version by the name of bzero$VARIANT$sse42 is provided by libc.dylib and used at run-time. It uses the MOVDQA instruction to zero 16 bytes of memory at once. That's why even with one thread the memory bandwidth is almost saturated. A single-threaded AVX-enabled version using VMOVDQA can zero 32 bytes at once and probably saturate the memory link.
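For illustration only, here is a rough approximation of what such a vectorised zeroing routine does, written with SSE2 intrinsics (the real bzero$VARIANT$sse42 is hand-tuned assembly with alignment and tail handling; this sketch assumes buf is 16-byte aligned and len is a multiple of 16):

    #include <emmintrin.h>   /* SSE intrinsics */
    #include <stddef.h>

    /* Simplified illustration: zero an aligned buffer 16 bytes at a time.
       _mm_store_si128 on an aligned pointer compiles to MOVDQA
       (or VMOVDQA when AVX is enabled). */
    static void bzero_sse_sketch(void *buf, size_t len)
    {
        __m128i zero = _mm_setzero_si128();
        char *p = buf;
        for (size_t i = 0; i + 16 <= len; i += 16)
            _mm_store_si128((__m128i *)(p + i), zero);   /* assumes 16-byte alignment */
    }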
The important message here is that sometimes vectorisation and multithreading are not orthogonal in bringing speed-up to the operation.
Well, there's always the L3 cache...
However, it's very likely that this will be bound by main memory bandwidth already; adding more parallelism is unlikely to improve things.