I have being trying to write(update) in an array of doubles in a parallel way using OpenMP. The idea is that the element that needs to be updated can be updated more than once and the element itself is computed on-the-fly. This makes it very prone to race conditions unless I "lock" the memory region correspondent to the element that is being updated with an atomic operation. As the example below:
#include<omp.h>
#include<stdlib.h>
int main()
{
double *x=NULL, contribution = 1234.5;
size_t xlen=1000, i, n;
if ( (x = (double *) calloc(xlen,sizeof(double)) ) == NULL) exit(-1)
#pragma omp parallel for shared(x) private(n,contribution)
for (i=0; i<xlen; i++) {
n = find_n_on_the_fly(i);
#pragma omp atomic update
x[n]+=contribution; // in the more complicated case contributions are different
}
return 0;
}
I am running into race conditions with this approach still. I tried using critical sections but it completely kills me since the arrays are large and the number of updates is also large.
Question: what is wrong with this approach? Is there a better way of handling this?
Note: To handle this I am doing something stupid, by creating copies of the arrays for each one of the threads and reducing them later. But memory limitations don't allow me to go further.
To me, above code seems efficient.
However, you have two options to improve it:
1- Using a memory allocation function called _mm_malloc
with cache line size as one of the input instead of calloc
that you used. The biggest thing that you are facing right now is False Sharing. In order to omit the effects of FS, by using above method, you are basically forcing the underlying library to allocate memory (or your array) in a way that each element is residing in a line on cache. This way, threads will not fight over a cache line to retrieve two different mutual variables. In another words, it will prevent allocation of two variables in a cache line. This will boost performance in a multi-threaded program. Google _mm_malloc
for more info. However, following are the basic usage. In most modern computers, the cache line is 64.
#if defined(__INTEL_COMPILER)
#include <malloc.h>
#else
#include <mm_malloc.h>
#endif
int CPU_Cache_line_size = 64;
int xlen = 100;
double *ptr = _mm_malloc(xlen*sizeof(double), CPU_Cache_line_size);
/*
do something ...
*/
_mm_free(ptr);
__CPU_CACHE_LINE_SIZE__
can be queried for your system in following ways:
Linux command:
getconf LEVEL1_DCACHE_LINESIZE
Programatically:
int CPU_Cache_line_size = sysconf (_SC_LEVEL1_DCACHE_LINESIZE);
Detailed information regarding cache:
/sys/devices/system/cpu/cpu0/cache/
2- You mentioned this yourself but the limitations are severe for your situation: using an array per thread and then reduce them. This approach is more efficient. Consider again to use this approach if you can.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With