Avoiding False Sharing in OpenMP with arrays

I have started learning how to use OpenMP as part of a university course. As a lab exercise, we have been given a serial program which we need to parallelize.

One of the first things we were made aware of was the danger of false sharing, especially when it comes to updating arrays in parallel for loops.

However, I am finding it hard to transform the following snippet of code into a parallelizable task without causing false sharing:

int ii,kk;

double *uk = malloc(sizeof(double) * NX);
double *ukp1 = malloc(sizeof(double) * NX);
double *temp;

double dx = 1.0/(double)NX;   // grid spacing
double dt = 0.5*dx*dx;        // time step

// Initialise both arrays with values
init(uk, ukp1);

for(kk=0; kk<NSTEPS; kk++) {
   for(ii=1; ii<NX-1; ii++) {
      ukp1[ii] = uk[ii] + (dt/(dx*dx))*(uk[ii+1]-2*uk[ii]+uk[ii-1]);
   }

   // Swap pointers for the next time step
   temp = ukp1;
   ukp1 = uk;
   uk = temp;
   printValues(uk,kk);
}

My first reaction was to try out sharing ukp1:

for(kk=0; kk<NSTEPS; kk++) {
   #pragma omp parallel for shared(ukp1)
   for(ii=1; ii<NX-1; ii++) {
      ukp1[ii] = uk[ii] + (dt/(dx*dx))*(uk[ii+1]-2*uk[ii]+uk[ii-1]);
   }

   temp = ukp1;
   ukp1 = uk;
   uk = temp;
   printValues(uk,kk);
}

But this clearly shows a significant slowdown compared to the serial version. The obvious reason is that false sharing is occurring during some write operations to ukp1.

I was under the impression that maybe I could use the reduction clause, but I soon found out that it cannot be applied to arrays (array-section reductions in C only arrived later, in OpenMP 4.5).

Is there anything I can use to parallelize this code and improve the runtime? Is there a clause I can use which I have not heard of? Or is this the kind of task where I need to restructure the code to allow proper parallelization?

All forms of input would be greatly appreciated!

EDIT: It was pointed out to me that there was a mistake in my code. The code I have locally is correct; I just made an error when copying it here (which changed the structure of the code). Sorry for the confusion!

EDIT2:

Some information pointed out to me by @Sergey which I feel is useful:

  • Setting uk or ukp1 to private will essentially have the same effect as setting them to shared, because they are both just pointers to the same memory location

  • Using static scheduling should help in theory, but I am still experiencing the same slowdown. Also, I feel that static scheduling is not the most portable way of fixing this problem.

asked Oct 09 '13 by Michael Aquilina



2 Answers

Since we are talking about optimisations, first things first:

Define constants as macros; this allows for better optimisation by the compiler.

#define dx (1.0/(double)NX)
#define dt (0.5*dx*dx)

When working with OpenMP, the default sharing rule for variables is shared, though it is good practice to set the default to none and declare the sharing of every variable you need inside the parallel section manually. This way you can be certain that you avoid conflicts.

#pragma omp parallel for default(none) shared(ukp1, uk)

Also, setting ukp1 or uk to any sharing status will only pass the pointer into your parallel section, since you declared them as pointers. So the memory they point to is still shared.
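
A minimal sketch of what that means in practice (the firstprivate clause and the two-thread setup here are purely illustrative): each thread receives its own copy of the pointer, but both copies still refer to the same array.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    double *ukp1 = malloc(8 * sizeof(double));

    /* firstprivate copies the pointer itself, not the array behind it */
    #pragma omp parallel firstprivate(ukp1) num_threads(2)
    {
        ukp1[omp_get_thread_num()] = 1.0;  /* both threads write the same array */
        printf("thread %d holds a private pointer %p to shared memory\n",
               omp_get_thread_num(), (void *)ukp1);
    }

    free(ukp1);
    return 0;
}

Both threads print the same address, which is exactly why privatising the pointer cannot prevent false sharing on the data it points to.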

Lastly, to avoid false sharing you want to make sure that as few cache lines as possible are shared among threads. Cache lines that are only read (uk in your case) are irrelevant, since they can be replicated in every core's cache, but cache lines that are written (ukp1) should each be touched by only one thread. Today cache lines are normally 64 bytes long, so one cache line fits 8 doubles; you therefore want to assign chunks of at least 8 iterations per thread:

#pragma omp parallel for default(none) shared(ukp1, uk) schedule(static,8)

This will hand out the loop in chunks of 8 iterations per thread; the pragma should appear on the inner loop:

for(kk=0; kk<NSTEPS; kk++) {
   #pragma omp parallel for default(none) shared(ukp1, uk) schedule(static,8)
   for(ii=1; ii<NX-1; ii++) {
      ukp1[ii] = uk[ii] + (dt/(dx*dx))*(uk[ii+1]-2*uk[ii]+uk[ii-1]);
   }
   // Swap pointers for the next time step
   temp = ukp1;
   ukp1 = uk;
   uk   = temp;
}

In practice, depending on the size of your data, you may want to assign even larger chunk sizes. I tend to use 0x1000, so that on most systems each chunk covers whole pages (assuming the arrays are page-aligned).
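
The figure of 8 doubles per cache line assumes 64-byte lines. If you would rather not hard-code that, here is a sketch of querying the line size at runtime; note that _SC_LEVEL1_DCACHE_LINESIZE is a glibc/Linux extension and not portable:

#include <unistd.h>

/* Smallest chunk (in doubles) that covers one full cache line,
   falling back to 64 bytes if the query is unsupported. */
long chunk_for_doubles(void) {
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (line <= 0)
        line = 64;                       /* common cache-line size */
    return line / (long)sizeof(double);  /* e.g. 64 / 8 = 8 */
}

Since the chunk size in schedule(static, chunk) is an ordinary runtime expression, the result can be fed straight into the pragma.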

Edit:

In order for this to actually have an effect you need to have your memory correctly aligned. You are starting at index 1, so:

#include <malloc.h>   /* memalign() lives here on glibc */

double *uk   = memalign(0x40, sizeof(double) * (NX + 8));
double *ukp1 = memalign(0x40, sizeof(double) * (NX + 8));
uk   += 7;   /* shift by 7 doubles so that uk[1] starts a 64-byte line */
ukp1 += 7;

Now ukp1[1] is cache-line aligned. Increasing your chunk size may help, but unless you plan on having NX > 100000 there isn't much point in parallelising in the first place.
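
As memalign is a non-standard glibc function, a portable POSIX variant of the same trick is sketched below; the helper names alloc_aligned and free_aligned are made up for illustration, and the important detail is that free() must see the original, unshifted pointer:

#include <stdlib.h>

/* Allocate nx+8 doubles on a 64-byte boundary and shift by 7 doubles,
   so that element [1] of the returned pointer starts a cache line. */
static double *alloc_aligned(size_t nx) {
    void *p = NULL;
    if (posix_memalign(&p, 64, sizeof(double) * (nx + 8)) != 0)
        return NULL;
    return (double *)p + 7;
}

/* Undo the shift before handing the block back to the allocator. */
static void free_aligned(double *shifted) {
    free(shifted - 7);
}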

You need to keep in mind that you are getting quite a lot of overhead restarting the parallel section in each iteration. To get that under control you would need to rework your scheduling beyond simple OpenMP.

answered Nov 20 '22 by Sergey L.


I believe that @SergeyL. is right, and your code should look more like this:

// Parabolic 1D heat diffusion solved with an explicit method
for(kk=0; kk<NSTEPS; kk++) {
   for(ii=1; ii<NX-1; ii++) {
      ukp1[ii] = uk[ii] + (dt/(dx*dx))*(uk[ii+1]-2*uk[ii]+uk[ii-1]);
   }
   // Swap pointers for the next time step
   temp = ukp1;
   ukp1 = uk;
   uk   = temp;
}

That said, to avoid false sharing you must ensure that different threads do not operate on the same cache line. This indeed requires you to choose an appropriate scheduling and chunk size. The simplest solution that comes to mind is:

// Parabolic 1D heat diffusion solved with an explicit method
#pragma omp parallel private(kk)
{
   for(kk=0; kk<NSTEPS; kk++) {
      #pragma omp for schedule(static)
      for(ii=1; ii<NX-1; ii++) {
         ukp1[ii] = uk[ii] + (dt/(dx*dx))*(uk[ii+1]-2*uk[ii]+uk[ii-1]);
      }
      #pragma omp single
      {
         // Swap pointers for the next time step
         temp = ukp1;
         ukp1 = uk;
         uk   = temp;
      }
   } // outer for loop
} // pragma omp parallel
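
For completeness, here is a minimal self-contained driver built around this structure. The NX and NSTEPS values and the spike initial condition are assumptions for illustration only, and printValues from the question is replaced by a single sanity-check printf:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define NX     10000   /* assumed grid size */
#define NSTEPS 10000   /* assumed number of time steps */
#define dx (1.0/(double)NX)
#define dt (0.5*dx*dx)

int main(void) {
    double *uk   = malloc(sizeof(double) * NX);
    double *ukp1 = malloc(sizeof(double) * NX);
    double *temp;
    int ii, kk;

    /* Spike in the middle of the domain as the initial condition */
    for (ii = 0; ii < NX; ii++)
        uk[ii] = ukp1[ii] = (ii == NX/2) ? 1.0 : 0.0;

    double t0 = omp_get_wtime();
    #pragma omp parallel private(kk)
    {
        for (kk = 0; kk < NSTEPS; kk++) {
            #pragma omp for schedule(static)
            for (ii = 1; ii < NX-1; ii++)
                ukp1[ii] = uk[ii] + (dt/(dx*dx))*(uk[ii+1]-2*uk[ii]+uk[ii-1]);

            #pragma omp single
            {
                /* Swap pointers for the next time step */
                temp = ukp1;
                ukp1 = uk;
                uk   = temp;
            }
        }
    }
    printf("uk[NX/2] = %g after %d steps (%f s)\n",
           uk[NX/2], NSTEPS, omp_get_wtime() - t0);

    /* The swaps only exchange the two original allocations */
    free(uk);
    free(ukp1);
    return 0;
}

Built with, say, gcc -O2 -fopenmp, the omp_get_wtime() difference gives a quick way to compare runs under different OMP_NUM_THREADS settings.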
answered Nov 20 '22 by Massimiliano