A paper by Donzis & Aditya suggests that it is possible to use a finite difference scheme with a delay in the stencil. What does this mean? An FD scheme for the heat equation (or some simplification of it) might read
u[t+1,i] = u[t,i] + c (u[t,i-1]-u[t,i+1])
meaning that the value at the next time step depends on the values at the same position and its neighbours at the previous time step.
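For reference (this is the textbook FTCS discretisation, not a formula taken from the paper; the full code further down uses the same update), the explicit scheme for the 1D heat equation would read
u[t+1,i] = u[t,i] + c (u[t,i-1] - 2 u[t,i] + u[t,i+1]), with c = kappa*dt/dx^2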
This problem can easily be parallelized by splitting the (in our case 1D) domain across the different processors. However, we need communication when computing the boundary nodes of a processor, since the element u[t,i+-1] is only available on another processor.
The problem is illustrated in the following graphic, which is taken from the cited paper.
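For concreteness, the layout behind the snippets below looks roughly like this (a sketch; NpP, LEFT and RIGHT are the names used in the code further down, N is the total number of grid points):

int NpP = N / size;                              /* grid points owned by this rank                */
double *u = malloc((NpP + 2) * sizeof(double));  /* +2 ghost cells: u[0] and u[NpP+1]             */
int LEFT  = rank - 1;                            /* rank owning the element left of u[1]          */
int RIGHT = rank + 1;                            /* rank owning the element right of u[NpP]       */
if (LEFT  < 0)     LEFT  = MPI_PROC_NULL;        /* physical boundary: no left neighbour          */
if (RIGHT >= size) RIGHT = MPI_PROC_NULL;        /* physical boundary: no right neighbour         */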
An MPI implementation might use MPI_Send and MPI_Recv for the synchronous computation.
Since the computation itself is fairly cheap, it is the communication that might become the bottleneck.
A solution to the problem is given in the paper:
Instead of a synchronous process, just take the boundary value that is available, even though it might be the value of an earlier time step. The method then still converges (under some assumptions).
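In the notation from above, this means the update may use a stale neighbour value (my reading of the idea, not a formula copied from the paper):
u[t+1,i] = u[t,i] + c (u[t-k,i-1] - u[t,i+1])
where k >= 0 is the delay of the value last received from the neighbouring processor.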
For my work, I would like to implement the asynchronous MPI case (which is not part of the paper). The synchronous part using MPI_Send and MPI_Recv is working correctly. I extended the memory by two elements as ghost cells for the neighbouring elements and send the needed values via send and receive. The code below is basically the implementation of the figure above and is performed during each time step prior to the computation.
/* exchange ghost cells with the neighbouring ranks (blocking send/receive) */
MPI_Send(&u[NpP],  1,MPI_DOUBLE,RIGHT,rank, MPI_COMM_WORLD);                    /* send right boundary value to RIGHT  */
MPI_Recv(&u[0],    1,MPI_DOUBLE,LEFT, LEFT, MPI_COMM_WORLD,MPI_STATUS_IGNORE);  /* receive left ghost cell from LEFT   */
MPI_Send(&u[1],    1,MPI_DOUBLE,LEFT, rank, MPI_COMM_WORLD);                    /* send left boundary value to LEFT    */
MPI_Recv(&u[NpP+1],1,MPI_DOUBLE,RIGHT,RIGHT,MPI_COMM_WORLD,MPI_STATUS_IGNORE);  /* receive right ghost cell from RIGHT */
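(As a side note, a sketch of an alternative rather than the code I actually ran: the same exchange can be written with MPI_Sendrecv, which avoids having to think about the send/receive ordering, assuming LEFT and RIGHT are set to MPI_PROC_NULL at the physical domain ends.)

/* send right boundary value to RIGHT, receive left ghost cell from LEFT */
MPI_Sendrecv(&u[NpP], 1, MPI_DOUBLE, RIGHT, 0,
             &u[0],   1, MPI_DOUBLE, LEFT,  0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
/* send left boundary value to LEFT, receive right ghost cell from RIGHT */
MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, LEFT,  1,
             &u[NpP+1], 1, MPI_DOUBLE, RIGHT, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);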
Now, I'm by no means an MPI expert. I figured out that MPI_Put might be what I need for the asynchronous case, and after reading a little bit I came up with the following implementation.
Before the time loop:
MPI_Win win;
double *boundary;
/* allocate the two ghost-cell slots that the neighbouring ranks will write into */
MPI_Alloc_mem(sizeof(double) * 2, MPI_INFO_NULL, &boundary);
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info,"no_locks","true");   /* hint: passive-target locks will not be used */
/* expose "boundary" as an RMA window (2 doubles, displacement unit = one double) */
MPI_Win_create(boundary, 2*sizeof(double), sizeof(double), info, MPI_COMM_WORLD, &win);
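(A side note, an alternative I did not use above: MPI_Win_allocate combines the memory allocation and the window creation in one call, which some MPI implementations can handle more efficiently.)

MPI_Win win;
double *boundary;
/* allocate two doubles and expose them as an RMA window in one step */
MPI_Win_allocate(2 * sizeof(double), sizeof(double), MPI_INFO_NULL,
                 MPI_COMM_WORLD, &boundary, &win);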
Inside the time loop:
/* write the local boundary values into the neighbours' windows */
MPI_Put(&u[1],  1,MPI_DOUBLE,LEFT, 1,1,MPI_DOUBLE,win);  /* becomes boundary[1] on LEFT (its right ghost cell) */
MPI_Put(&u[NpP],1,MPI_DOUBLE,RIGHT,0,1,MPI_DOUBLE,win);  /* becomes boundary[0] on RIGHT (its left ghost cell) */
MPI_Win_fence(0,win);   /* complete the epoch: all puts are now visible */
/* copy the received ghost-cell values into the local array */
u[0] = boundary[0];
u[NpP+1] = boundary[1];
This puts the needed elements into the window, namely boundary (an array with two elements), on the neighbouring processors, and takes the values u[0] and u[NpP+1] from the local boundary array itself.
This implementation is working and I get the same result as with MPI_Send/Recv. However, this isn't really asynchronous, since I'm still using MPI_Win_fence, which, as far as I understood, ensures synchronization.
The problem is: if I take out the MPI_Win_fence, the values inside boundary are never updated and keep their initial values. My understanding was that without MPI_Win_fence you would take whatever value is currently available inside boundary, which might (or might not) have been updated by a neighbouring processor.
Does anybody have an idea how to avoid the use of MPI_Win_fence while also solving the problem that the values inside boundary are never updated?
I'm also not sure if the code I provided is enough to understand my problem or to give any hints. If that is the case, feel free to ask, as I will try to add all the parts that are missing.
The following seems to work for me, in the sense of correct execution: a small 1D heat equation taken from one of our tutorials, using the following for the RMA part:
/* push this rank's boundary values into the neighbours' ghost-cell windows */
MPI_Win_lock( MPI_LOCK_EXCLUSIVE, left, 0, rightwin );
MPI_Put(&(temperature[current][1]), 1, MPI_FLOAT, left, 0, 1, MPI_FLOAT, rightwin);
MPI_Win_unlock( left, rightwin );

MPI_Win_lock( MPI_LOCK_EXCLUSIVE, right, 0, leftwin );
MPI_Put(&(temperature[current][locpoints]), 1, MPI_FLOAT, right, 0, 1, MPI_FLOAT, leftwin);
MPI_Win_unlock( right, leftwin );

/* read whatever the neighbours have deposited so far into the local ghost cells */
MPI_Win_lock( MPI_LOCK_EXCLUSIVE, rank, 0, leftwin );
temperature[current][0] = *leftgc;
MPI_Win_unlock( rank, leftwin );

MPI_Win_lock( MPI_LOCK_EXCLUSIVE, rank, 0, rightwin );
temperature[current][locpoints+1] = *rightgc;
MPI_Win_unlock( rank, rightwin );
In the code I have even ranks wait an extra 10 ms each time step to try to make sure that things get out of sync; but looking at traces, things actually remain pretty synced up. I don't know whether that high degree of synchrony can be fixed by tweaking the code, whether it is a restriction of the implementation (IntelMPI 5.0.1), or whether it just happens because the time spent in computation is too small and communication dominates (but as to the last, cranking up the usleep interval doesn't seem to have an effect).
#define _BSD_SOURCE /* usleep */
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>
int main(int argc, char **argv) {
/* simulation parameters */
const int totpoints=1000;
int locpoints;
const float xleft = -12., xright = +12.;
float locxleft, locxright;
const float kappa = 1.;
const int nsteps=100;
/* data structures */
float *x;
float **temperature;
/* parameters of the original temperature distribution */
const float ao=1., sigmao=1.;
float fixedlefttemp, fixedrighttemp;
int current, new;
int step, i;
float time;
float dt, dx;
float rms;
int rank, size;
int start,end;
int left, right;
int lefttag=1, righttag=2;
/* MPI Initialization */
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD,&size);
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
locpoints = totpoints/size;
start = rank*locpoints;
end = (rank+1)*locpoints - 1;
if (rank == size-1)
end = totpoints-1;
locpoints = end-start+1;
left = rank-1;
if (left < 0) left = MPI_PROC_NULL;
right= rank+1;
if (right >= size) right = MPI_PROC_NULL;
#ifdef ONESIDED
if (rank == 0)
printf("Onesided: Allocating windows\n");
MPI_Win leftwin, rightwin;
float *leftgc, *rightgc;
MPI_Win_allocate(sizeof(float), sizeof(float), MPI_INFO_NULL, MPI_COMM_WORLD, &leftgc, &leftwin);
MPI_Win_allocate(sizeof(float), sizeof(float), MPI_INFO_NULL, MPI_COMM_WORLD, &rightgc, &rightwin);
#endif
/* set parameters */
dx = (xright-xleft)/(totpoints-1);
dt = dx*dx * kappa/10.;
locxleft = xleft + start*dx;
locxright = xleft + end*dx;
x = (float *)malloc((locpoints+2)*sizeof(float));
temperature = (float **)malloc(2 * sizeof(float *));
temperature[0] = (float *)malloc((locpoints+2)*sizeof(float));
temperature[1] = (float *)malloc((locpoints+2)*sizeof(float));
current = 0;
new = 1;
/* setup initial conditions */
time = 0.;
for (i=0; i<locpoints+2; i++) {
x[i] = locxleft + (i-1)*dx;
temperature[current][i] = ao*exp(-(x[i]*x[i]) / (2.*sigmao*sigmao));
}
fixedlefttemp = ao*exp(-(locxleft-dx)*(locxleft-dx) / (2.*sigmao*sigmao));
fixedrighttemp= ao*exp(-(locxright+dx)*(locxright+dx)/(2.*sigmao*sigmao));
#ifdef ONESIDED
*leftgc = fixedlefttemp;
*rightgc = fixedrighttemp;
#endif
/* evolve */
for (step=0; step < nsteps; step++) {
/* boundary conditions: keep endpoint temperatures fixed. */
#ifdef ONESIDED
MPI_Win_lock( MPI_LOCK_EXCLUSIVE, left, 0, rightwin );
MPI_Put(&(temperature[current][1]), 1, MPI_FLOAT, left, 0, 1, MPI_FLOAT, rightwin);
MPI_Win_unlock( left, rightwin );
MPI_Win_lock( MPI_LOCK_EXCLUSIVE, right, 0, leftwin );
MPI_Put(&(temperature[current][locpoints]), 1, MPI_FLOAT, right, 0, 1, MPI_FLOAT, leftwin);
MPI_Win_unlock( right, leftwin );
MPI_Win_lock( MPI_LOCK_EXCLUSIVE, rank, 0, leftwin );
temperature[current][0] = *leftgc;
MPI_Win_unlock( rank, leftwin );
MPI_Win_lock( MPI_LOCK_EXCLUSIVE, rank, 0, rightwin );
temperature[current][locpoints+1] = *rightgc;
MPI_Win_unlock( rank, rightwin );
#else
temperature[current][0] = fixedlefttemp;
temperature[current][locpoints+1] = fixedrighttemp;
/* send data rightwards */
MPI_Sendrecv(&(temperature[current][locpoints]), 1, MPI_FLOAT, right, righttag,
&(temperature[current][0]), 1, MPI_FLOAT, left, righttag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
/* send data leftwards */
MPI_Sendrecv(&(temperature[current][1]), 1, MPI_FLOAT, left, lefttag,
&(temperature[current][locpoints+1]), 1, MPI_FLOAT, right, lefttag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
#endif
for (i=1; i<locpoints+1; i++) {
temperature[new][i] = temperature[current][i] + dt*kappa/(dx*dx) *
(temperature[current][i+1] - 2.*temperature[current][i] +
temperature[current][i-1]) ;
}
time += dt;
if ((rank % 2) == 0)
usleep(10000u);
current = new;
new = 1 - current;
}
rms = 0.;
for (i=1;i<locpoints+1;i++) {
rms += (temperature[current][i])*(temperature[current][i]);
}
float totrms;
MPI_Reduce(&rms, &totrms, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0) {
totrms = sqrt(totrms/totpoints);
printf("Step = %d, Time = %g, RMS value = %g\n", step, time, totrms);
}
#ifdef ONESIDED
MPI_Win_free(&leftwin);
MPI_Win_free(&rightwin);
#endif
free(temperature[1]);
free(temperature[0]);
free(temperature);
free(x);
MPI_Finalize();
return 0;
}
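To try the one-sided path, the listing above has to be built with the ONESIDED macro defined; otherwise the MPI_Sendrecv branch is compiled. A usage sketch (compiler wrapper, file name and rank count are just examples):

mpicc -DONESIDED -O2 -o heat1d heat1d.c -lm
mpirun -np 4 ./heat1d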
This is a clone of Jonathan Dursi's post above, but with changes for MPI-3 RMA synchronization: the windows are locked once with MPI_Win_lock_all (passive target) before the time loop, the transfers inside the loop are completed with MPI_Win_flush instead of lock/unlock pairs, and MPI_Win_unlock_all is called before the windows are freed.
#define _BSD_SOURCE /* usleep */
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>
int main(int argc, char **argv) {
/* simulation parameters */
const int totpoints=1000;
int locpoints;
const float xleft = -12., xright = +12.;
float locxleft, locxright;
const float kappa = 1.;
const int nsteps=100;
/* data structures */
float *x;
float **temperature;
/* parameters of the original temperature distribution */
const float ao=1., sigmao=1.;
float fixedlefttemp, fixedrighttemp;
int current, new;
int step, i;
float time;
float dt, dx;
float rms;
int rank, size;
int start,end;
int left, right;
int lefttag=1, righttag=2;
/* MPI Initialization */
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD,&size);
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
locpoints = totpoints/size;
start = rank*locpoints;
end = (rank+1)*locpoints - 1;
if (rank == size-1)
end = totpoints-1;
locpoints = end-start+1;
left = rank-1;
if (left < 0) left = MPI_PROC_NULL;
right= rank+1;
if (right >= size) right = MPI_PROC_NULL;
#ifdef ONESIDED
if (rank == 0)
printf("Onesided: Allocating windows\n");
MPI_Win leftwin, rightwin;
float *leftgc, *rightgc;
MPI_Win_allocate(sizeof(float), sizeof(float), MPI_INFO_NULL, MPI_COMM_WORLD, &leftgc, &leftwin);
MPI_Win_allocate(sizeof(float), sizeof(float), MPI_INFO_NULL, MPI_COMM_WORLD, &rightgc, &rightwin);
MPI_Win_lock_all(MPI_MODE_NOCHECK, leftwin);
MPI_Win_lock_all(MPI_MODE_NOCHECK, rightwin);
#endif
/* set parameters */
dx = (xright-xleft)/(totpoints-1);
dt = dx*dx * kappa/10.;
locxleft = xleft + start*dx;
locxright = xleft + end*dx;
x = (float *)malloc((locpoints+2)*sizeof(float));
temperature = (float **)malloc(2 * sizeof(float *));
temperature[0] = (float *)malloc((locpoints+2)*sizeof(float));
temperature[1] = (float *)malloc((locpoints+2)*sizeof(float));
current = 0;
new = 1;
/* setup initial conditions */
time = 0.;
for (i=0; i<locpoints+2; i++) {
x[i] = locxleft + (i-1)*dx;
temperature[current][i] = ao*exp(-(x[i]*x[i]) / (2.*sigmao*sigmao));
}
fixedlefttemp = ao*exp(-(locxleft-dx)*(locxleft-dx) / (2.*sigmao*sigmao));
fixedrighttemp= ao*exp(-(locxright+dx)*(locxright+dx)/(2.*sigmao*sigmao));
#ifdef ONESIDED
*leftgc = fixedlefttemp;
*rightgc = fixedrighttemp;
#endif
/* evolve */
for (step=0; step < nsteps; step++) {
/* boundary conditions: keep endpoint temperatures fixed. */
/* RMA code assumes no conflicts in updates via MPI_Put.
If that is wrong, hopefully it is fine to use MPI_Accumulate
with MPI_SUM to accumulate the result. */
#ifdef ONESIDED
MPI_Put(&(temperature[current][1]), 1, MPI_FLOAT, left, 0, 1, MPI_FLOAT, rightwin);
MPI_Win_flush( left, rightwin );
MPI_Put(&(temperature[current][locpoints]), 1, MPI_FLOAT, right, 0, 1, MPI_FLOAT, leftwin);
MPI_Win_flush( right, leftwin );
temperature[current][0] = *leftgc;
MPI_Win_flush( rank, leftwin );
temperature[current][locpoints+1] = *rightgc;
MPI_Win_flush( rank, rightwin );
#else
#error Define ONESIDED...
#endif
for (i=1; i<locpoints+1; i++) {
temperature[new][i] = temperature[current][i] + dt*kappa/(dx*dx) *
(temperature[current][i+1] - 2.*temperature[current][i] +
temperature[current][i-1]) ;
}
time += dt;
if ((rank % 2) == 0)
usleep(10000u);
current = new;
new = 1 - current;
}
rms = 0.;
for (i=1;i<locpoints+1;i++) {
rms += (temperature[current][i])*(temperature[current][i]);
}
float totrms;
MPI_Reduce(&rms, &totrms, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0) {
totrms = sqrt(totrms/totpoints);
printf("Step = %d, Time = %g, RMS value = %g\n", step, time, totrms);
}
#ifdef ONESIDED
MPI_Win_unlock_all(leftwin);
MPI_Win_unlock_all(rightwin);
MPI_Win_free(&leftwin);
MPI_Win_free(&rightwin);
#endif
free(temperature[1]);
free(temperature[0]);
free(temperature);
free(x);
MPI_Finalize();
return 0;
}