I would like to write a 2D electromagnetic Finite-Difference Time-Domain (FDTD) code in CUDA. The C code for the magnetic field update is the following:
// --- Update for Hy and Hx
for(int i=n1; i<=n2; i++)
    for(int j=n11; j<=n21; j++){
        Hy[i*ydim+j] = A[i*ydim+j]*Hy[i*ydim+j]
                     + B[i*ydim+j]*(Ezx[(i+1)*ydim+j]-Ezx[i*ydim+j] + Ezy[(i+1)*ydim+j]-Ezy[i*ydim+j]);
        Hx[i*ydim+j] = G[i*ydim+j]*Hx[i*ydim+j]
                     - H[i*ydim+j]*(Ezx[i*ydim+j+1]-Ezx[i*ydim+j] + Ezy[i*ydim+j+1]-Ezy[i*ydim+j]);
    }
My first parallelization attempt was the following kernel:
__global__ void H_update_kernel(double* Hx_h, double* Hy_h, double* Ezx_h, double* Ezy_h,
                                double* A_h, double* B_h, double* G_h, double* H_h,
                                int n1, int n2, int n11, int n21)
{
    int idx = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
    int idy = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;

    if ((idx <= n2 && idx >= n1) && (idy <= n21 && idy >= n11)) {
        Hy_h[idx*ydim+idy] = A_h[idx*ydim+idy]*Hy_h[idx*ydim+idy]
                           + B_h[idx*ydim+idy]*(Ezx_h[(idx+1)*ydim+idy]-Ezx_h[idx*ydim+idy]
                                              + Ezy_h[(idx+1)*ydim+idy]-Ezy_h[idx*ydim+idy]);
        Hx_h[idx*ydim+idy] = G_h[idx*ydim+idy]*Hx_h[idx*ydim+idy]
                           - H_h[idx*ydim+idy]*(Ezx_h[idx*ydim+idy+1]-Ezx_h[idx*ydim+idy]
                                              + Ezy_h[idx*ydim+idy+1]-Ezy_h[idx*ydim+idy]);
    }
}
However, after also checking with the Visual Profiler, I was unsatisfied with this solution for two reasons: 1) the memory accesses are poorly coalesced (two threads with consecutive threadIdx.x access addresses that are ydim elements apart, since the linear index is idx*ydim+idy); 2) shared memory is not used.
I then switched to the following solution:
__global__ void H_update_kernel(double* Hx_h, double* Hy_h, double* Ezx_h, double* Ezy_h,
                                double* A_h, double* B_h, double* G_h, double* H_h,
                                int n1, int n2, int n11, int n21)
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    int idx = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
    int idy = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;

    // Linear thread index within the block, and the two (row, column) pairs
    // it maps to inside the (BLOCK_SIZE_X+1) x (BLOCK_SIZE_Y+1) shared tile.
    int index1 = j*BLOCK_SIZE_Y + i;
    int i1 = index1%(BLOCK_SIZE_X+1);
    int j1 = index1/(BLOCK_SIZE_Y+1);
    int i2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y + index1)%(BLOCK_SIZE_X+1);
    int j2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y + index1)/(BLOCK_SIZE_Y+1);

    __shared__ double Ezx_h_shared[BLOCK_SIZE_X+1][BLOCK_SIZE_Y+1];
    __shared__ double Ezy_h_shared[BLOCK_SIZE_X+1][BLOCK_SIZE_Y+1];   // not yet filled or used in this version

    // First load pass: every thread stages one Ezx element of the tile.
    if (((blockIdx.x*BLOCK_SIZE_X+i1) < xdim) && ((blockIdx.y*BLOCK_SIZE_Y+j1) < ydim))
        Ezx_h_shared[i1][j1] = Ezx_h[(blockIdx.x*BLOCK_SIZE_X+i1)*ydim + (blockIdx.y*BLOCK_SIZE_Y+j1)];

    // Second, guarded load pass: a few threads stage the remaining halo elements
    // (the row stride here must be ydim, as in the first pass, not xdim).
    if (((i2 < (BLOCK_SIZE_X+1)) && (j2 < (BLOCK_SIZE_Y+1)))
        && (((blockIdx.x*BLOCK_SIZE_X+i2) < xdim) && ((blockIdx.y*BLOCK_SIZE_Y+j2) < ydim)))
        Ezx_h_shared[i2][j2] = Ezx_h[(blockIdx.x*BLOCK_SIZE_X+i2)*ydim + (blockIdx.y*BLOCK_SIZE_Y+j2)];

    __syncthreads();

    if ((idx <= n2 && idx >= n1) && (idy <= n21 && idy >= n11)) {
        Hy_h[idx*ydim+idy] = A_h[idx*ydim+idy]*Hy_h[idx*ydim+idy]
                           + B_h[idx*ydim+idy]*(Ezx_h_shared[i+1][j]-Ezx_h_shared[i][j]
                                              + Ezy_h[(idx+1)*ydim+idy]-Ezy_h[idx*ydim+idy]);
        Hx_h[idx*ydim+idy] = G_h[idx*ydim+idy]*Hx_h[idx*ydim+idy]
                           - H_h[idx*ydim+idy]*(Ezx_h_shared[i][j+1]-Ezx_h_shared[i][j]
                                              + Ezy_h[idx*ydim+idy+1]-Ezy_h[idx*ydim+idy]);
    }
}
The index trick is needed so that a block of BS_x * BS_y threads can load (BS_x+1)*(BS_y+1) global memory locations into shared memory. I believe this choice is better than the previous one because of the shared memory use, although not all the accesses are really coalesced; see
Analyzing memory access coalescing of my CUDA kernel
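As a sanity check of the index trick (not part of the original code, and assuming square 16x16 blocks, i.e. BLOCK_SIZE_X == BLOCK_SIZE_Y == 16), the two guarded loads should together write every element of the 17x17 shared tile exactly once: the first pass covers linear tile elements 0..255, and the guarded second pass covers the remaining 33 halo elements 256..288. A small host-side program can verify this:
#include <stdio.h>

#define BS 16   // assumed square block size, BLOCK_SIZE_X == BLOCK_SIZE_Y == BS

int main(void)
{
    int covered[(BS + 1) * (BS + 1)] = {0};   // (BS+1) x (BS+1) shared tile

    for (int index1 = 0; index1 < BS * BS; ++index1) {
        // First load pass: one tile element per thread.
        int i1 = index1 % (BS + 1);
        int j1 = index1 / (BS + 1);
        covered[j1 * (BS + 1) + i1]++;

        // Second, guarded load pass: only a few threads participate.
        int i2 = (BS * BS + index1) % (BS + 1);
        int j2 = (BS * BS + index1) / (BS + 1);
        if (i2 < BS + 1 && j2 < BS + 1)
            covered[j2 * (BS + 1) + i2]++;
    }

    int ok = 1;
    for (int k = 0; k < (BS + 1) * (BS + 1); ++k)
        if (covered[k] != 1) ok = 0;
    printf("tile covered exactly once: %s\n", ok ? "yes" : "no");
    return 0;
}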
My question is whether anyone can point me to a solution with better coalesced memory access. Thank you.
Thank you for providing the profiling information.
Your second algorithm is better because you are getting a higher IPC. Still, on CC 2.0 the maximum IPC is 2.0, so your average of 1.018 in the second solution means that only half of the available compute power is utilized. Normally that would indicate a memory-bound algorithm, but I'm not sure in your case, because almost all the code in your kernel is inside if conditionals. A significant amount of warp divergence will hurt performance, and I haven't checked whether instructions whose results are not used count towards the IPC.
You may want to look into reading through the texture cache. Textures are optimized for 2D spatial locality and better support semi-random 2D access. It may help your [i][j] type accesses.
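For example, here is a minimal sketch of that idea using the read-only data cache via __ldg() (my assumption, not your code: the __ldg() intrinsic needs compute capability 3.5 or newer, so on your CC 2.0 card the texture path proper would instead go through an explicit texture bind, and for double precision an int2 fetch converted with __hiloint2double; ydim and the block sizes are compile-time constants as in your kernels, and the kernel name is made up):

__global__ void H_update_kernel_ldg(double* Hx_h, double* Hy_h,
                                    const double* __restrict__ Ezx_h,
                                    const double* __restrict__ Ezy_h,
                                    const double* __restrict__ A_h,
                                    const double* __restrict__ B_h,
                                    const double* __restrict__ G_h,
                                    const double* __restrict__ H_h,
                                    int n1, int n2, int n11, int n21)
{
    int idx = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
    int idy = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;

    if ((idx <= n2 && idx >= n1) && (idy <= n21 && idy >= n11)) {
        int g = idx*ydim + idy;                 // linear index, computed once
        double ezx = __ldg(&Ezx_h[g]);          // read-only loads go through the
        double ezy = __ldg(&Ezy_h[g]);          // texture / read-only cache
        Hy_h[g] = __ldg(&A_h[g])*Hy_h[g]
                + __ldg(&B_h[g])*(__ldg(&Ezx_h[g+ydim]) - ezx + __ldg(&Ezy_h[g+ydim]) - ezy);
        Hx_h[g] = __ldg(&G_h[g])*Hx_h[g]
                - __ldg(&H_h[g])*(__ldg(&Ezx_h[g+1]) - ezx + __ldg(&Ezy_h[g+1]) - ezy);
    }
}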
In the current solution, make sure that it is the Y position ([j]) that varies between two threads with adjacent thread IDs: since the arrays are laid out as i*ydim+j, that is the index with unit stride, and varying it keeps each warp's memory accesses as sequential as possible.
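A minimal sketch of that remapping (again an assumption on my part, not your exact code; the kernel name is made up, and the launch configuration has to be swapped so that the x grid dimension covers the j range and the y dimension the i range):

__global__ void H_update_kernel_coalesced(double* Hx_h, double* Hy_h, double* Ezx_h, double* Ezy_h,
                                          double* A_h, double* B_h, double* G_h, double* H_h,
                                          int n1, int n2, int n11, int n21)
{
    // threadIdx.x now walks along j, the contiguous dimension of the
    // row-major i*ydim + j layout, so a warp touches consecutive doubles.
    int j = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;   // column (fast) index
    int i = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;   // row (slow) index

    if ((i >= n1 && i <= n2) && (j >= n11 && j <= n21)) {
        int g = i*ydim + j;                          // linear index, computed once
        Hy_h[g] = A_h[g]*Hy_h[g]
                + B_h[g]*(Ezx_h[g+ydim] - Ezx_h[g] + Ezy_h[g+ydim] - Ezy_h[g]);
        Hx_h[g] = G_h[g]*Hx_h[g]
                - H_h[g]*(Ezx_h[g+1] - Ezx_h[g] + Ezy_h[g+1] - Ezy_h[g]);
    }
}

The shared-memory tiling of your second kernel can then be layered on top of this mapping, with the tile loads also issued along j so that they coalesce as well.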
It could be that the compiler has already optimized this for you, but you recalculate idx*ydim+idy many times. Try calculating it once and reusing the result. That would have more potential for improvement if your algorithm were compute bound, though.
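For illustration, the guarded update of your second kernel with that index hoisted (a fragment only, reusing the names already defined there):

    if ((idx <= n2 && idx >= n1) && (idy <= n21 && idy >= n11)) {
        const int g = idx*ydim + idy;                // computed once, reused below
        Hy_h[g] = A_h[g]*Hy_h[g]
                + B_h[g]*(Ezx_h_shared[i+1][j] - Ezx_h_shared[i][j]
                        + Ezy_h[g+ydim]        - Ezy_h[g]);
        Hx_h[g] = G_h[g]*Hx_h[g]
                - H_h[g]*(Ezx_h_shared[i][j+1] - Ezx_h_shared[i][j]
                        + Ezy_h[g+1]           - Ezy_h[g]);
    }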