I would like to read (BS_X+1)*(BS_Y+1) global memory locations by BS_x*BS_Y threads moving the contents to the shared memory and I have developed the following code.
int i = threadIdx.x;
int j = threadIdx.y;
int idx = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
int idy = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;
int index1 = j*BLOCK_SIZE_Y+i;
int i1 = (index1)%(BLOCK_SIZE_X+1);
int j1 = (index1)/(BLOCK_SIZE_Y+1);
int i2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)%(BLOCK_SIZE_X+1);
int j2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)/(BLOCK_SIZE_Y+1);
__shared__ double Ezx_h_shared_ext[BLOCK_SIZE_X+1][BLOCK_SIZE_Y+1];
Ezx_h_shared_ext[i1][j1]=Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1)];
if ((i2<(BLOCK_SIZE_X+1))&&(j2<(BLOCK_SIZE_Y+1)))
Ezx_h_shared_ext[i2][j2]=Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j2)*xdim+(blockIdx.x*BLOCK_SIZE_X+i2)];
In my understanding, coalescing is the parallel equivalent of consecutive memory reads of sequential processing. How can I detect now if the global memory accesses are coalesced? I remark that there is an index jump from (i1,j1) to (i2,j2). Thanks in advance.
Coalesced memory access or memory coalescing refers to combining multiple memory accesses into a single transaction. On the K20 GPUs on Stampede, every successive 128 bytes ( 32 single precision words) memory can be accessed by a warp (32 consecutive threads) in a single transaction.
Because CUDA shared memory is located on chip, its memory bandwidth is much larger than the global memory which is located off chip. Therefore, CUDA kernel optimization by caching memory access on shared memory can improve the performance of some operations significantly, especially for those memory-bound operations.
Shared memory is a CUDA memory space that is shared by all threads in a thread block. In this case shared means that all threads in a thread block can write and read to block-allocated shared memory, and all changes to this memory will be eventually available to all threads in the block.
To declare that a variable in a kernel is to be stored in shared memory, use the __shared__ qualifier keyword beside the variable's name in its definition (see Listing 6.1). Listing 6.1 shows two variables declared as __shared__, an integer called i, and a floating point array called f_array.
I've evaluated the memory accesses of your code with a hand-written coalescing analyzer. The evaluation shows the code less exploits the coalescing. Here is the coalescing analyzer that you may find useful:
#include <stdio.h>
#include <malloc.h>
typedef struct dim3_t{
int x;
int y;
} dim3;
// KERNEL LAUNCH PARAMETERS
#define GRIDDIMX 4
#define GRIDDIMY 4
#define BLOCKDIMX 16
#define BLOCKDIMY 16
// ARCHITECTURE DEPENDENT
// number of threads aggregated for coalescing
#define COALESCINGWIDTH 32
// number of bytes in one coalesced transaction
#define CACHEBLOCKSIZE 128
#define CACHE_BLOCK_ADDR(addr,size) (addr*size)&(~(CACHEBLOCKSIZE-1))
int main(){
// fixed dim3 variables
// grid and block size
dim3 blockDim,gridDim;
blockDim.x=BLOCKDIMX;
blockDim.y=BLOCKDIMY;
gridDim.x=GRIDDIMX;
gridDim.y=GRIDDIMY;
// counters
int unq_accesses=0;
int *unq_addr=(int*)malloc(sizeof(int)*COALESCINGWIDTH);
int total_unq_accesses=0;
// iter over total number of threads
// and count the number of memory requests (the coalesced requests)
int I, II, III;
for(I=0; I<GRIDDIMX*GRIDDIMY; I++){
dim3 blockIdx;
blockIdx.x = I%GRIDDIMX;
blockIdx.y = I/GRIDDIMX;
for(II=0; II<BLOCKDIMX*BLOCKDIMY; II++){
if(II%COALESCINGWIDTH==0){
// new coalescing bunch
total_unq_accesses+=unq_accesses;
unq_accesses=0;
}
dim3 threadIdx;
threadIdx.x=II%BLOCKDIMX;
threadIdx.y=II/BLOCKDIMX;
////////////////////////////////////////////////////////
// Change this section to evaluate different accesses //
////////////////////////////////////////////////////////
// do your indexing here
#define BLOCK_SIZE_X BLOCKDIMX
#define BLOCK_SIZE_Y BLOCKDIMY
#define xdim 32
int i = threadIdx.x;
int j = threadIdx.y;
int idx = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
int idy = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;
int index1 = j*BLOCK_SIZE_Y+i;
int i1 = (index1)%(BLOCK_SIZE_X+1);
int j1 = (index1)/(BLOCK_SIZE_Y+1);
int i2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)%(BLOCK_SIZE_X+1);
int j2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)/(BLOCK_SIZE_Y+1);
// calculate the accessed location and offset here
// change the line "Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1)];" to
int addr = (blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1);
int size = sizeof(double);
//////////////////////////
// End of modifications //
//////////////////////////
printf("tid (%d,%d) from blockid (%d,%d) accessing to block %d\n",threadIdx.x,threadIdx.y,blockIdx.x,blockIdx.y,CACHE_BLOCK_ADDR(addr,size));
// check whether it can be merged with existing requests or not
short merged=0;
for(III=0; III<unq_accesses; III++){
if(CACHE_BLOCK_ADDR(addr,size)==CACHE_BLOCK_ADDR(unq_addr[III],size)){
merged=1;
break;
}
}
if(!merged){
// new cache block accessed over this coalescing width
unq_addr[unq_accesses]=CACHE_BLOCK_ADDR(addr,size);
unq_accesses++;
}
}
}
printf("%d threads make %d memory transactions\n",GRIDDIMX*GRIDDIMY*BLOCKDIMX*BLOCKDIMY, total_unq_accesses);
}
The code will run for every thread of the grid and calculates the number of merged requests, metric of memory access coalescing.
To use the analyzer, paste the index calculation portion of your code in the specified region and decompose the memory accesses (array) into 'address' and 'size'. I've already done this for your code where the indexings are:
int i = threadIdx.x;
int j = threadIdx.y;
int idx = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
int idy = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;
int index1 = j*BLOCK_SIZE_Y+i;
int i1 = (index1)%(BLOCK_SIZE_X+1);
int j1 = (index1)/(BLOCK_SIZE_Y+1);
int i2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)%(BLOCK_SIZE_X+1);
int j2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)/(BLOCK_SIZE_Y+1);
and the memory access is:
Ezx_h_shared_ext[i1][j1]=Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1)];
The analyzer reports 4096 threads access to 4064 cache blocks. Run the code for your actual grid and block size and analyze the coalescing behavior.
As GPUs have evolved, the requirements for getting coalesced accesses have become less restrictive. Your description of coalesced accesses is more accurate for the earlier GPU architectures than the more recent ones. In particular, Fermi (compute capability 2.0) significantly loosened the requirements. On Fermi and later, it is not important to access the memory locations consecutively. Instead, focus has shifted to accessing memory with as few memory transactions as possible. On Fermi, global memory transactions are 128 bytes wide. So, when the 32 threads in a warp hit an instruction that performs a load or store, 128-byte transactions will be scheduled to service all the threads in the warp. Performance then depends on how many transactions are necessary. If all the threads access values within a 128-byte area that is aligned to 128 bytes, a single transaction is necessary. If all the threads access values in different 128-byte areas, 32 transactions will be necessary. That would be the worst case scenario for servicing the requests for a single instruction in a warp.
You use one of the CUDA profilers to determine the average for how many transactions were required for servicing the requests. The number should be as close to 1 as possible. Higher numbers mean that you should see if there are opportunities for optimizing the memory accesses in your kernel.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With