Why does transposing a CUDA grid (but not its threadblocks) still slow down computation?

EDIT: It seems that, at least in this case, transposing the grid has a negative effect on the L2 cache bandwidth; this was obtained from the visual profiler. The reason why is not yet clear to me.

I have come across a GPU computing situation that requires transposing a CUDA grid. So, if block_{x,y} originally acted on data region d_{x,y}, it now acts on data region d_{y,x}; equivalently, block_{y,x} now acts on data region d_{x,y}. An example is presented in the following figure.
[figure: example of the original grid and the transposed grid over the same data regions]

It is worth mentioning that threads are not transposed inside each block; that is, once the block is located, the threadIdx.x and threadIdx.y values are used in the normal way for the x and y offsets, respectively.
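To make this concrete, here is a small address sketch (illustrative only, with made-up block coordinates; it is not part of the benchmark below): within one warp the column index still comes from threadIdx.x in both kernels, so each warp writes 32 consecutive ints (128 contiguous bytes) either way.

 // Offsets written by the first lanes of one warp (fixed ty) in both kernels.
 // The index expressions mirror normal_kernel and transp_kernel further down.
 #include <cstdio>

 int main(){
     const long n = 10000, bx = 1, by = 2, ty = 0;    // arbitrary example block/row
     for(long tx = 0; tx < 4; ++tx){                  // first lanes of the warp
         long normal = (by*32 + ty)*n + (bx*32 + tx); // i = by*32+ty, j = bx*32+tx
         long transp = (bx*32 + ty)*n + (by*32 + tx); // i = bx*32+ty, j = by*32+tx
         printf("lane %ld: normal offset %ld, transposed offset %ld\n",
                tx, normal, transp);
     }
     return 0;  // consecutive lanes differ by 1 in both cases
 }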

From what I know, in theory this design should not harm performance, as the memory coalescing pattern is still preserved, i.e., threads inside a block are not transposed; it is just the grid that has its blocks re-arranged. However, I found that when transposing the grid, the kernel runs approximately 2x slower than in the normal case. I made a toy example to illustrate the situation.

➜  transpose-grid ./prog 10000 10000 100 0
init data.....................done: zero matrix of 10000 x 10000
copy data to GPU..............done
preparing grid................done: block(32, 32, 1), grid(313, 313, 1)
normal_kernel (100 rep).......done: 0.935132 ms
verifying correctness.........ok
➜  transpose-grid ./prog 10000 10000 100 1
init data.....................done: zero matrix of 10000 x 10000
copy data to GPU..............done
preparing grid................done: block(32, 32, 1), grid(313, 313, 1)
transp_kernel (100 rep).......done: 1.980445 ms
verifying correctness.........ok

I would really appreciate any explanation for this issue. Here is the source code to reproduce the behavior.

 // -----------------------------------
 // can compile as nvcc main.cu -o prog
 // -----------------------------------

 #include <cuda.h>
 #include <cstdio>

 #define BSIZE2D 32

 // each block (blockIdx.x, blockIdx.y) fills the 32x32 tile whose upper-left
 // element is (blockIdx.y*32, blockIdx.x*32)
 __global__ void normal_kernel(int *dmat, const int m, const int n){
     const int i = blockIdx.y*blockDim.y + threadIdx.y;
     const int j = blockIdx.x*blockDim.x + threadIdx.x;
     if(i < m && j < n){
         dmat[i*n + j] = 1;
     }
 }

 // same as normal_kernel but with the roles of blockIdx.x and blockIdx.y swapped;
 // threadIdx.x still indexes the column, so intra-warp accesses stay contiguous
 __global__ void transp_kernel(int *dmat, const int m, const int n){
     const int i = blockIdx.x*blockDim.x + threadIdx.y;
     const int j = blockIdx.y*blockDim.y + threadIdx.x;
     if(i < m && j < n){
         dmat[i*n + j] = 1;
     }
 }



 int verify(int *hmat, const int m, const int n){
     printf("verifying correctness........."); fflush(stdout);
     for(int i=0; i<m*n; ++i){
         if(hmat[i] != 1){
             fprintf(stderr, "Incorrect value at m[%i,%i] = %i\n", i/n, i%n, hmat[i]);
             return 0;
         }
     }
     printf("ok\n"); fflush(stdout);
     return 1;
 }
 int main(int argc, char **argv){
     if(argc != 5){
         printf("\nrun as ./prog m n r t\n\nr = number of repeats\nt = transpose (1 or 0)\n");
         exit(EXIT_FAILURE);
     }
     const int m = atoi(argv[1]);
     const int n = atoi(argv[2]);
     const int r = atoi(argv[3]);
     const int t = atoi(argv[4]);
     const unsigned int size = m*n;
     cudaEvent_t start, stop;
     cudaEventCreate(&start);
     cudaEventCreate(&stop);
     float time;
     int *hmat, *dmat;


     printf("init data....................."); fflush(stdout);
     hmat = (int*)malloc(sizeof(int)*(size));
     for(int i=0; i<size; ++i){
         hmat[i] = 0;
     }
     printf("done: zero matrix of %i rows x %i cols\n", m, n);




     printf("copy data to GPU.............."); fflush(stdout);
     cudaMalloc(&dmat, sizeof(int)*(size));
     cudaMemcpy(dmat, hmat, sizeof(int)*(size), cudaMemcpyHostToDevice);
     printf("done\n");



     printf("preparing grid................"); fflush(stdout);
     dim3 block(BSIZE2D, BSIZE2D, 1);
     dim3 grid;
     // when transposing, grid.x must cover the m rows and grid.y the n columns
     if(t){
         grid = dim3((m + BSIZE2D - 1)/BSIZE2D, (n + BSIZE2D - 1)/BSIZE2D, 1);
     }
     else{
         grid = dim3((n + BSIZE2D - 1)/BSIZE2D, (m + BSIZE2D - 1)/BSIZE2D, 1);
     }
     printf("done: block(%i, %i, %i), grid(%i, %i, %i)\n", block.x, block.y, block.z, grid.x, grid.y, grid.z);


     if(t){
         printf("transp_kernel (%3i rep).......", r); fflush(stdout);
         cudaEventRecord(start, 0);
         for(int i=0; i<r; ++i){
             transp_kernel<<<grid, block>>>(dmat, m, n);
             cudaDeviceSynchronize();
         }
         cudaEventRecord(stop,0);
         cudaEventSynchronize(stop);
         cudaEventElapsedTime(&time, start, stop); // that's our time!
         printf("done: %f ms\n", time/(float)r);
     }
     else{
         printf("normal_kernel (%3i rep).......", r); fflush(stdout);
         cudaEventRecord(start, 0);
         for(int i=0; i<r; ++i){
             normal_kernel<<<grid, block>>>(dmat, m, n);
             cudaDeviceSynchronize();
         }
         cudaEventRecord(stop,0);
         cudaEventSynchronize(stop);
         cudaEventElapsedTime(&time, start, stop); // that's our time!
         printf("done: %f ms\n", time/(float)r);
     }


     cudaMemcpy(hmat, dmat, sizeof(int)*size, cudaMemcpyDeviceToHost);
     verify(hmat, m, n);
     exit(EXIT_SUCCESS);
 }
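For completeness, the program above does not check CUDA API return codes. A minimal checking macro along the following lines can be added when experimenting (the CHECK_CUDA name is just an illustration, not part of the reproducer):

 // Minimal sketch of CUDA error checking: wrap API calls, and query the last
 // error after a kernel launch.
 #include <cuda_runtime.h>
 #include <cstdio>
 #include <cstdlib>

 #define CHECK_CUDA(call)                                              \
     do {                                                              \
         cudaError_t err_ = (call);                                    \
         if(err_ != cudaSuccess){                                      \
             fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                     cudaGetErrorString(err_), __FILE__, __LINE__);    \
             exit(EXIT_FAILURE);                                       \
         }                                                             \
     } while(0)

 // usage:
 //   CHECK_CUDA(cudaMalloc(&dmat, sizeof(int)*size));
 //   normal_kernel<<<grid, block>>>(dmat, m, n);
 //   CHECK_CUDA(cudaGetLastError());
 //   CHECK_CUDA(cudaDeviceSynchronize());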
asked Feb 08 '18 by Cristobal Navarro


1 Answer

Since I could not find any literature on this topic, what follows is my best-guess explanation, based mostly on experience (an old problem of mine with memory read speed).

As you wrote, your example preserves the memory coalescing pattern, but only at the warp level (consecutive 32 threads). To achieve full speed, however, coalescing is also needed at the inter-warp level, and here it is not clear whether such coalescing actually takes place, or whether the cache and memory simply behave better in this scenario (probably, as described here, we get better utilization of the memory burst mode).

So in your normal_kernel execution, not only is a single warp coalesced, but so are warps from the neighbouring block(s).
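To make that concrete, here is a small sketch (my own illustration, not part of the modified benchmark) of the first element written by thread (0, 0) of blocks with increasing blockIdx.x and blockIdx.y fixed at 0. In the normal kernel the starting offsets are only 32 ints apart, so neighbouring blocks extend the same contiguous rows; in the transposed kernel they are 32*n ints apart, i.e. in completely different rows.

 // Starting offset written by thread (0, 0) as blockIdx.x grows (blockIdx.y = 0).
 #include <cstdio>

 int main(){
     const long n = 10240;                  // matrix width from the experiments
     for(long bx = 0; bx < 4; ++bx){
         long normal = 0*n + bx*32;         // i = blockIdx.y*32, j = blockIdx.x*32
         long transp = (bx*32)*n + 0;       // i = blockIdx.x*32, j = blockIdx.y*32
         printf("blockIdx.x = %ld: normal start %8ld (step 32), "
                "transposed start %8ld (step 32*n = %ld)\n",
                bx, normal, transp, 32*n);
     }
     return 0;
 }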

To check this on your example, I modified your code to use different block sizes; here are my results on a 1080 Ti:

Block size (32, 32), the same as yours:

~$ ./prog 10240 10240 100 0
init data.....................done: zero matrix of 10240 rows x 10240 cols
copy data to GPU..............done
preparing grid................done: block(32, 32, 1), grid(320, 320, 1)
normal_kernel (100 rep).......done: 1.020545 ms
verifying correctness.........ok

~$ ./prog 10240 10240 100 1
init data.....................done: zero matrix of 10240 rows x 10240 cols
copy data to GPU..............done
preparing grid................done: block(32, 32, 1), grid(320, 320, 1)
transp_kernel (100 rep).......done: 1.564084 ms
verifying correctness.........ok

Block size (64, 16) (unfortunately we cannot use (64, 64), since 64 × 64 = 4096 threads exceeds the 1024-thread-per-block limit):

~$ ./prog 10240 10240 100 0
init data.....................done: zero matrix of 10240 rows x 10240 cols
copy data to GPU..............done
preparing grid................done: block(64, 16, 1), grid(160, 640, 1)
normal_kernel (100 rep).......done: 1.020420 ms
verifying correctness.........ok

~$ ./prog 10240 10240 100 1
init data.....................done: zero matrix of 10240 rows x 10240 cols
copy data to GPU..............done
preparing grid................done: block(64, 16, 1), grid(160, 640, 1)
transp_kernel (100 rep).......done: 1.205506 ms
verifying correctness.........ok

Block size (128, 8):

~$ ./prog 10240 10240 100 0
init data.....................done: zero matrix of 10240 rows x 10240 cols
copy data to GPU..............done
preparing grid................done: block(128, 8, 1), grid(80, 1280, 1)
normal_kernel (100 rep).......done: 1.019547 ms
verifying correctness.........ok

~$ ./prog 10240 10240 100 1
init data.....................done: zero matrix of 10240 rows x 10240 cols
copy data to GPU..............done
preparing grid................done: block(128, 8, 1), grid(80, 1280, 1)
transp_kernel (100 rep).......done: 1.058236 ms
verifying correctness.........ok

I'm not sure whether this helps with your particular problem, but at least we have some more data to discuss.
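One possible reading of these numbers (again only a guess on my part): each block row writes blockDim.x consecutive ints on its own, so wider blocks produce longer contiguous write segments regardless of which block happens to be scheduled next to them, which would explain why the transposed grid loses less with the wider shapes. A trivial sketch of the segment sizes for the tested shapes:

 // Contiguous bytes written per block row for the tested block widths.
 #include <cstdio>

 int main(){
     const int widths[3] = {32, 64, 128};
     for(int k = 0; k < 3; ++k){
         printf("blockDim.x = %3d -> %4zu contiguous bytes per block row\n",
                widths[k], widths[k]*sizeof(int));
     }
     return 0;
 }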

answered by Krzysztof