I wrote a simple CUDA kernel to perform SAXPY over two column vectors of size 2^18.
I found that my GPU, a Tesla C2070, can run a maximum of 1024 threads per block, so I set my block dimensions to X = 1024, Y = 1, Z = 1 and my grid dimensions to X = 2^18 / 1024, Y = 1, Z = 1. I did this because I wanted to make sure that every thread in each block was being used.
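For reference, a minimal SAXPY kernel and launch matching that description might look like the following. This is my reconstruction, not the actual code from the question; the kernel name and signature are assumptions.

```cuda
#include <cuda_runtime.h>

// Minimal SAXPY kernel: y[i] = a * x[i] + y[i]
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // guard in case n is not a multiple of blockDim.x
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 18;           // 2^18 elements
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    // Block of 1024 threads, grid of 2^18 / 1024 = 256 blocks,
    // as described in the question.
    dim3 block(1024, 1, 1);
    dim3 grid(n / block.x, 1, 1);
    saxpy<<<grid, block>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```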
However, I discovered that running the kernel with block sizes of X = 512 and X = 128 consistently resulted in faster times than running the kernel with a block size of X = 1024.
Why is that? Aren't I wasting threads if my block size is less than 1024?
Level 1 BLAS functions like SAXPY are memory bandwidth limited. The operation
y <- alpha * x + y
only performs a single FMAD, but requires two loads and a store to and from global memory. Your C2070 has about 37.5 Gfloat/s of global memory bandwidth and about 500 GFMAD/s of single precision arithmetic throughput, so performance is determined by the memory controller rather than the ALUs. Reducing the number of threads per block in a memory bandwidth limited kernel often improves performance, because it reduces contention for the memory controller and cache resources and increases bandwidth utilisation.
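To put numbers on that imbalance (using the peak figures above, and assuming n = 2^18 as in the question):

    per element:  2 loads + 1 store = 3 floats moved, 1 FMAD performed
    memory side:  37.5 Gfloat/s / 3 floats per element ~ 12.5 Gelements/s
    compute side: 500 GFMAD/s                          ~  500 Gelements/s

So the ALUs could consume data roughly 40x faster than the memory system can supply it. For the whole 2^18-element vector that is about 3 MB of traffic, on the order of 20 microseconds at peak bandwidth, versus roughly half a microsecond of arithmetic.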
This is probably what is happening with your SAXPY kernel. You should be able to find the optimal block size by benchmarking, but in my experience it will be somewhere in the range of 128-384 threads per block on a Fermi device like your C2070.
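A benchmarking sweep over candidate block sizes could be sketched like this (it reuses the same assumed `saxpy` kernel signature as above and times launches with CUDA events; the specific sizes tried are just examples):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 18;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    const int sizes[] = {128, 192, 256, 384, 512, 1024};
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int s = 0; s < 6; ++s) {
        int block = sizes[s];
        int grid = (n + block - 1) / block;    // round up so all elements are covered

        saxpy<<<grid, block>>>(n, 2.0f, x, y); // warm-up launch

        cudaEventRecord(start);
        for (int rep = 0; rep < 100; ++rep)    // average over many launches
            saxpy<<<grid, block>>>(n, 2.0f, x, y);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block %4d: %.4f ms per launch\n", block, ms / 100.0f);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

Averaging over many launches matters here: a single 2^18-element SAXPY completes in tens of microseconds, which is comparable to launch overhead and timer resolution.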