I googled around a bit, but it is still not clear to me whether GPUs programmed with CUDA can use instructions similar to the SSE SIMD extensions; for instance, whether we can sum two vectors of double-precision floats, each with 4 values. If so, I wonder whether it is better to use as many lightweight threads as there are values in the vector (4 here) or to use SIMD.
CUDA is a parallel computing platform and programming model developed by NVIDIA for general computing on its own GPUs (graphics processing units). CUDA enables developers to speed up compute-intensive applications by harnessing the power of GPUs for the parallelizable part of the computation.
Using the CUDA Toolkit you can accelerate your C or C++ applications by updating the computationally intensive portions of your code to run on GPUs. To accelerate your applications, you can call functions from drop-in libraries as well as develop custom applications using languages including C, C++, Fortran and Python.
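As a sketch of the drop-in-library route, summing two double-precision vectors with Thrust (a C++ template library that ships with the CUDA Toolkit) needs no hand-written kernel. The vector size and fill values here are just illustrative:

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

int main() {
    // Two small vectors of doubles allocated on the device.
    thrust::device_vector<double> a(4, 1.5);
    thrust::device_vector<double> b(4, 2.5);
    thrust::device_vector<double> c(4);

    // c[i] = a[i] + b[i], executed on the GPU.
    thrust::transform(a.begin(), a.end(), b.begin(),
                      c.begin(), thrust::plus<double>());
    return 0;
}
```

This compiles with `nvcc` and runs entirely on the GPU, so it cannot be exercised without CUDA hardware.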
CUDA programs compile to the PTX instruction set. That instruction set does not contain SSE-style wide SIMD instructions for floating point, so CUDA programs cannot make explicit use of that kind of SIMD.
However, the whole idea of CUDA is to do SIMD on a grand scale. Individual threads are grouped into warps, within which every thread executes exactly the same sequence of instructions (although some instructions may be suppressed for some threads, giving the illusion of different execution paths). NVIDIA calls this Single Instruction, Multiple Thread (SIMT), but it is essentially SIMD.
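For the vectors in the question, the idiomatic CUDA approach is therefore one lightweight thread per element, letting the hardware group the threads into warps. A minimal sketch (the kernel name and launch parameters are illustrative, not from the question):

```cuda
// One thread per element: a warp executes the same add
// instruction across 32 elements in lockstep (SIMT).
__global__ void vecAdd(const double* a, const double* b,
                       double* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // threads past the end are masked off
        c[i] = a[i] + b[i];
}

// Host-side launch for n elements, assuming d_a, d_b, d_c
// are device pointers:
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;
//   vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);
```

For only 4 elements the launch overhead dominates, but the same pattern scales to millions of elements, which is where GPUs pay off.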
As was mentioned in a comment on one of the replies, NVIDIA GPUs do have some SIMD instructions. They operate on an unsigned int on a per-byte and per-halfword basis. As of July 2015, there are several flavours of operations such as packed addition and subtraction, absolute value, averaging, minimum/maximum, comparison, and absolute difference, exposed in device code as intrinsics like __vadd2, __vadd4, __vsub2, and __vsub4 (with signed and unsigned variants for most of them). Note that these are integer operations only; they do not help with the double-precision case in the question.