I seem to recall getting the hint that I should try to avoid using chars in CUDA kernels, because of the SMs' preference for 32-bit integers. Is there some speed penalty for using them? For example, is it slower to do
int a[4];
int b = a[0] + a[1] + a[2] + a[3];
a[1] = a[3];
a2[0] = a[0];
than
char a[4];
char b = a[0] + a[1] + a[2] + a[3];
a[1] = a[3];
a2[0] = a[0];
in kernel code?
A quick note up front: In C/C++ the signedness of char is implementation defined. When using char to perform 8-bit integer arithmetic, it is therefore highly advisable to use signed char or unsigned char specifically, as required by the computation.
A negative performance impact from using char types in CUDA is likely. I would not advise the use of char types unless memory size constraints (including shared memory size limitations) or the nature of the computation specifically require it.
CUDA is a C++ derived language that follows basic C++ language specifications. C++ (and C) specifies that in an expression, data of a type narrower than int must be widened to int before entering the computation. Unless the integer instructions of the underlying hardware come with built-in conversion, this implies that additional conversion instructions are needed, which will increase dynamic instruction count and likely lower performance.
Note that compilers are allowed to deviate from the abstract C++ execution model under the "as-if" rule: As long as the resulting code behaves as if it follows the abstract model, i.e., its semantics are identical, it is allowed to eliminate these conversion operations. My recent experiments suggest that the CUDA 6.5 compiler is applying such optimizations aggressively and is therefore able to eliminate most conversion operations either outright or by merging them into other instructions.
However, this is not always possible. A simple contrived example is the following kernel, which contains an additional conversion instruction I2I.S32.S8 when instantiated with T = char versus T = int. I verified this by running cuobjdump --dump-sass on the executable to dump the machine code.
template <class T>
__global__ void kernel (T *out, const T *in)
{
    int tid = threadIdx.x;
    if (threadIdx.x < 128) {
        T foo = 5 * in[tid] + 7 * in[tid+1];
        out[tid] = foo * foo;
    }
}
Besides increased instruction count, negative performance impact from the use of char types can also result from lower memory bandwidth. The design of the GPU's memory subsystem is such that total achievable global memory bandwidth generally increases with the width of the accesses. One possible explanation for this is the finite depth of the internal queues that track memory accesses, but there may be other factors at work.
Where char types naturally occur due to the nature of a use case, such as image processing, one would want to look into the use of 32-bit compound types such as uchar4. The use of the wider type during load and store operations allows for improved memory bandwidth. CUDA has SIMD intrinsics for manipulating packed char data, and using those can beneficially reduce dynamic instruction count. Note that the SIMD intrinsics are fully backed by hardware only on Kepler GPUs, are fully emulated on Fermi GPUs, and are partially emulated on Maxwell GPUs. I have seen anecdotal evidence that even the emulated versions can still provide a performance benefit compared to handling each byte separately. I would suggest verifying that in the context of any particular use case.
There is also a very brief reference to this issue in section 11.1.3 of the CUDA Best Practices Guide:
The compiler must on occasion insert conversion instructions, introducing additional execution cycles. This is the case for...
- functions operating on char or short whose operands generally need to be converted to an int.
- ...
It's not possible to say in the generic sense whether it'll be faster, slower, or unchanged, though usually I'd not expect much difference. You're correct in saying that arithmetic on chars will be done in 32 bits, but whether this requires a type conversion will depend on the problem. In the question's example, I'd expect the compiler to store a and b in 32-bit registers, and in my experiments around this problem (note, without a full reproducing case it's hard to guarantee this) I didn't see a difference in SASS. For the region of the code where everything is done in registers I wouldn't expect a performance hit.
There is an impact, however, when the char variables are moved to and from memory. As the char will have to be cast into a 32-bit register before use, this will incur additional instructions. This may be a considerable impact, or it may not be.
Now, there are also some edge cases which may make a difference. The compiler might be able to pack multiple chars into a register and extract them with arithmetic (a register saving versus an arithmetic cost). You may even be able to force this behaviour using unions. Whether the saving is worth the extra instructions will vary on a case-by-case basis. I can't think of any other cases which would incur significant casting overhead at the moment.
Rather obviously, if you can store your variables in 1 byte instead of 4 you're going to get a 4x saving in memory and bandwidth required. There are things to consider though:
I don't know of any significant reasons not to use char if possible instead of int for arithmetic where everything lies in registers, though you will pay a conversion cost when reading/writing to memory. Storing an array as char instead of int should, if you're careful, give both a bandwidth and space saving.