
Is there a penalty to using char variables in CUDA kernels?

I seem to recall getting the hint that I should avoid using chars in CUDA kernels, because the SMs prefer 32-bit integers. Is there some speed penalty for using them? For example, is it slower to do

char a[4], a2[4];
char b = a[0] + a[1] + a[2] + a[3];
a[1] = a[3];
a2[0] = a[0];

than

int a[4], a2[4];
int b = a[0] + a[1] + a[2] + a[3];
a[1] = a[3];
a2[0] = a[0];

in kernel code?

Notes:

  • I'm interested in the penalty/ies for doing arithmetic with char values, performing comparisons, and reading and writing them to memory.
asked Nov 18 '14 by einpoklum


2 Answers

A quick note up front: In C/C++ the signedness of char is implementation defined. When using char to perform 8-bit integer arithmetic, it is therefore highly advisable to use signed char or unsigned char specifically as required by the computation.

A negative performance impact from using char types in CUDA is likely. I would not advise the use of char types unless memory size constraints (including shared memory size limitations) or the nature of the computation specifically require it.

CUDA is a C++ derived language that follows basic C++ language specifications. C++ (and C) specifies that in an expression data of a type narrower than int must be widened to int before entering the computation. Unless the integer instructions of the underlying hardware come with built-in conversion, this implies that additional conversion instructions are needed, which will increase dynamic instruction count and likely lower performance.

Note that compilers are allowed to deviate from the abstract C++ execution model under the "as-if" rule: As long as the resulting code behaves as if it follows the abstract model, i.e., its semantics are identical, it is allowed to eliminate these conversion operations. My recent experiments suggest that the CUDA 6.5 compiler is applying such optimizations aggressively and is therefore able to eliminate most conversion operations either outright or by merging them into other instructions.

However, this is not always possible. A simple contrived example is the following kernel, which contains an additional conversion instruction I2I.S32.S8 when instantiated with T = char versus T = int. I verified this by running cuobjdump --dump-sass on the executable to dump the machine code.

template <class T>
__global__ void kernel (T *out, const T *in)
{
    int tid = threadIdx.x;
    if (threadIdx.x < 128) {
        T foo = 5 * in[tid] + 7 * in[tid+1];
        out[tid] = foo * foo;
    }
}
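For reference, one way to reproduce such an inspection is to instantiate both variants explicitly and then dump the machine code with cuobjdump (the file names and architecture flag below are illustrative):

```cuda
// Explicit instantiations so both variants appear in the object file.
template __global__ void kernel<char>(char *, const char *);
template __global__ void kernel<int>(int *, const int *);

// Then, from the shell:
//   nvcc -arch=sm_35 -cubin kernel.cu -o kernel.cubin
//   cuobjdump --dump-sass kernel.cubin
// and look for conversion instructions such as I2I.S32.S8 in the
// char instantiation that are absent from the int instantiation.
```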

Besides increased instruction count, negative performance impact from use of char types can also result due to lower memory bandwidth. The design of the GPU's memory subsystem is such that total achievable global memory bandwidth generally increases with the width of the accesses. One possible explanation for this is the finite depth of the internal queues that track memory accesses, but there may be other factors at work.

Where char types naturally occur due to the nature of a use case, such as image processing, one would want to look into the use of 32-bit compound types such as uchar4. The use of the wider type during load and store operations allows for improved memory bandwidth. CUDA has SIMD intrinsics for manipulating packed char data, and using those can beneficially reduce dynamic instruction count. Note that the SIMD intrinsics are fully backed by hardware only on Kepler GPUs, are fully emulated on Fermi GPUs, and are partially emulated on Maxwell GPUs. I have seen anecdotal evidence that even the emulated versions can still provide a performance benefit compared to handling each byte separately. I would suggest verifying that in the context of any particular use case.
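As a sketch of what a byte-oriented kernel using 32-bit wide accesses and a SIMD intrinsic might look like (the kernel and its names are illustrative; __vaddus4 is one of CUDA's per-byte SIMD intrinsics, performing four unsigned saturating byte additions per 32-bit word):

```cuda
// Adds two byte images, four packed bytes at a time. n is the length
// in 32-bit words, i.e. the byte count divided by 4; buffers must be
// 4-byte aligned.
__global__ void add_bytes(unsigned int *out, const unsigned int *a,
                          const unsigned int *b, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        // One 32-bit load per operand moves four bytes; __vaddus4 adds
        // them pairwise with unsigned saturation in a single operation.
        out[tid] = __vaddus4(a[tid], b[tid]);
    }
}
```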

There is also a very brief reference to this issue in section 11.1.3 of the CUDA Best Practices Guide:

The compiler must on occasion insert conversion instructions, introducing additional execution cycles. This is the case for...

  • functions operating on char or short whose operands generally need to be converted to an int.
  • ...
answered Nov 16 '22 by njuffa


Arithmetic

It's not possible to say in the generic sense whether it'll be faster, slower, or unchanged, though usually I'd not expect much difference. You're correct in saying that arithmetic for chars will be done in 32-bit, but whether this requires a type conversion will depend on the problem. In the question's example, I'd expect the compiler to store a and b in 32-bit registers, and in my experiments around this problem (note, without a full reproducing case it's hard to guarantee this) I didn't see a difference in SASS. For the region of the code where everything is done in registers, I wouldn't expect a performance hit.

There is an impact, however, when char variables are moved to and from memory. Since a char has to be converted to a 32-bit value in a register before use, this incurs additional instructions. This may or may not be a considerable impact, depending on the code.

Now, there are also some edge cases which may make a difference. The compiler might be able to pack multiple chars into a register and extract them with arithmetic (a register saving versus an arithmetic cost); you may even be able to force this behaviour using unions. Whether the saving is worth the extra instructions will vary case by case. I can't think of any other cases that would incur significant conversion overhead at the moment.

Memory

Rather obviously, if you can store your variables in 1 byte instead of 4, you're going to get a 4x saving in the memory and bandwidth required. There are things to consider though:

  1. Shared memory. Current shared memory bank sizes are either 4 bytes or 8 bytes. Unless you're reading with transactions of at least 4/8 bytes per thread, you cannot achieve peak shared memory bandwidth. There are also bank conflicts to consider with smaller transactions: a 1-byte read with a stride of the bank size will avoid these conflicts, but increases the memory required and wastes bandwidth.
  2. Global memory. The memory system is most efficient when you are able to issue large transactions: 128-bit transactions tend to be faster than 64-bit, which tend to be faster than 32-bit. For this reason it's a good idea to pack (and align) your data so that each thread can move more than one element with a single instruction.
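The packing point above can be sketched as a kernel that views a byte buffer through a wider vector type to get 128-bit transactions (an illustrative sketch; names are mine, and it assumes 16-byte-aligned pointers and a length divisible by 16):

```cuda
// Copies a byte array using 128-bit transactions: each thread issues
// one 16-byte load and one 16-byte store instead of sixteen 1-byte ones.
__global__ void copy_bytes16(unsigned char *dst, const unsigned char *src,
                             int n /* byte count, multiple of 16 */)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    uint4 *d = reinterpret_cast<uint4 *>(dst);
    const uint4 *s = reinterpret_cast<const uint4 *>(src);
    if (tid < n / 16)
        d[tid] = s[tid];
}
```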

Conclusion

I don't know of any significant reason not to use char instead of int for arithmetic where everything stays in registers, though you will pay a conversion cost when reading from and writing to memory. Storing an array as char instead of int should, if you're careful, give both a bandwidth and a space saving.

answered Nov 16 '22 by Jez