In my kernel I need to make a large number of random accesses to a small lookup table (just 8 32-bit integers). Each kernel has its own unique lookup table. Below is a simplified version of the kernel to illustrate how the lookup table is used.
__kernel void some_kernel(
    __global uint* global_table,
    __global uint* X,
    __global uint* Y,
    const uint n) {
    size_t gsi = get_global_size(0);
    size_t gid = get_global_id(0);

    // 8 words of global_table are copied to LUT
    __private uint LUT[8];
    for (int j = 0; j < 8; j++)
        LUT[j] = global_table[j];

    // Y is assigned a value from the lookup table based on the current value of X
    for (size_t i = 0; i < n; i++) {
        Y[i*gsi+gid] = LUT[X[i*gsi+gid]];
    }
}
Because of the table's small size, I am getting the best performance by keeping it in the __private memory space. However, because of the random pattern in which the lookup table is accessed, there is still a large performance hit. With the lookup-table access removed (replaced with a simple arithmetic operation, for example), the kernel gives the wrong answer, but its performance improves by a factor of more than 3.
Is there a better way? Have I overlooked some OpenCL feature that provides efficient random access for very small chunks of memory? Could there be an efficient solution using vector types?
[edit] Note that the maximum value of X is 7, but the maximum value of Y can be as large as 2^32-1. In other words, all 32 bits of each lookup table entry are used, so the table cannot be packed into a smaller representation.
The fastest solution I can think of is to not use arrays in the first place: use individual variables instead, and use some sort of access function to read them as if they were an array. IIRC (at least for the AMD compiler, but I'm pretty sure this is true for NVidia as well), arrays are generally always stored in memory, while scalars may be stored in registers. (My memory is a little fuzzy on this, so I might be wrong.)
Even if you need a giant switch statement:
// arr0123 and arr4567 are declared as locals in the kernel and passed in,
// so the compiler is free to keep them in registers.
uint getLUT(uint4 arr0123, uint4 arr4567, int x) {
    switch (x) {
        case 0: return arr0123.s0;
        case 1: return arr0123.s1;
        case 2: return arr0123.s2;
        case 3: return arr0123.s3;
        case 4: return arr4567.s0;
        case 5: return arr4567.s1;
        case 6: return arr4567.s2;
        case 7:
        default: return arr4567.s3;
    }
}
... you might still come out ahead in performance compared to a __private array, since, assuming the arr variables all fit in registers, the lookup is purely ALU-bound. (Assuming you have enough spare registers for the arr variables, of course.)
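As an aside, and only if your target supports the shuffle() built-in (OpenCL 1.1 and later), the same register-resident idea can be expressed with a vector instead of a switch. This is just a sketch, and whether the compiler actually keeps lut in registers is still implementation-dependent:

// Sketch only: lut is a uint8 local in the kernel, filled once from global
// memory (e.g. lut = vload8(0, global_table)) and passed in by value.
uint getLUT_vec(uint8 lut, uint x) {
    // shuffle() picks components of lut by index; it must return a vector,
    // so use a 2-element mask and take the first lane.
    return shuffle(lut, (uint2)(x, 0)).s0;
}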
Note that some OpenCL targets don't even have private memory; anything you declare as __private just ends up in __global memory. Using register storage is an even bigger win there.
Of course, this LUT approach is likely to be slower to initialize, since you will need at least two separate memory reads to copy the LUT data from global memory.
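For example, with the two uint4 variables above, the copy could look something like this (a sketch that assumes the table starts at global_table[0], as in the simplified kernel in the question):

// Two vector reads pull the whole 8-entry table out of global memory.
// Assumes this kernel's table starts at global_table[0].
uint4 arr0123 = vload4(0, global_table);  // elements 0..3
uint4 arr4567 = vload4(1, global_table);  // elements 4..7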
As rtollert stated, it is up to the implementation to decide whether LUT[] is placed in registers or in global memory. Normally arrays in a kernel are a no-no, but since this one is so small it is hard to say where it will end up. Assuming that LUT[] is placed in registers, I would say the reason it takes so long compared to a simple arithmetic operation isn't that it is accessed randomly, but that each work item makes an additional 8 (edit: apparently a lot more) global reads of X to compute the LUT index. Depending on what has been omitted, could you do something like Y[i*gsi+gid] = global_table[someIndex + X[i*gsi+gid]];?
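A minimal sketch of that suggestion, where someIndex stands for whatever per-work-item offset selects the right 8-entry table inside global_table (the question doesn't show how the tables are laid out, so treat it as a placeholder):

__kernel void some_kernel(
    __global uint* global_table,
    __global uint* X,
    __global uint* Y,
    const uint n,
    const uint someIndex) {  // hypothetical offset of this work item's table
    size_t gsi = get_global_size(0);
    size_t gid = get_global_id(0);

    // Index the table directly in global memory instead of copying it to a
    // __private array first; a small, heavily reused region like this should
    // mostly hit the cache on hardware that has one.
    for (size_t i = 0; i < n; i++) {
        Y[i*gsi+gid] = global_table[someIndex + X[i*gsi+gid]];
    }
}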