I'm running a mandelbrot generator (2D image from static params) on OpenCL. The program is straightforward:
__kernel
void mandelbrot(__global uchar *output,
                const float xstep,
                const float xoffset,
                const float ystep,
                const float yoffset,
                const int maxiter)
{
    int gid_y = get_global_id(1);
    int gid_x = get_global_id(0);
    // Calculate x and y on the fly for every pixel.
    // This is just as fast as reading precalculated rulers from global memory.
    float x = gid_x * xstep + xoffset;
    float y = gid_y * ystep + yoffset;
    float real = 0;
    float imag = 0;
    int out = 0;
    for (int curiter = 0; curiter < maxiter; curiter++) {
        float nreal = real * real - imag * imag + x;
        imag = 2 * real * imag + y;
        real = nreal;
        if (real * real + imag * imag > 4.0f) {
            out = curiter;
            break;
        }
    }
    // Normalize output.
    out *= 256.0 / (float)maxiter;
    output[gid_y * get_global_size(0) + gid_x] = out;
}
[EDIT] Posted the full kernel, and swapped rows and columns as suggested. This way I gained 18% performance on AMD, but 0% on NVIDIA. The original code was
output[get_global_id(0) * get_global_size(1) + get_global_id(1)] = out;
[/EDIT]
I'm running it on my Nvidia Quadro 1000M, which has 2 compute units and 96 CUDA cores (48 cores per compute unit).
I'm playing around by changing the local work-group size when enqueuing the kernel. These are the performance results I get with different sizes when generating a 400-Mpixel image. All numbers are from the OpenCL profiler and exclude the final memory copy back to the host. The image is 40992x10272; both height and width are divisible by 48.
rows x columns
8x8: 397 MPixel/s
8x12: 505 MPixel/s
8x16: 523 MPixel/s
8x24: 521 MPixel/s
8x32: 520 MPixel/s
8x48: 520 MPixel/s
1x48: 321 MPixel/s
2x32: 424 MPixel/s
2x48: 523 MPixel/s
4x24: 519 MPixel/s
3x32: 525 MPixel/s
4x32: 525 MPixel/s
4x48: 525 MPixel/s
12x8: 490 MPixel/s
12x12: 464 MPixel/s
12x24: 505 MPixel/s
12x32: 508 MPixel/s
12x48: 433 MPixel/s
16x8: 499 MPixel/s
16x12: 499 MPixel/s
16x16: 472 MPixel/s
16x24: 450 MPixel/s
16x32: 440 MPixel/s
16x48: 418 MPixel/s
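For reference, the rows and columns above are just the local work size passed when enqueuing the kernel. A minimal sketch of that host call (the variable names are placeholders, not my exact code, and error handling is omitted):

#include <CL/cl.h>

/* Sketch: enqueue the mandelbrot kernel with an explicit local work size.
 * 'columns' feeds dimension 0 (the contiguous pixel direction in the kernel),
 * 'rows' feeds dimension 1. */
cl_int enqueue_mandelbrot(cl_command_queue queue, cl_kernel kernel,
                          size_t width, size_t height,
                          size_t columns, size_t rows)
{
    size_t global[2] = { width, height };   /* e.g. 40992 x 10272 */
    size_t local[2]  = { columns, rows };   /* e.g. {32, 4} for the "4x32" entry */
    return clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                  global, local, 0, NULL, NULL);
}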
Some of the numbers in the table above leave me baffled. While it is clear why I get the best results with 48 columns (thanks to how SIMD operations work), I don't understand:
Why does performance drop off so sharply once I go to 16 rows?
Why is 1x48 so slow?
Why do I get the best results with shapes like 3x32, 4x32 and 4x48?
Why does CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE return 32 rather than 48?
Is there a non-empirical way to figure out the best geometry from what OpenCL and the device report, without trial and error?
Thanks in advance
I answered a similar question here that you might find interesting to read before the following.
Actually, performance already degrades when you use 12 rows. Memory access works by transactions: a transaction fetches a certain number of bytes in one shot. Now, if several work-items access several contiguous elements of an array, one transaction might be enough to serve them all.
Because you access the memory in this way:
output[get_global_id(0) * get_global_size(1) + get_global_id(1)] = out;
it means that the bigger the local size is in dimension 0, the more transactions are needed, because work-items that are adjacent in dimension 0 access non-contiguous elements (separated by get_global_size(1) elements). And global memory access is expensive.
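To make the difference concrete, here are the two store patterns side by side (illustrative lines, not copied verbatim from the kernel; one uchar per pixel is assumed):

// Strided (the original indexing): work-items adjacent in dimension 0 write
// elements that are get_global_size(1) apart, so each one needs its own transaction.
output[get_global_id(0) * get_global_size(1) + get_global_id(1)] = out;

// Coalesced (the swapped indexing): work-items adjacent in dimension 0 write
// adjacent elements, so a warp's stores can be combined into very few transactions.
output[get_global_id(1) * get_global_size(0) + get_global_id(0)] = out;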
So in the case of 12/16 rows, you need at least 12/16 transactions. This leads to your second question:
Based on what I've just said, it would seem that the performance should be great, since the number of transactions would be minimal.
But here comes the problem of idling threads. The information you got regarding the 48 cores per SM is wrong, as already pointed out by others. Threads are executed in groups of 32 (called warps) on NVIDIA hardware. Note that these groups are called wavefronts on AMD and can be up to 64 threads. Since you have in this case a workgroup composed of 48 threads (1 by 48), 64 threads are scheduled: the number of threads scheduled is always a multiple of 32, because you can't execute a fraction of a warp.
Therefore, in this case a quarter of the threads do nothing. And indeed, when you compare with the result you obtained for 2x32 (still 64 threads - 2 warps, but fully utilized), 321 MPixel/s is pretty much 3/4 of 424 MPixel/s.
It is also worth noting this result: 2x48: 523 MPixel/s. In this case your workgroup size is 96, a multiple of 32, so there are no idling threads.
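You can check that arithmetic with a tiny sketch (plain C; a warp size of 32 is assumed, and the sizes are the ones discussed above):

#include <stdio.h>

/* Sketch: the hardware can only issue whole warps, so a work-group size is
 * effectively rounded up to the next multiple of the warp size. */
int main(void)
{
    const int warp = 32;
    const int sizes[] = { 48, 64, 96 };   /* 1x48, 2x32, 2x48 */
    for (int i = 0; i < 3; i++) {
        int requested = sizes[i];
        int scheduled = ((requested + warp - 1) / warp) * warp;  /* round up to whole warps */
        printf("%3d threads requested -> %3d scheduled (%.0f%% utilized)\n",
               requested, scheduled, 100.0 * requested / scheduled);
    }
    return 0;
}

For 48 threads it reports 75% utilization, which is consistent with 321 being roughly 3/4 of 424.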
Well, the answer follows from the two previous ones: you use a multiple of 32, and you keep the number of threads in dimension 0 relatively small. But let's have a closer look at your results:
2x32: 424 MPixel/s
3x32: 525 MPixel/s
4x32: 525 MPixel/s
8x32: 520 MPixel/s
16x32: 440 MPixel/s
The decrease in performance for the last two lines is easily explained by what was said above. However, the increase in performance between the first and the second line is not.
In this case the increase comes from somewhere else: the fact that in the second case enough warps run on the same SM to hide the memory access latency. You see, the PREFERRED_WORK_GROUP_SIZE_MULTIPLE value only says that you should try to use a MULTIPLE of this value for best performance; several warps can be scheduled on the same SM at the same time.
So, how does it work? Let's take the 3x32 case. You have a workgroup composed of 3 warps. Since they belong to the same workgroup, they are scheduled on the same SM, as required by the OpenCL standard (if that weren't the case, synchronization between threads within a workgroup wouldn't be possible).
The first warp runs until it stalls because it needs a memory access. While warp 1 waits for the memory transaction to complete, warp 2 can start to run. Since there are a lot of registers on the SM, it can easily and quickly switch context to run other warps: all of warp 1's variables stay in the registers allocated to warp 1. Then warp 2 hits the line where a memory access is required and stalls. At that moment the next ready-to-run warp can start running; it could be warp 3, but also warp 1 if its memory access has completed. In your case it seems that it is warp 3 that runs, since there is a difference between 2x32 and 3x32: in the first case there are not enough warps scheduled to hide the memory accesses, while in the second case there are.
As a matter of fact, this also explains the bad performance of the 1x48 size from question 2.
Already answered.
It's like any other language: when you know how it works under the hood, it helps you produce good first code. But you'll still have to benchmark it, and go through a process of trial and error to tweak it. Keep in mind that what I've just written is only a small part of the things that matter for performance. Querying some info from OpenCL combined with a good understanding of the CPU/GPU will obviously help... but that's it.
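For instance, the kind of querying I mean looks roughly like this (a sketch only; the kernel and device handles are assumed to exist and error checks are omitted):

#include <stdio.h>
#include <CL/cl.h>

/* Sketch: print the work-group size hints the runtime exposes for a kernel. */
void print_size_hints(cl_kernel kernel, cl_device_id device)
{
    size_t preferred_multiple = 0, device_max = 0, kernel_max = 0;

    /* The warp/wavefront-like granularity the runtime suggests (32 here). */
    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(preferred_multiple), &preferred_multiple, NULL);
    /* Upper bounds on the work-group size, for the device and for this kernel. */
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(device_max), &device_max, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(kernel_max), &kernel_max, NULL);

    printf("preferred multiple: %zu, device max: %zu, kernel max: %zu\n",
           preferred_multiple, device_max, kernel_max);
}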
Because a lot of the parameters influencing performance are antagonistic, what you gain on one side will be lost on the other.
Therefore keep benchmarking ;).
It all depends on the code you are not showing. And that is the key.
If your code were very simple, i.e. out = 8;,
then your supposition would probably be correct.
However, as you said, the value PREFERRED_WORK_GROUP_SIZE_MULTIPLE returns 32. This means that 32 is the maximum number of concurrent threads the compute unit can launch in parallel without affecting performance. For example, there is no sense in launching more than 32 if, at 32, you already deplete the local memory storage and need to resort to global memory (which is terribly slow).
If you try to go over the recommended limit, you get exactly that: a performance decrease. It is not that 32 is better; it's the opposite: 48 is bad.
I recommend to you: