OpenCL image histogram

Question

I'm trying to write a histogram kernel in OpenCL to compute 256 bin R, G, and B histograms of an RGBA32F input image. My kernel looks like this:

const sampler_t mSampler = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP|
                           CLK_FILTER_NEAREST;


__kernel void computeHistogram(read_only image2d_t input, __global int* rOutput,
                               __global int* gOutput, __global int* bOutput)
{

    int2 coords = {get_global_id(0), get_global_id(1)};

    float4 sample = read_imagef(input, mSampler, coords);

    uchar rbin = floor(sample.x * 255.0f);
    uchar gbin = floor(sample.y * 255.0f);
    uchar bbin = floor(sample.z * 255.0f);

    rOutput[rbin]++;
    gOutput[gbin]++;
    bOutput[bbin]++;


}

When I run it on an 2100 x 894 image (1,877,400 pixels) i tend to only see in or around 1,870,000 total values being recorded when I sum up the histogram values for each channel. It's also a different number each time. I did expect this since once in a while two kernels probably grab the same value from the output array and increment it, effectively cancelling out one increment operation (I'm assuming?).

The 1,870,000 output is for a {1,1} workgroup size (which is what seems to get set by default if I don't specify otherwise). If I force a larger workgroup size like {10,6}, I get a drastically smaller sum in my histogram (proportional to the change in workgroup size). This seemed strange to me, but I'm guessing what happens is that all of the work items in the group increment the output array value at the same time, and so it just counts as a single increment?

Anyways, I've read in the spec that OpenCL has no global memory syncronization, only syncronization within local workgroups using their __local memory. The histogram example by nVidia breaks up the histogram workload into a bunch of subproblems of a specific size, computes their partial histograms, then merges the results into a single histogram after. This doesn't seem like it'll work all that well for images of arbitrary size. I suppose I could pad the image data out with dummy values...

Being new to OpenCL, I guess I'm wondering if there's a more straightforward way to do this (since it seems like it should be a relatively straightforward GPGPU problem).

Thanks!

Rick-Rainer Ludwig · Accepted Answer

As stated before, you write into a shared memory unsynchronized and non atomic. This leads to errors. If the picture is big enough, I have a suggestion:

Split your work group into a one dimensional one for cols or rows. Use each kernel to sum up the histogram for the col or row and afterwards sum it globally with atomic atom_inc. This brings the most sum ups in private memory which is much faster and reduces atomic ops.

If you work in two dimensions you can do it on parts of the picture.

[EDIT:]

I think, I have a better answer: ;-)

Have a look to: http://developer.download.nvidia.com/compute/opencl/sdk/website/samples.html#oclHistogram

They have an interesting implementation there...

Bahbar · Answer

Yes, you're writing to a shared memory from many work-items at the same time, so you will lose elements if you don't do the updates in a safe way (or worse ? Just don't do it). The increase in group size actually increases the utilization of your compute device, which in turn increases the likelihood of conflicts. So you end up losing more updates.

However, you seem to be confusing synchronization (ordering thread execution order) and shared memory updates (which typically require either atomic operations, or code synchronization and memory barriers, to make sure the memory updates are visible to other threads that are synchronized).

the synchronization+barrier is not particularly useful for your case (and as you noted is not available for global synchronization anyways. Reason is, 2 thread-groups may never run concurrently so trying to synchronize them is nonsensical). It's typically used when all threads start working on generating a common data-set, and then all start to consume that data-set with a different access pattern.

In your case, you can use atomic operations (e.g. atom_inc, see http://www.cmsoft.com.br/index.php?option=com_content&view=category&layout=blog&id=113&Itemid=168). However, note that updating a highly contended memory address (say, because you have thousands of threads trying all to write to only 256 ints) is likely to yield poor performance. All the hoops typical histogram code goes through are there to reduce the contention on the histogram data.

OpenCL image histogram

Tags:

gpgpu

gpu

opencl

wallacer

2 Answers

Rick-Rainer Ludwig

Bahbar

Recent Activity

Donate For Us

OpenCL image histogram

Tags:

gpgpu

gpu

opencl

wallacer

2 Answers

Rick-Rainer Ludwig

Bahbar

Related questions

Recent Activity

Donate For Us