Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently find minimum of large array using Opencl

I am working on the implementation of a hierarchical clustering algorithm in opencl. For each step, I have find the minimum value in a very large array (approx. 10^8 entries) so that I know which elements have to be combined into a new cluster. The identification of the minimum value must be done 9999 times. With my current kernels, it takes about 200 seconds to find the minimum value (accumulated over all iterations). How I approached the problem is by dividing the array into 2560 equally sized fragments (there are 2560 stream processors on my Radeon 7970) and to find the minimum of each fragment individually. The I run a second kernel that combines these minima into a global minimum.

It there any more efficient way to approach this problem? The initial idea was to speed up HCA by using OpenCL but with the amount of time that the identification of the minimum takes it is much longer than the matlab HCA on the CPU. What am I doing wrong?

__kernel void findMinValue(__global float * myArray, __global double * mins, __global int * elementsToWorkOn, __global int * arraysize){
int gid = get_global_id(0);
int minloc = 0;
float mymin = INFINITY;
int eltoWorkOn = *elementsToWorkOn;
int offset = gid*eltoWorkOn;
int target = offset + eltoWorkOn;

if (offset<*arraysize){
    //make sure the array size is not exceeded
    if (target > *arraysize){
        target = *arraysize;
    }

    //find minimum for the kernel
    for (int i = offset; i < target; i++){
        if (*(myArray + i) < mymin){
            mymin = *(myArray + i);
            minloc = i;
        }
    }
}
*(mins + gid * 2) = minloc;
*(mins + gid * 2 + 1) = mymin;
}


__kernel void getGlobalMin(__global double * mins, __global double * gmin, __global int * pixelsInImage){
    int nWorkitems = 2560;
    float globalMin = INFINITY;
    double globalMinLoc;
    float tempMin;
    for (int i = 0; i < nWorkitems; i++){
        tempMin = *(mins + 2 * i + 1);
        if (tempMin < globalMin){
            globalMin = tempMin;
            globalMinLoc = *(mins + 2 * i);
        }
    }
    *(gmin + 0) = globalMinLoc;
    *(gmin + 1) = globalMin;
}

UPDATE

I redesigned the findMinValue Kernel based on your suggestions. The memory access is now coalescent and I divided the work into work groups, so that I can reduce the amount of global memory accesses. Before, every kernel wrote its minimum value to the global mins buffer. Now only one kernel per worg group writes one value (i.e. the group minimum). Furthermore, I increased the global work size in order to hide memory latency.

These changes allowed to reduce the time required for identifying the minima from >200s to only 59s! Thank you very much for your help!

Is there anything else I could have missed while optimizing the kernel? Do you have any further suggestions? I could not figure out how to use setArg(). Do I have to pass a pointer to the int value to it (like this: err = clSetKernelArg(kernel[2], 3, sizeof(int), &variable);). How woudl the kernel declaration look in this case?

Here is my new Kernel:

__kernel void findMinValue(__global float * myArray, __global double * mins, __global int * arraysize,__global int * elToWorkOn,__global int * dummy){
int gid = get_global_id(0);
int lid = get_local_id(0);
int groupID = get_group_id(0);
int lsize = get_local_size(0);
int gsize = get_global_id(0);
int minloc = 0;
int arrSize = *arraysize;
int elPerGroup = *elToWorkOn;
float mymin = INFINITY;


__local float lmins[128];
//initialize local memory
*(lmins + lid) = INFINITY;
__local int lminlocs[128];

//this private value will reduce global memory access in the for loop (temp = *(myArray + i);)
float temp;

//ofset and target of the for loop
int offset = elPerGroup*groupID + lid;
int target = elPerGroup*(groupID + 1);

//prevent that target<arrsize (may happen due to rounding errors or arrSize not a multiple of elPerGroup
target = min(arrSize, target);

//find minimum for the kernel
//offset is different for each lid, leading to sequential memory access
if (offset < arrSize){
    for (int i = offset; i < target; i += lsize){
        temp = *(myArray + i);
        if (temp < mymin){
            mymin = temp;
            minloc = i;
        }
    }

    //store kernel minimum in local memory
    *(lminlocs + lid) = minloc;
    *(lmins + lid) = mymin;

    //find work group minimum (reduce global memory accesses)
    lsize = lsize >> 1;
    while (lsize > 0){
        if (lid < lsize){
            if (*(lmins + lid)> *(lmins + lid + lsize)){
                *(lmins + lid) = *(lmins + lid + lsize);
                *(lminlocs + lid) = *(lminlocs + lid + lsize);
            }
        }
        lsize = lsize >> 1;
    }
}
//write group minimum to global buffer
if (lid == 0){
    *(mins + groupID * 2 + 0) = *(lminlocs + 0);
    *(mins + groupID * 2 + 1) = *(lmins + 0);
}
}
like image 519
mTORjaeger Avatar asked Nov 01 '22 21:11

mTORjaeger


1 Answers

If each work item iterates through a global array there is ZERO coalescing of reads. If you change it so each work items strides by the warp or wavefront size then you'd get a huge speed gain.

like image 78
Dithermaster Avatar answered Nov 05 '22 16:11

Dithermaster