Can this parallelism be implemented in OpenCL

Tags:

opencl

barrier

This is my first post. I'll try to keep it short because I value your time. This community has been incredible to me.

I am learning OpenCL and want to extract a little parallelism from the algorithm below. I will only show the part I am working on, simplified as much as I can.

1) Inputs: two 1D arrays A and B of length n, the value of n, and the initial values C[0] and D[0].

2) Outputs: two 1D arrays C and D of length n.

C[i] = function1(C[i-1])
D[i] = function2(C[i-1],D[i-1])
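For concreteness, here is a plain sequential C version of the recurrence. The bodies of function1 and function2 below are trivial stand-ins of my own (the real ones are more complex); the point is the dependency structure:

```c
#include <stddef.h>

/* Stand-in bodies: the real function1/function2 are more complicated. */
static float function1(float c) { return 0.5f * c + 1.0f; }
static float function2(float c, float d) { return c + 0.5f * d; }

/* Sequential reference: step i depends only on step i-1, but within a
   step, C[i] and D[i] are independent of each other. */
void recurrence_seq(float *C, float *D, size_t n, float C0, float D0) {
    C[0] = C0;
    D[0] = D0;
    for (size_t i = 1; i < n; i++) {
        C[i] = function1(C[i - 1]);            /* could run on worker 0 */
        D[i] = function2(C[i - 1], D[i - 1]);  /* could run on worker 1 */
    }
}
```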

These are recursive definitions; however, for a given i, C[i] and D[i] can be computed in parallel (the real functions are more complicated, which is what makes parallelizing worthwhile). A naive approach would be to create two work items for the following kernel:

__kernel void test (__global float* A, __global float* B, __global float* C,
                    __global float* D, int n, float C0, float D0) {
    int i, j=get_global_id(0);

    if (j==0) {
       C[0] = C0;
       for (i=1;i<=n-1;i++) {
          C[i] = function1(C[i-1]);
          /* wait for work item 1 to finish calculating D[i] */
       }
       return;
    }
    else {
       D[0] = D0;
       for (i=1;i<=n-1;i++) {
          D[i] = function2(C[i-1],D[i-1]);
          /* wait for work item 0 to finish calculating C[i] */
       }
       return;
    }
}

Ideally each of the two work items (ids 0 and 1) would do one initial comparison and then enter its respective loop, synchronizing at each iteration. Given the SIMD execution model of GPUs, I assume this will NOT work (each work item would end up waiting through all of the kernel code), but is it possible to assign this kind of work to two CPU cores and have it behave as expected? What would the barrier be in that case?

asked Feb 05 '16 by Shoomla

1 Answer

This can be implemented in OpenCL, but as the other answer says, you're going to be limited to 2 threads at best.

My version of your kernel should be launched with a single work group containing two work items.

__kernel void test (__global float* A, __global float* B, __global float* C,
                    __global float* D, int n, float C0, float D0)
{
    int i;
    int gid = get_global_id(0);

    // Previous-step values kept in fast local memory, shared by both work items.
    local float prevC;
    local float prevD;

    if (gid == 0) {
        C[0] = prevC = C0;
        D[0] = prevD = D0;
    }

    barrier(CLK_LOCAL_MEM_FENCE);

    for (i = 1; i <= n - 1; i++) {
        if (gid == 0) {
            C[i] = function1(prevC);
        } else if (gid == 1) {
            D[i] = function2(prevC, prevD);
        }

        // Both step-i values must be complete before either work item reads them.
        barrier(CLK_LOCAL_MEM_FENCE);

        if (gid == 0) {
            prevC = C[i];
        } else if (gid == 1) {
            prevD = D[i];
        }

        // The updates must be visible before the next iteration reads prevC/prevD;
        // without this second barrier, one work item could race ahead.
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}

This should run on any OpenCL hardware. If you don't need to save all of the C and D values, you can simply return prevC and prevD as two floats rather than the entire lists. This would also make it much faster, since all reading and writing of the intermediate values stays in a lower cache level (i.e. local memory). The local memory boost should also apply on all OpenCL hardware.
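To illustrate that point in plain C: the whole computation can be carried in two rolling scalars, which play the role of prevC and prevD (function1 and function2 below are trivial stand-ins, not the real functions):

```c
#include <stddef.h>

/* Stand-in bodies for illustration only. */
static float function1(float c) { return 0.5f * c + 1.0f; }
static float function2(float c, float d) { return c + 0.5f * d; }

/* Rolling version: keeps only the previous C and D values, so no
   intermediate arrays are written back to (global) memory. */
void recurrence_last(size_t n, float C0, float D0,
                     float *lastC, float *lastD) {
    float prevC = C0, prevD = D0;
    for (size_t i = 1; i < n; i++) {
        float c = function1(prevC);
        float d = function2(prevC, prevD);
        prevC = c;
        prevD = d;
    }
    *lastC = prevC;
    *lastD = prevD;
}
```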

So is there a point to running this on a GPU? Not for the parallelism. You are stuck with 2 threads. But if you don't need all values of C and D returned, you would probably see a significant speed up because of the much faster memory of GPUs.

All of this assumes that function1 and function2 aren't overly complex. If they are, just stick to CPUs, probably with another multiprocessing technique such as OpenMP.

answered Sep 24 '22 by mfa