
Why is this OpenCL code non-deterministic?

The following Python code uses PyOpenCL to fill every element of the array a_plus_b with the sum of the elements in array b (this isn't my actual objective, but it's the simplest code I could find that still shows the problem).

import pyopencl as cl
import numpy as np
import numpy.linalg as la

height = 50
width = 32

b = np.arange(width,dtype=np.int32)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

mf = cl.mem_flags
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, height*4)

prg = cl.Program(ctx, """
    __kernel void sum(__global const int *b, __global int *c)
    {
      int x = get_global_id(1);
      int y;
      c[x] = 0;
      for(y=0;y<get_global_size(0);y++) {
          c[x] += b[y];
      }
    }
    """).build()

prg.sum(queue, (width,height), None, b_buf, dest_buf)

a_plus_b = np.empty(height,dtype=np.int32)
cl.enqueue_copy(queue, a_plus_b, dest_buf)

print(np.sum(b))
print(a_plus_b)
print(np.sum(a_plus_b-np.sum(b)))

Gives the output:

496
[496 496 496 496 496 496 496 496 496 496 496 496 496 496 496 496 496 496
 496 496 496 496 496 496 496 496 496 496 496 496 496 496 496 496 496 496
 496 496 496 496 496 496 496 496 496 496 496 496 496 496]
0

However, if I change width from 32 to 33, the output array is no longer the same value repeated in every element.

528
[555 557 555 559 560 528 560 560 528 528 528 528 528 528 528 528 528 528
 528 528 528 531 540 569 581 528 705 591 560 560 545 560 560 528 560 528
 528 528 528 528 528 528 528 528 528 528 528 532 533 535]
752

In fact, each time the code is run, it produces a different result.

528
[560 560 559 560 560 560 560 528 528 528 528 528 528 528 528 528 528 528
 528 528 528 560 528 514 565 553 621 650 560 560 560 560 560 528 528 528
 528 528 528 528 528 528 528 528 549 528 528 544 528 537]
724

What causes the difference? What aren't

rprospero asked Nov 04 '22 00:11

1 Answer

You are running WIDTH x HEIGHT work-items. For each value of x in your kernel, WIDTH work-items do exactly the same thing in parallel: they set c[x] to 0 and then update it in the y loop. All of these work-items read c[x] and write it back at more or less the same time, with no synchronization, so an update from one work-item can overwrite an update from another. This data race is the cause of the variations you observe.
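To make the race concrete, here is a small pure-Python sketch that imitates several work-items updating one shared cell. It is only a model, not how an OpenCL device actually schedules work: the function name simulate, the random scheduler, and the split of c[x] += b[y] into a separate load and store are all assumptions made for illustration.

```python
import random


def simulate(b, n_items, seed):
    """Randomly interleave n_items work-items that all update one shared cell."""
    rng = random.Random(seed)
    c = 0  # stands in for the shared global-memory cell c[x]

    def work_item():
        nonlocal c
        c = 0                  # c[x] = 0
        yield
        for value in b:
            tmp = c            # load c[x]
            yield              # another work-item may run here
            c = tmp + value    # store the (possibly stale) sum back
            yield

    running = [work_item() for _ in range(n_items)]
    while running:
        item = rng.choice(running)   # hypothetical scheduler: pick any work-item
        try:
            next(item)               # advance it by one step
        except StopIteration:
            running.remove(item)
    return c
```

With a single work-item, simulate(range(33), 1, seed) always returns 528, the true sum. With n_items=33 the result generally changes from seed to seed, because loads and stores from different work-items interleave differently on each run.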

Your algorithm is 1D: you only need to run HEIGHT work-items, and pass WIDTH as a kernel argument. Inside the kernel, accumulate into a private (register) variable, say sum, instead of c[x], and do a single c[x] = sum at the end.
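A minimal sketch of that change, reusing the host code from the question, could look like the following. The names KERNEL_SRC, sum_on_device, and total are made up for this example, and pyopencl is imported inside the function only so that the snippet can be loaded on a machine without an OpenCL runtime.

```python
import numpy as np

# 1D kernel: one work-item per output element, width passed as an argument.
KERNEL_SRC = """
__kernel void sum(__global const int *b, __global int *c, const int width)
{
    int x = get_global_id(0);   /* index of the output element */
    int total = 0;              /* private accumulator, not shared */
    for (int y = 0; y < width; y++) {
        total += b[y];
    }
    c[x] = total;               /* single write to global memory */
}
"""


def sum_on_device(b, height):
    """Fill an int32 array of length `height` with the sum of `b`."""
    import pyopencl as cl  # assumption: PyOpenCL and an OpenCL device are available

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    mf = cl.mem_flags
    b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
    dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, height * np.dtype(np.int32).itemsize)

    prg = cl.Program(ctx, KERNEL_SRC).build()
    # Launch only `height` work-items; scalar arguments must be sized NumPy types.
    prg.sum(queue, (height,), None, b_buf, dest_buf, np.int32(b.size))

    out = np.empty(height, dtype=np.int32)
    cl.enqueue_copy(queue, out, dest_buf)
    return out
```

Calling sum_on_device(np.arange(33, dtype=np.int32), 50) should then give an array in which every element equals np.sum(b), since each output element is written by exactly one work-item.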

Eric Bainville answered Nov 15 '22 06:11