I have some arrays (actually thrust::device_vectors):
MyClass1 a[N];
int b[N];
that I have to use to fill
MyClass2 c[M];
a[0] generates the first b[0] elements of c[], a[1] generates the next b[1] elements of c[], and so on. To be clear, M = b[0] + b[1] + ... + b[N-1], and any b[i] can be zero. I need to create c[] and then process all M of its elements in parallel using M threads.
Is there a way of writing a kernel that uses M threads, where each thread knows (or finds out in O(1) time) which corresponding MyClass1 in a[] it is associated to?
For example, if I have N=4 and
int b[4] = {2,0,3,5};
Then M = 2+0+3+5 = 10, so c[] is of size 10 and I need 10 threads (tid = 0...9) to process c[] in parallel, where thread tid = i creates c[i]. Here,
the thread tid=3 knows it is supposed to create c[3], but how would it know it should use a[2] and generate its second element to do so?
I would suggest computing the inclusive prefix sum of b[] into B[], which has elements
B[i] = b[0] + b[1] + ... + b[i]
Therefore M = B[N-1]. You now launch M threads, and each thread binary-searches B[] for the first index i with B[i] > tid (an upper bound, not a lower bound, so that zero-length segments with b[i] == 0 are skipped); that i is the index of the element in a[] it should use. Note that std::lower_bound/std::upper_bound cannot be called from device code; inside a kernel use thrust::upper_bound with thrust::seq, or a hand-written binary search.
(I would probably do the prefix sum on the CPU; you are probably doing something similar already to determine M.
With i found this way, tid - B[i-1] (or just tid when i == 0) gives you the "local" index of the element you are currently constructing. Note that tid - B[i] would be correct only if B[] were an exclusive scan.)