I have installed the Theano library to speed up a computation, so that I can use the power of a GPU.
However, inside the inner loop of the computation a new index is calculated, based on the loop index and corresponding values of a couple of arrays.
That calculated index is then used to access an element of another array, which, in turn, is used for another calculation.
Is this too complicated to expect any significant speedups from Theano?
So let me rephrase my question the other way round. Here is an example GPU code snippet (some initialisations are left out for brevity). Can I translate this to Python/Theano without increasing computation times considerably?
__global__ void SomeKernel(const cuComplex* __restrict__ data,
                           float* __restrict__ voxels)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int idy = blockIdx.y * blockDim.y + threadIdx.y;
    unsigned int pos = (idy * NX + idx);
    unsigned int ind1 = pos * 3;
    float x = voxels[ind1];
    float y = voxels[ind1 + 1];
    float z = voxels[ind1 + 2];

    for (int m = 0; m < M; ++m)
    {
        unsigned int ind2 = 3 * m;
        float diff_x = x - some_pos[ind2];
        float diff_y = y - some_pos[ind2 + 1];
        float diff_z = z - some_pos[ind2 + 2];
        float distance = sqrtf(diff_x * diff_x
                             + diff_y * diff_y
                             + diff_z * diff_z);
        unsigned int dist = rintf(distance / some_factor);
        unsigned int ind3 = m * another_factor + dist;
        cuComplex some_element = data[ind3];
        // Main calculation starts here, involving some_element.
    }
}
No, I see nothing here that cannot be done using tensors instead of a for-loop. That should mean you may see a speedup, but how much will really depend on the application. There is also the overhead of Python + Theano, especially compared to C-like code.
So, instead of
for (int m = 0; m < M; ++m)
{
    unsigned int ind2 = 3 * m;
    float diff_x = x - some_pos[ind2];
    float diff_y = y - some_pos[ind2 + 1];
    float diff_z = z - some_pos[ind2 + 2];
    float distance = sqrtf(diff_x * diff_x
                         + diff_y * diff_y
                         + diff_z * diff_z);
    unsigned int dist = rintf(distance / some_factor);
    unsigned int ind3 = m * another_factor + dist;
    cuComplex some_element = data[ind3];
}
you could do something like this (off the top of my head):
# stack the voxel coordinates into a (1, 3) row and broadcast against the (M, 3) positions
diff_xyz = T.stack([x, y, z]).dimshuffle('x', 0) - some_pos.reshape((-1, 3))
distance = T.sqrt(T.sum(diff_xyz ** 2, axis=1))           # Euclidean distance per m
dist = T.cast(T.round(distance / some_factor), 'int32')   # integer index per m
rows = data.reshape((-1, another_factor))                  # row m holds indices m * another_factor + ...
some_elements = rows[T.arange(dist.shape[0]), dist]        # one element per m
See? No more loops, so a GPU can parallelize this.
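For completeness, here is roughly how that sketch could be wired into a compiled Theano function. This is only a minimal, hypothetical example: some_factor and another_factor are hard-coded placeholder constants, and the complex data array is replaced by a float vector for simplicity.

import theano
import theano.tensor as T

# Symbolic inputs; names mirror the snippet above.
x, y, z = T.scalars('x', 'y', 'z')   # coordinates of one voxel
some_pos = T.vector('some_pos')      # flat (3 * M,) array of positions
data = T.vector('data')              # flat data array (float here for simplicity; the real data is complex)
some_factor = 0.5                    # placeholder constant
another_factor = 128                 # placeholder constant

diff_xyz = T.stack([x, y, z]).dimshuffle('x', 0) - some_pos.reshape((-1, 3))
distance = T.sqrt(T.sum(diff_xyz ** 2, axis=1))
dist = T.cast(T.round(distance / some_factor), 'int32')
rows = data.reshape((-1, another_factor))
some_elements = rows[T.arange(dist.shape[0]), dist]

f = theano.function([x, y, z, some_pos, data], some_elements)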
However, inside the inner loop of the computation a new index is calculated, based on the loop index and corresponding values of a couple of arrays. (...) Is this too complicated to expect any significant speedups from Theano?
In general: this can be optimized by using tensors instead of loops, as long as the loop index has a linear relation with the index you need. It does, however, take a bit of creativity and massaging to get right.
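As a purely illustrative sketch (offset and stride are made-up names for the constants of such a linear relation), the loop disappears into a single vectorised gather:

import theano.tensor as T

data = T.vector('data')
M = 100                  # loop length, as in the CUDA snippet
offset, stride = 7, 3    # made-up constants for a linear index rule

# Instead of: for m in range(M): out[m] = data[offset + stride * m]
indices = offset + stride * T.arange(M)
out = data[indices]      # one vectorised gather, no Python loop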
Non-linear relations are also possible using Tensor.take(), but I don't dare vouch for its speed on the GPU. My gut feeling has always told me to stay away from it, as it is probably too flexible to optimize nicely. It can be used, though, when there is no alternative.
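For illustration only (variable names are made up), take() on a flattened tensor can express an arbitrary per-row lookup of the data[i, dist[i]] kind:

import theano.tensor as T

data = T.matrix('data')
dist = T.ivector('dist')   # arbitrary, possibly non-linear, integer indices

# gather data[i, dist[i]] by flattening and computing flat indices
flat_indices = T.arange(dist.shape[0]) * data.shape[1] + dist
some_elements = data.flatten().take(flat_indices)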
GPUs aren't great at random memory access when they are working out of global memory. I've not used Theano before, but if your arrays all fit in local memory, this would be fast, since random accesses aren't a problem there. If it is global memory, though, the performance is hard to anticipate, but it would be a far cry from the GPU's full power. On another note, is this computation even parallelizable? GPUs only really do well when a lot of these things are going on concurrently.