Here's some Python code that implements a sliding-window computation on two 3D matrices, X and Y.
import numpy
def sliding_dot(X, Y):
    # X is the large array, Y the window; both are 3D with matching third dimension.
    assert X.ndim == Y.ndim == 3
    iw, ih, id = X.shape
    fw, fh, fd = Y.shape
    assert id == fd
    assert fw < iw and fh < ih
    # One output value per valid window position in the first two dimensions.
    ow, oh = iw - fw + 1, ih - fh + 1
    out = numpy.zeros([ow, oh])
    for x in range(ow):
        for y in range(oh):
            window = X[x:x+fw, y:y+fh, :]
            out[x, y] = numpy.dot(window.flatten(), Y.flatten())
    return out
#################
A_dims = (640,480,32)
B_dims = (6,6,32)
A = numpy.random.rand(*A_dims)
B = numpy.random.rand(*B_dims)
sliding_dot(A,B)
In general, Y is always much smaller than X along the first and second dimensions, but they are equal in the third dimension.
Note that we could replace numpy.dot() with any function of Y and the window. This is a little different from convolution in that Y only slides along the first and second dimensions of X. I'm looking for an effective strategy for implementing this kind of sliding-window computation efficiently in CUDA. Anybody want to offer me some direction? Cheers!
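To make the "any function of Y and the window" point concrete, here is a minimal sketch of that generalization; the name sliding_apply and its window_fn argument are purely illustrative and not part of the code above.

import numpy

def sliding_apply(X, Y, window_fn):
    # Same traversal as sliding_dot above, but the per-window reduction is pluggable.
    assert X.ndim == Y.ndim == 3 and X.shape[2] == Y.shape[2]
    iw, ih = X.shape[:2]
    fw, fh = Y.shape[:2]
    ow, oh = iw - fw + 1, ih - fh + 1
    out = numpy.zeros([ow, oh])
    for x in range(ow):
        for y in range(oh):
            out[x, y] = window_fn(X[x:x+fw, y:y+fh, :], Y)
    return out

# Equivalent to sliding_dot:
#   sliding_apply(A, B, lambda w, y: numpy.dot(w.flatten(), y.flatten()))
# Any other function of the window and Y slots in the same way, e.g.:
#   sliding_apply(A, B, lambda w, y: numpy.abs(w - y).sum())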
Update: You can watch me work through the optimization process, with help from other users, in my answer below.
Trying to design a "generalised" implementation that could accommodate just about any operation you might want is going to be an enormous trade-off in an architecture like CUDA. For your concrete dot product example, which is a typical reduction operation, this is a pretty useful implementation:
__constant__ int ldaX[3];
__constant__ int ldaY[3];
__constant__ int dimX[3];
__constant__ int dimY[3];

template<typename real, int blocksize>
__global__ void sliding_k(const real *X, const real *Y, real *out)
{
    __shared__ volatile real buffer[blocksize];
    int tid = threadIdx.x;
    int gid = blockIdx.x * gridDim.y + blockIdx.y;
    real value = (real)0;

    // One block per window position: blockIdx.x/blockIdx.y select the window,
    // and the inner two window dimensions are walked as a single contiguous
    // run of ldaY[1] words because the third dimensions of X and Y match.
    int xpos = (blockIdx.y * ldaX[2]) + (blockIdx.x * ldaX[1]);
    int ypos = 0;
    for(int i=0; i<dimY[0]; i++) {
        for(int jk=tid; jk<ldaY[1]; jk+=blocksize) {
            value += X[xpos+jk] * Y[ypos+jk];
        }
        xpos += ldaX[1];
        ypos += ldaY[1];
    }

    // Shared memory reduction of the per-thread partial sums.
    buffer[tid] = value;
    __syncthreads();

#pragma unroll
    for(int i=(tid+32); ((tid<32)&&(i<blocksize)); i+=32)
        buffer[tid] += buffer[i];

    if (tid < 16) buffer[tid] += buffer[tid + 16];
    if (tid < 8)  buffer[tid] += buffer[tid + 8];
    if (tid < 4)  buffer[tid] += buffer[tid + 4];
    if (tid < 2)  buffer[tid] += buffer[tid + 2];
    if (tid == 0) out[gid] = buffer[0] + buffer[1];
}
You could substitute any kind of reduction operator you like for the floating-point multiply/add that a dot product uses, and the code should still work. Each window calculation is performed by a single block; at this window size there is enough parallel work to justify a block per window. This allows coalesced global memory access and, on Fermi cards, a good number of L1 cache hits.
I have built only one assumption into the code: that the third dimensions of the source array and the window array are equal. This allows the inner two loops to be "fused" into a single operation because of the common memory layout they share. Running a test harness in Python against an improved version of your reference code, with the host code written in PyCUDA, I get this:
In [15]: %timeit -n3 -r3 out2=sliding_cuda(A,B)
3 loops, best of 3: 49.8 ms per loop
In [16]: %timeit -n3 -r3 out=sliding_dot(A,B)
3 loops, best of 3: 2.18 s per loop
In [17]: (numpy.abs(out2-out)/numpy.abs(out)).max()
Out[17]: 4.2921323635558404e-15
when run on a 3GHz Phenom II with a GTX470, using 64-thread blocks on a 635x475 2D grid, i.e. about a 50x speed-up including module loading, setup, and memory transfers using pageable host memory allocations. The kernel itself is about 100 times faster than the Python, not including memory transfers and setup overhead. Note that this is a double precision version; Python uses double precision floating point arithmetic by default.
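The host harness itself isn't reproduced above, so here is a rough sketch of how it might be wired up in PyCUDA. Everything in it is an assumption read off the kernel's indexing rather than the original code: the hand-instantiated double precision, blocksize 64 copy of the kernel (which sidesteps C++ name mangling in SourceModule), the sliding_cuda wrapper, and in particular the stride values written into ldaX and ldaY.

import numpy
import pycuda.autoinit                      # creates a context on the default device
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# Hand-instantiated copy of the sliding_k template above with real = double
# and blocksize = 64, so SourceModule's default extern "C" wrapping applies
# and the kernel can be fetched by its plain name.
kernel_source = """
__constant__ int ldaX[3];
__constant__ int ldaY[3];
__constant__ int dimX[3];
__constant__ int dimY[3];

__global__ void sliding_k(const double *X, const double *Y, double *out)
{
    const int blocksize = 64;
    __shared__ volatile double buffer[blocksize];
    int tid = threadIdx.x;
    int gid = blockIdx.x * gridDim.y + blockIdx.y;
    double value = 0.;
    int xpos = (blockIdx.y * ldaX[2]) + (blockIdx.x * ldaX[1]);
    int ypos = 0;
    for(int i=0; i<dimY[0]; i++) {
        for(int jk=tid; jk<ldaY[1]; jk+=blocksize) {
            value += X[xpos+jk] * Y[ypos+jk];
        }
        xpos += ldaX[1];
        ypos += ldaY[1];
    }
    buffer[tid] = value;
    __syncthreads();
    for(int i=(tid+32); ((tid<32)&&(i<blocksize)); i+=32)
        buffer[tid] += buffer[i];
    if (tid < 16) buffer[tid] += buffer[tid + 16];
    if (tid < 8)  buffer[tid] += buffer[tid + 8];
    if (tid < 4)  buffer[tid] += buffer[tid + 4];
    if (tid < 2)  buffer[tid] += buffer[tid + 2];
    if (tid == 0) out[gid] = buffer[0] + buffer[1];
}
"""

mod = SourceModule(kernel_source)
sliding_k = mod.get_function("sliding_k")

def sliding_cuda(X, Y):
    iw, ih, id = X.shape
    fw, fh, fd = Y.shape
    ow, oh = iw - fw + 1, ih - fh + 1

    # Element strides as the kernel's indexing appears to expect them:
    # ldaX[1]/ldaX[2] are the strides of X's first/second dimensions, and
    # ldaY[1] is the length of one contiguous fh*fd slab of the window.
    # dimX is declared by the kernel but never read, so it is left alone.
    constants = {
        "ldaX": numpy.array([iw * ih * id, ih * id, id], dtype=numpy.int32),
        "ldaY": numpy.array([fw * fh * fd, fh * fd, fd], dtype=numpy.int32),
        "dimY": numpy.array([fw, fh, fd], dtype=numpy.int32),
    }
    for name, value in constants.items():
        ptr, _ = mod.get_global(name)
        drv.memcpy_htod(ptr, value)

    # One 64-thread block per window position on an ow x oh grid; the block
    # size must match the blocksize compiled into the kernel string above.
    out = numpy.zeros((ow, oh), dtype=numpy.float64)
    sliding_k(drv.In(X), drv.In(Y), drv.Out(out),
              block=(64, 1, 1), grid=(ow, oh))
    return out

With the 640x480x32 source and 6x6x32 window from the question, grid=(ow, oh) gives exactly the 635x475 launch mentioned above.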