Implementing Neural Network using CUDA

Question

I am trying to create a Neural Network using CUDA:

My kernel looks like this:

__global__ void feedForward(float *input, float *output, float **weight) {

//Here the threadId uniquely identifies weight in a neuron
int weightIndex = threadIdx.x;

//Here the blockId uniquely identifies a neuron
int neuronIndex = blockIdx.x;

if(neuronIndex<NO_OF_NEURONS && weightIndex<NO_OF_WEIGHTS)
output[neuronIndex] += weight[neuronIndex][weightIndex]
        * input[weightIndex];
}

While copying the output back to host, I'm getting an error

Error unspecified launch failure at line xx

At line xx :

CUDA_CHECK_RETURN(cudaMemcpy(h_output, d_Output, output_size, cudaMemcpyDeviceToHost));

Am I doing something wrong here?

Is it because of how I'm using both the block index and the thread index to reference the weight matrix, or does the problem lie elsewhere?

I'm allocating the weight matrix as follows:

cudaMallocPitch((void**)&d_Weight, &pitch_W,input_size,NO_OF_NEURONS);

My kernel call is:

feedForward<<<NO_OF_NEURONS,NO_OF_WEIGHTS>>>(d_Input,d_Output,d_Weight);

After that I call: cudaThreadSynchronize();

I am new to programming with CUDA. Any help would be appreciated.

Thanks

Oleg Titov · Accepted Answer

There is a problem in the code that writes the output. Though it won't produce the error described, it will produce incorrect results.

int neuronIndex = blockIdx.x;

if(neuronIndex<NO_OF_NEURONS && weightIndex<NO_OF_WEIGHTS)
output[neuronIndex] += weight[neuronIndex][weightIndex] * input[weightIndex];

We can see that all threads in a single block write concurrently to the same memory cell, so the result is undefined. To avoid this, I suggest reducing all values within a block in shared memory and performing a single write to global memory. Something like this:

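__global__ void feedForward(float *input, float *output, float **weight) {

  int weightIndex = threadIdx.x;
  int neuronIndex = blockIdx.x;
  // One partial product per thread; the block is launched with NO_OF_WEIGHTS threads
  __shared__ float out_reduce[NO_OF_WEIGHTS];

  out_reduce[weightIndex] = 
     (weightIndex<NO_OF_WEIGHTS && neuronIndex<NO_OF_NEURONS) ? 
       weight[neuronIndex][weightIndex] * input[weightIndex]
       : 0.0f;
  __syncthreads();

  // Tree reduction; this form assumes NO_OF_WEIGHTS is a power of two.
  // Start at half the array so out_reduce[weightIndex + s] stays in bounds.
  for (int s = NO_OF_WEIGHTS / 2; s > 0 ; s >>= 1)
  {
    if (weightIndex < s) out_reduce[weightIndex] += out_reduce[weightIndex + s];
    __syncthreads();
  }

  // Only one thread per block writes the neuron's sum to global memory
  if (weightIndex == 0) output[neuronIndex] += out_reduce[weightIndex]; 
}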

It turned out that I had to rewrite half of your small kernel to help with the reduction code...
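
On the launch failure itself, one likely cause (not confirmed anywhere in this thread) is the weight argument. cudaMallocPitch returns a single flat, padded allocation, not an array of row pointers, so passing d_Weight as float ** and indexing weight[neuronIndex][weightIndex] makes the kernel read weight values as if they were device addresses. Kernel launches are asynchronous, so such a fault is typically reported by the next blocking call, which here is the cudaMemcpy. The sketch below shows one way to index pitched memory together with the shared-memory reduction. The names feedForwardPitched and runFeedForward and the sizes 4 and 8 are hypothetical, NO_OF_WEIGHTS is assumed to be a power of two, and cudaDeviceSynchronize is used in place of the deprecated cudaThreadSynchronize.

```cuda
#include <cuda_runtime.h>

// Hypothetical sizes; the question does not state the real values.
constexpr int NO_OF_NEURONS = 4;
constexpr int NO_OF_WEIGHTS = 8;  // assumed to be a power of two

// weight is the flat base pointer from cudaMallocPitch; pitch is the row stride in bytes.
// Assumes a launch of exactly <<<NO_OF_NEURONS, NO_OF_WEIGHTS>>>.
__global__ void feedForwardPitched(const float *input, float *output,
                                   const float *weight, size_t pitch) {
  int weightIndex = threadIdx.x;
  int neuronIndex = blockIdx.x;
  __shared__ float partial[NO_OF_WEIGHTS];

  // Locate this neuron's row by stepping neuronIndex * pitch bytes from the base.
  const float *row = (const float *)((const char *)weight + neuronIndex * pitch);
  partial[weightIndex] = row[weightIndex] * input[weightIndex];
  __syncthreads();

  // Tree reduction in shared memory.
  for (int s = NO_OF_WEIGHTS / 2; s > 0; s >>= 1) {
    if (weightIndex < s) partial[weightIndex] += partial[weightIndex + s];
    __syncthreads();
  }

  // Single write per block; assigns rather than accumulates, so output needs no zeroing.
  if (weightIndex == 0) output[neuronIndex] = partial[0];
}

// Hypothetical host helper. h_weight is row-major, NO_OF_NEURONS x NO_OF_WEIGHTS.
// Returns true if every CUDA call succeeded and h_output was filled.
bool runFeedForward(const float *h_input, const float *h_weight, float *h_output) {
  float *d_input = nullptr, *d_output = nullptr, *d_weight = nullptr;
  size_t pitch = 0;
  const size_t rowBytes = NO_OF_WEIGHTS * sizeof(float);
  const size_t outBytes = NO_OF_NEURONS * sizeof(float);

  bool ok =
      cudaMalloc((void **)&d_input, rowBytes) == cudaSuccess &&
      cudaMalloc((void **)&d_output, outBytes) == cudaSuccess &&
      cudaMallocPitch((void **)&d_weight, &pitch, rowBytes, NO_OF_NEURONS) == cudaSuccess &&
      cudaMemcpy(d_input, h_input, rowBytes, cudaMemcpyHostToDevice) == cudaSuccess &&
      // The 2D copy respects the padding at the end of each device row.
      cudaMemcpy2D(d_weight, pitch, h_weight, rowBytes, rowBytes, NO_OF_NEURONS,
                   cudaMemcpyHostToDevice) == cudaSuccess;

  if (ok) {
    feedForwardPitched<<<NO_OF_NEURONS, NO_OF_WEIGHTS>>>(d_input, d_output, d_weight, pitch);
    // Check the launch first, then wait so kernel faults surface here, not in the memcpy.
    ok = cudaGetLastError() == cudaSuccess &&
         cudaDeviceSynchronize() == cudaSuccess &&
         cudaMemcpy(h_output, d_output, outBytes, cudaMemcpyDeviceToHost) == cudaSuccess;
  }

  cudaFree(d_input);
  cudaFree(d_output);
  cudaFree(d_weight);
  return ok;
}
```

Calling runFeedForward with a host input vector and a row-major host weight array fills h_output with one dot product per neuron. Checking cudaGetLastError right after the launch and synchronizing before the copy makes it easier to tell a kernel fault apart from a problem in the copy itself.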

Tags:

artificial-intelligence

neural-network

cuda

gpgpu

Shayan RC

1 Answers

Oleg Titov
