C++ and CUDA: why does the code return different results each time?

Question

Update: I found the bug. The code I originally posted was very complicated, so I have simplified it and kept only the part where the problem is. The guard at the top of the kernel was:

if (number >= dim * num_points)
    return;

However, I only have num_points points and want to use num_points threads, so the correct guard is:

if (number >= num_points)
    return;

Thank you all for the help.

I'm porting some C++ code from the CPU to the GPU. The code is pasted below. Sorry that it's long; I think the problem is easier to spot this way.

In the code, every thread needs some intermediate results in matrix form, so I allocate device memory for them: d_dir2, d_R, d_Stick and d_PStick. The results were not what I expected, so to debug I tried to output the intermediate result R like this:

if (k == 0)
{
    results[tmp_int1 + i * dim + j] = R[tmp_int1 + i * dim + j];
}

Later, in the C++ host code, I print results. However, results holds different values on each run: sometimes the correct answer R, sometimes the values of PStick, sometimes a mix of R and PStick, and sometimes a mix of R and 0 (results is initialized to 0 at the beginning).

I'm very confused about what causes this. Any ideas? Thank you very much :)

__global__ void stickvote(const int dim, const int num_points, const int gridx, float Sigma, float* input, float* dir2, float* R, float* Stick, float* PStick, float* results) {
  float threshold = 4 * Sigma;
  float c = (- 16 * log(0.1f) * (sqrt(Sigma) - 1)) / 3.1415926f / 3.1415926f;

  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  int number = row * BLOCK_SIZE * gridx + col;

  if (number >= dim * num_points)  // The bug is here: the bound should be num_points
    return;
}


extern "C" void KernelStickVote(int dim, int num_points, float Sigma, float* input, float* results) {
  const int totalpoints = num_points;
  const int totalpoints_input = (dim + 1)* (dim + 1) * num_points;
  const int totalpoints_output = dim * dim * num_points;
  size_t size_input = totalpoints_input * sizeof(float);
  size_t size_output = totalpoints_output * sizeof(float);

  float* d_input;
  cutilSafeCall(cudaMalloc((void**)&d_input, size_input));

  float* d_result;
  cutilSafeCall(cudaMalloc((void**)&d_result, size_output));

  // used to save dir, and calculate dir * dir'
  float* d_dir2;
  cutilSafeCall(cudaMalloc((void**)&d_dir2, dim * num_points * sizeof(float)));

  // used to save R: dim * dim * N
  float* d_R;
  cutilSafeCall(cudaMalloc((void**)&d_R, size_output));

  // used to save Stick: dim * dim * N
  float* d_Stick;
  cutilSafeCall(cudaMalloc((void**)&d_Stick, size_output));

  // used to save PStick: dim * dim * N
  float* d_PStick;
  cutilSafeCall(cudaMalloc((void**)&d_PStick, size_output));

  // Copy input data from host to device
  cudaMemcpy(d_input, input, size_input, cudaMemcpyHostToDevice);

  int totalblock = (totalpoints % BLOCKPOINTS==0 ? totalpoints/BLOCKPOINTS : (int(totalpoints/BLOCKPOINTS) + 1));
  int gridx = (65535 < totalblock ? 65535 : totalblock);
  int gridy = (totalblock % gridx == 0 ? totalblock/gridx : (int(totalblock/gridx)+1) );
  dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
  dim3 dimGrid(gridx, gridy);

  stickvote<<<dimGrid, dimBlock>>>(dim, num_points, gridx, Sigma, d_input, d_dir2, d_R, d_Stick, d_PStick, d_result);
  cudaMemcpy(results, d_result, size_output, cudaMemcpyDeviceToHost);

  cudaFree(d_input);
  cudaFree(d_result);
  cudaFree(d_dir2);
  cudaFree(d_R);
  cudaFree(d_Stick);
  cudaFree(d_PStick);
}

talonmies · Accepted Answer

The original poster simplified and debugged the code further and discovered that the guard statement in the kernel:

if (number >= dim * num_points)
    return;

was, in fact, incorrect and should have been

if (number >= num_points)
    return;

This was the source of the error.
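To get a feel for how far the two guards differ, the kernel's index arithmetic can be replayed on the host. The following is only a sketch in plain C++, not the original kernel. kBlockSize and kBlockPoints are hypothetical stand-ins for BLOCK_SIZE and BLOCKPOINTS, whose values the question does not show, and it assumes that BLOCKPOINTS equals BLOCK_SIZE * BLOCK_SIZE and that each thread works on a dim * dim slice starting at number * dim * dim. Under those assumptions it counts how many launched threads pass each guard and how many of them would start beyond the end of a dim * dim * num_points buffer.

```cpp
#include <cassert>

// Hypothetical stand-ins: the question does not show BLOCK_SIZE or BLOCKPOINTS.
constexpr int kBlockSize = 16;
constexpr int kBlockPoints = kBlockSize * kBlockSize;  // assumed: one point per thread

struct GuardReport {
    long long launched;       // threads in the whole grid
    long long active;         // threads that get past the guard
    long long out_of_bounds;  // active threads whose slice starts past the buffer end
};

// Replays the index arithmetic of stickvote on the host for one launch.
GuardReport replay_guard(int dim, int num_points, bool buggy_guard) {
    assert(dim > 0 && num_points > 0);

    // Same grid shape as KernelStickVote, written as ceiling divisions.
    const int totalblock = (num_points + kBlockPoints - 1) / kBlockPoints;
    const int gridx = totalblock < 65535 ? totalblock : 65535;
    const int gridy = (totalblock + gridx - 1) / gridx;

    const long long limit = buggy_guard
        ? static_cast<long long>(dim) * num_points  // original guard
        : static_cast<long long>(num_points);       // corrected guard
    // Assumed buffer size: dim * dim floats per point, as for d_R.
    const long long buffer_floats = static_cast<long long>(dim) * dim * num_points;

    GuardReport report{0, 0, 0};
    for (int row = 0; row < gridy * kBlockSize; ++row) {
        for (int col = 0; col < gridx * kBlockSize; ++col) {
            const long long number =
                static_cast<long long>(row) * kBlockSize * gridx + col;
            ++report.launched;
            if (number >= limit) continue;  // the guard under test
            ++report.active;
            if (number * dim * dim >= buffer_floats) ++report.out_of_bounds;
        }
    }
    return report;
}
```

With dim = 3 and num_points = 1000, for example, the grid is padded to a whole number of blocks, and the original guard lets the padding threads through. If those threads write to their slices, they land outside the intended range, possibly in a neighbouring allocation, which would be one plausible reason for R, PStick and zeros showing up in results in different proportions from run to run.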

This answer has been added as a community wiki answer with the intention of removing this question from the unanswered queue.

Tags: c++, cuda

user1834981
