 

Simple adding of two ints in CUDA, result always the same

Tags:

cuda

I'm starting my journey to learn CUDA. I am playing with some hello-world-type CUDA code, but it's not working, and I'm not sure why.

The code is very simple: take two ints, add them on the GPU, and return the result. But no matter what I change the numbers to, I get the same result. (If math worked that way, I would have done a lot better in the subject than I actually did.)

Here's the sample code:

// CUDA-C includes
#include <cuda.h>
#include <stdio.h>

__global__ void add( int a, int b, int *c ) {
    *c = a + b;
}

extern "C"
void runCudaPart();

// Main cuda function

void runCudaPart() {

    int c;
    int *dev_c;

    cudaMalloc( (void**)&dev_c, sizeof(int) );
    add<<<1,1>>>( 1, 4, dev_c );

    cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );

    printf( "1 + 4 = %d\n", c );
    cudaFree( dev_c );

}

The output seems a bit off: 1 + 4 = -1065287167

I'm working on setting up my environment and just wanted to know if there is a problem with the code; otherwise, it's probably my environment.

Update: I tried to add some code to show the error, but I don't get any error output, yet the number changes (is it outputting error codes instead of answers?). Even if I don't do any work in the kernel other than assign a variable, I still get similar results.

// CUDA-C includes
#include <cuda.h>
#include <stdio.h>

__global__ void add( int a, int b, int *c ) {
    //*c = a + b;
    *c = 5;
}

extern "C"
void runCudaPart();

// Main cuda function

void runCudaPart() {

    int c;
    int *dev_c;

    cudaError_t err = cudaMalloc( (void**)&dev_c, sizeof(int) );
    if(err != cudaSuccess){
         printf("The error is %s", cudaGetErrorString(err));
    }
    add<<<1,1>>>( 1, 4, dev_c );

    cudaError_t err2 = cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );
    if(err2 != cudaSuccess){
         printf("The error is %s", cudaGetErrorString(err));
    }


    printf( "1 + 4 = %d\n", c );
    cudaFree( dev_c );

}

The code appears to be fine, so maybe it's related to my setup. It's been a nightmare to get CUDA installed on OS X Lion, but I thought it worked, since the examples in the SDK seemed to be fine. The steps I took so far: I went to the Nvidia website and downloaded the latest Mac releases of the driver, toolkit, and SDK. I then added

export DYLD_LIBRARY_PATH=/usr/local/cuda/lib:$DYLD_LIBRARY_PATH
PATH=/usr/local/cuda/bin:$PATH

to my environment. I ran deviceQuery and it passed, with the following info about my system:

[deviceQuery] starting...

/Developer/GPU Computing/C/bin/darwin/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Found 1 CUDA Capable device(s)

Device 0: "GeForce 320M"
  CUDA Driver Version / Runtime Version          4.2 / 4.2
  CUDA Capability Major/Minor version number:    1.2
  Total amount of global memory:                 253 MBytes (265027584 bytes)
  ( 6) Multiprocessors x (  8) CUDA Cores/MP:    48 CUDA Cores
  GPU Clock rate:                                950 MHz (0.95 GHz)
  Memory Clock rate:                             1064 Mhz
  Memory Bus Width:                              128-bit
  Max Texture Dimension Size (x,y,z)             1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
  Max Layered Texture Size (dim) x layers        1D=(8192) x 512, 2D=(8192,8192) x 512
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Concurrent copy and execution:                 Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   No
  Alignment requirement for Surfaces:            Yes
  Device has ECC support enabled:                No
  Device is using TCC driver mode:               No
  Device supports Unified Addressing (UVA):      No
  Device PCI Bus ID / PCI location ID:           4 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.2, CUDA Runtime Version = 4.2, NumDevs = 1, Device = GeForce 320M
[deviceQuery] test results...
PASSED

UPDATE: What's really weird is that even if I remove all the work in the kernel, I still get a result for c. I have reinstalled CUDA and used make on the examples, and all of them pass.

asked May 09 '12 by Lostsoul


1 Answer

Basically there are two problems here:

  1. You are not compiling the kernel for the correct architecture (gleaned from comments).
  2. Your code contains incomplete error checking that misses the point at which the runtime error occurs, leading to mysterious and unexplained symptoms.

In the runtime API, most context-related actions are performed "lazily". When you launch a kernel for the first time, the runtime API invokes code to find a suitable CUBIN image for the target hardware inside the fat binary image emitted by the toolchain, and loads it into the context. This can also include JIT recompilation of PTX for a backwards-compatible architecture, but not the other way around: if you had a kernel compiled for a compute capability 1.2 device and you run it on a compute capability 2.0 device, the driver can JIT compile the PTX 1.x code it contains for the newer architecture, but the reverse doesn't work. So in your example, the runtime API generates an error because it cannot find a usable binary image in the CUDA fatbinary embedded in the executable. The error message is pretty cryptic, but you will get an error (see this question for a bit more information).
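For the first problem, the fix is to build for (or include PTX no newer than) the architecture of the device you run on. As a hedged sketch, assuming the code above lives in a single file named add.cu (that file name is hypothetical) and targeting the GeForce 320M from the deviceQuery output above (compute capability 1.2), the compile step would look something like:

nvcc -arch=sm_12 add.cu -o add

Passing several -gencode options instead embeds code for multiple architectures in the fat binary, which lets the same executable run on newer devices as well.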

If your code contained error checking like this:

cudaError_t err = cudaMalloc( (void**)&dev_c, sizeof(int) );
if(err != cudaSuccess){
     printf("The error is %s", cudaGetErrorString(err));
}

add<<<1,1>>>( 1, 4, dev_c );
if (cudaPeekAtLastError() != cudaSuccess) {
    printf("The error is %s", cudaGetErrorString(cudaGetLastError()));
}

cudaError_t err2 = cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );
if(err2 != cudaSuccess){
     printf("The error is %s", cudaGetErrorString(err));
}

the extra error checking after the kernel launch should catch the runtime API error generated by the kernel load/launch failure.
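In practice it is easier to wrap this pattern in a small macro than to hand-write the check after every call. Below is a minimal, self-contained sketch of the same program with that style of checking; the CUDA_CHECK macro name and layout are a common convention, not part of the CUDA API:

#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

// Check the result of a runtime API call and abort with a readable
// message on failure.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t e = (call);                                    \
        if (e != cudaSuccess) {                                    \
            printf("CUDA error %s:%d: %s\n",                       \
                   __FILE__, __LINE__, cudaGetErrorString(e));     \
            exit(1);                                               \
        }                                                          \
    } while (0)

__global__ void add( int a, int b, int *c ) {
    *c = a + b;
}

int main(void) {
    int c;
    int *dev_c;

    CUDA_CHECK( cudaMalloc( (void**)&dev_c, sizeof(int) ) );

    add<<<1,1>>>( 1, 4, dev_c );
    // Kernel launches are asynchronous and return no error code directly,
    // so check the launch, then synchronize to surface execution errors.
    CUDA_CHECK( cudaPeekAtLastError() );
    CUDA_CHECK( cudaDeviceSynchronize() );

    CUDA_CHECK( cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost ) );
    printf( "1 + 4 = %d\n", c );

    CUDA_CHECK( cudaFree( dev_c ) );
    return 0;
}

Built for an architecture the device actually supports, this prints 1 + 4 = 5; built for the wrong one, the check after the launch fires immediately instead of leaving c full of garbage.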

answered Sep 29 '22 by talonmies