My problem is very much like this one. I run the simplest CUDA program, but the kernel doesn't launch. However, I am sure that my CUDA installation is fine, since I can run complicated CUDA projects consisting of several files (which I took from someone else) with no problems. In those projects, compilation and linking are done through makefiles with a lot of flags. I suspect the problem lies in the flags used while compiling. I simply use a command like this:
nvcc -arch=sm_20 -lcudart test.cu
with a program such as this (to run on a Linux machine):
#include "cuPrintf.cu"   // cuPrintf utility from the CUDA SDK, which defines cuPrintf() and the cudaPrintf* host calls

__global__ void myKernel()
{
    cuPrintf("Hello, world from the device!\n");
}

int main()
{
    cudaPrintfInit();
    myKernel<<<1,10>>>();
    cudaPrintfDisplay(stdout, true);
    cudaPrintfEnd();
    return 0;
}
The program compiles correctly. When I add cudaMemcpy() operations, they return no errors. Any suggestions as to why the kernel doesn't launch?
The reason it is not printing when using printf
is that kernel launches are asynchronous and your program is exiting before the printf buffer gets flushed. Section B.16 of the CUDA (5.0) C Programming Guide explains this.
The output buffer for printf() is set to a fixed size before kernel launch (see Associated Host-Side API). It is circular and if more output is produced during kernel execution than can fit in the buffer, older output is overwritten. It is flushed only when one of these actions is performed:
- Kernel launch via <<<>>> or cuLaunchKernel() (at the start of the launch, and if the CUDA_LAUNCH_BLOCKING environment variable is set to 1, at the end of the launch as well),
- Synchronization via cudaDeviceSynchronize(), cuCtxSynchronize(), cudaStreamSynchronize(), cuStreamSynchronize(), cudaEventSynchronize(), or cuEventSynchronize(),
- Memory copies via any blocking version of cudaMemcpy*() or cuMemcpy*(),
- Module loading/unloading via cuModuleLoad() or cuModuleUnload(),
- Context destruction via cudaDeviceReset() or cuCtxDestroy().
For this reason, this program prints nothing:
#include <stdio.h>

__global__ void myKernel()
{
    printf("Hello, world from the device!\n");
}

int main()
{
    myKernel<<<1,10>>>();
    // No synchronization: main() returns before the printf buffer is flushed.
    return 0;
}
But this program prints "Hello, world from the device!\n" ten times.
#include <stdio.h>

__global__ void myKernel()
{
    printf("Hello, world from the device!\n");
}

int main()
{
    myKernel<<<1,10>>>();
    cudaDeviceSynchronize();   // waits for the kernel to finish and flushes the printf buffer
    return 0;
}
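As the list above notes, a blocking cudaMemcpy*() call also flushes the buffer. So a variant like the following sketch should print the ten lines as well, without an explicit cudaDeviceSynchronize() (the dummy device allocation is only there to give the copy something to transfer):

#include <stdio.h>

__global__ void myKernel()
{
    printf("Hello, world from the device!\n");
}

int main()
{
    int x = 0;
    int *d_x;
    cudaMalloc(&d_x, sizeof(int));   // dummy allocation so there is something to copy
    myKernel<<<1,10>>>();
    // A blocking copy waits for the kernel to finish and flushes the printf buffer.
    cudaMemcpy(&x, d_x, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    return 0;
}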
Are you sure that your CUDA device supports the SM_20 architecture?
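One quick way to check is to query the device's compute capability at runtime; a minimal sketch using cudaGetDeviceProperties() (the deviceQuery sample shipped with the CUDA SDK reports the same information):

#include <stdio.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0; adjust if you have several GPUs
    printf("Device 0: %s, compute capability %d.%d\n",
           prop.name, prop.major, prop.minor);
    // Code built with sm_20 requires compute capability 2.0 or higher.
    return 0;
}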
Remove the arch= option from your nvcc command line and rebuild everything. This compiles for the 1.0 CUDA architecture, which is supported on all CUDA devices. If it still doesn't run, do a clean build and make sure there are no object files left anywhere. Then rebuild and run.
Also, arch= refers to the virtual architecture, which should be something like compute_10. sm_20 is the real architecture and I believe should be used with the code= switch, not arch=.
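Either way, it helps to query the launch status explicitly: a kernel built for an architecture the device does not support fails to launch silently unless you ask the runtime for the error (typically "invalid device function"). A minimal sketch of the usual check:

#include <stdio.h>

__global__ void myKernel()
{
    printf("Hello, world from the device!\n");
}

int main()
{
    myKernel<<<1,10>>>();
    cudaError_t err = cudaGetLastError();   // reports launch errors such as an arch mismatch
    if (err != cudaSuccess)
        printf("Launch failed: %s\n", cudaGetErrorString(err));
    cudaDeviceSynchronize();                // wait for the kernel and flush its printf output
    return 0;
}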