I have CUDA code that works like this:
cpyDataGPU --> CPU
while (nsteps) {
    cudaKernel1<<<,>>>
    function1();
    cudaKernel2<<<,>>>
}
cpyDataGPU --> CPU
And function1 looks like this:
function1 {
    cudaKernel3<<<,>>>
    cudaKernel4<<<,>>>
    cpyNewNeedDataCPU --> GPU // Error line
    cudaKernel5<<<,>>>
}
According to the cudaMemcpy documentation, this function can return four different error codes: "cudaSuccess", "cudaErrorInvalidValue", "cudaErrorInvalidDevicePointer", and "cudaErrorInvalidMemcpyDirection".
However, I get the following error: "cudaErrorLaunchFailure": "An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointer and accessing out of bounds shared memory. The device cannot be used until cudaThreadExit() is called. All existing device memory allocations are invalid and must be reconstructed if the program is to continue using CUDA."
Does anybody have any idea why I am getting this error? What am I doing wrong?
Does it make sense to copy data CPU --> GPU after the previous kernel calls? The problem is that I have to copy that data here at each step, because it may change on each "while" iteration.
Thanks a lot in advance!
The documentation you linked also says:
Note that this function may also return error codes from previous, asynchronous launches.
When you call cudaMemcpy(), the program will wait for all preceding GPU work to complete (remember that kernel launches are asynchronous), then check the status and execute the memcpy if everything is OK. In this case, however, one of your kernels has failed.
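To make this concrete, here is a minimal sketch of that failure mode. The kernel badKernel and the deliberate null-pointer write are contrived for illustration, not taken from the question: the launch itself reports success, and the error only surfaces at the next synchronizing call, which here is cudaMemcpy().

#include <cstdio>
#include <cuda_runtime.h>

// Contrived kernel that dereferences an invalid device pointer.
__global__ void badKernel(int *p) {
    *p = 42;  // p is NULL here, so this faults on the device
}

int main() {
    int h = 0, *d = NULL;
    cudaMalloc(&d, sizeof(int));

    // The launch returns immediately and reports no error.
    badKernel<<<1, 1>>>(NULL);

    // cudaMemcpy synchronizes first, so it returns the error inherited
    // from the failed kernel (cudaErrorLaunchFailure on older toolkits,
    // cudaErrorIllegalAddress on newer ones), not one of the four
    // memcpy-specific codes.
    cudaError_t err = cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%s\n", cudaGetErrorString(err));

    cudaFree(d);
    return 0;
}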
The most common reason for this error is an out-of-bounds access, much like a segfault in x86 territory.
cudaErrorLaunchFailure: An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointer and accessing out of bounds shared memory. The device cannot be used until cudaThreadExit() is called. All existing device memory allocations are invalid and must be reconstructed if the program is to continue using CUDA.
The easiest way to debug this would be to use cuda-memcheck. Alternatively, you can identify which kernel failed by calling cudaDeviceSynchronize() after each kernel launch and checking its return value.
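A minimal sketch of that per-launch checking pattern (the CHECK macro and the kernels are illustrative stand-ins, not code from the question):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA call returns an error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e = (call);                                       \
        if (e != cudaSuccess) {                                       \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e));                           \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

__global__ void kernelA(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = (float)i;   // bounds check: the fix for the usual culprit
}

__global__ void kernelB(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= 2.0f;
}

int main() {
    const int n = 1024;
    float *d_v = NULL;
    CHECK(cudaMalloc(&d_v, n * sizeof(float)));

    kernelA<<<(n + 255) / 256, 256>>>(d_v, n);
    CHECK(cudaGetLastError());       // catches launch-configuration errors
    CHECK(cudaDeviceSynchronize());  // flushes asynchronous execution errors

    kernelB<<<(n + 255) / 256, 256>>>(d_v, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaFree(d_v));
    return 0;
}

Once the failing kernel is identified, remove the extra cudaDeviceSynchronize() calls again: they serialize the pipeline and cost performance.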
Are you checking the error status after calling your kernels? Because (almost?) all CUDA calls may return an error from a previous failed call or kernel launch. Since you are getting a launch failure, I suspect one of the kernels before the copy is the real source of the error.
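For example, instrumenting function1 roughly like this would show whether the memcpy or an earlier kernel is really at fault. The kernel names come from the question; their bodies, launch parameters, and buffers below are hypothetical stand-ins.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void cudaKernel3(float *v) { v[threadIdx.x] = 1.0f; }   // stub body
__global__ void cudaKernel4(float *v) { v[threadIdx.x] += 1.0f; }  // stub body
__global__ void cudaKernel5(float *v) { v[threadIdx.x] *= 2.0f; }  // stub body

// Report, but do not abort, so every step gets checked.
static void check(cudaError_t err, const char *where) {
    if (err != cudaSuccess)
        fprintf(stderr, "%s: %s\n", where, cudaGetErrorString(err));
}

void function1(float *d_v, const float *h_newData, size_t bytes) {
    cudaKernel3<<<1, 256>>>(d_v);
    check(cudaDeviceSynchronize(), "cudaKernel3");

    cudaKernel4<<<1, 256>>>(d_v);
    check(cudaDeviceSynchronize(), "cudaKernel4");

    // If both checks above passed, an error reported here belongs to the
    // copy itself; otherwise it was inherited from a failed kernel.
    check(cudaMemcpy(d_v, h_newData, bytes, cudaMemcpyHostToDevice),
          "cpyNewNeedDataCPU --> GPU");

    cudaKernel5<<<1, 256>>>(d_v);
    check(cudaDeviceSynchronize(), "cudaKernel5");
}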