
Not able to kill bad kernel running on NVIDIA GPU

I am in a real fix. Please help. It's urgent.

I have a host process that spawns multiple host (CPU) threads (pthreads). These threads in turn launch CUDA kernels. The kernels are written by external users, so some of them may be bad kernels that enter an infinite loop. To guard against this, I have put in a time-out of 2 minutes, after which the corresponding CPU thread is killed.

Will killing the CPU thread also kill the kernel running on the GPU? As far as I have tested, it doesn't.

How can I kill all the threads currently running on the GPU?

Edit: The reason I am using CPU threads to call the kernel is that the server has two Tesla GPUs, so the threads schedule kernels on the two GPU devices alternately.

Thanks, Arvind

arvindkgs asked Nov 06 '22 12:11


2 Answers

It doesn't seem to. I ran a broken kernel and locked up one of my devices seemingly indefinitely (until reboot). I'm not sure how to kill a running kernel. I think there is a way to limit kernel execution time via the driver, though, so that might be the way to go.

interfect answered Nov 12 '22 17:11


Unless there's a larger part of this I'm not really getting, you might be better off using the CUDA Streams API for multi-device tasking, but YMMV.

As for the killing: if you're running the cards with a display (and X server) attached, kernels will automatically time out after 5 seconds (again, YMMV).

Assuming that this isn't the case, look into calling cudaDeviceReset() (see the API reference) from the 'parent' thread after your own prescribed 'kill' timeout.

I have not used this function in my own code yet, so I honestly have no idea whether it'll work in your situation, but it's worth investigating.
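If cudaDeviceReset() turns out not to help, a different route (not the one suggested above, and not something tested on your setup) is to run each untrusted kernel in its own child process instead of a pthread, and SIGKILL that process when the timeout expires. The assumption here is that the driver tears down a process's CUDA context when the process dies, which should take the stuck kernel with it. A rough host-side sketch in plain C follows. The names run_with_timeout, gpu_job_fn, quick_job and stuck_job are made up for illustration, and the two jobs are CPU-only stand-ins so the watchdog logic can be tried without a GPU.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical job type. In real use the body would pick a device, launch
 * the user's kernel and block until it finishes (cudaSetDevice, the kernel
 * launch, cudaDeviceSynchronize). Those calls are left out here. */
typedef void (*gpu_job_fn)(void);

/* Monotonic clock in seconds, so wall-clock adjustments don't skew the timeout. */
static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Runs job() in a child process.
 * Returns 0 if the child ended on its own (normally or by crashing),
 *         1 if it was still running at the deadline and was SIGKILLed,
 *        -1 if fork() or waitpid() failed. */
int run_with_timeout(gpu_job_fn job, int timeout_sec)
{
    assert(timeout_sec > 0);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {          /* child: do the work, then exit immediately */
        job();
        _exit(0);
    }

    /* parent: poll every 50 ms until the child exits or the deadline passes */
    const struct timespec nap = { 0, 50L * 1000L * 1000L };
    const double deadline = now_sec() + (double)timeout_sec;
    for (;;) {
        int status;
        pid_t done = waitpid(pid, &status, WNOHANG);
        if (done == pid)
            return 0;
        if (done < 0)
            return -1;
        if (now_sec() >= deadline)
            break;
        nanosleep(&nap, NULL);
    }

    kill(pid, SIGKILL);      /* cannot be caught or ignored by the child */
    waitpid(pid, NULL, 0);   /* reap the child so it doesn't stay a zombie */
    return 1;
}

/* Stand-in for a well-behaved kernel: returns right away. */
void quick_job(void) { }

/* Stand-in for a kernel stuck in an infinite loop: never returns. */
void stuck_job(void) { for (;;) pause(); }
```

With these stand-ins, run_with_timeout(quick_job, 5) returns 0 and run_with_timeout(stuck_job, 1) returns 1 after roughly one second. For your case the timeout would be 120. Two caveats: as far as I know a CUDA context does not carry over a fork(), so the child should be the one that initializes CUDA, and forking from a process that already has several pthreads running needs care, so it may be simpler to fork the workers from the main thread before any threads or CUDA calls exist.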

Bolster answered Nov 12 '22 18:11