OpenCL: Distinguishing computation failure from TDR interrupt

When running long OpenCL computations on Windows using the GPU that also runs the main display, the OS may interrupt the computation with Timeout Detection and Recovery.

In my experience (Java, using JavaCL by NativeLibs4Java, with an NVIDIA GPU) this manifests as an "Out Of Resources" (CL_OUT_OF_RESOURCES) error when invoking clEnqueueReadBuffer.

The problem is that I get the exact same error when the OpenCL program fails for other reasons (e.g., because it accesses invalid memory).

Is there a (semi) reliable way to distinguish between an "Out of Resources" caused by TDR and an "Out of Resources" caused by other problems?

Alternately, can I at least reliably (in Java / through OpenCL API) determine that the GPU used for computation is also running the display?

I am aware of this question; however, the answer there is concerned with scenarios where clFinish does not return, which is not a problem for me (so far my code has never frozen inside the OpenCL API).

asked Nov 09 '16 by Martin Modrák

1 Answer

Is there a (semi) reliable way to distinguish between an "Out of Resources" caused by TDR and an "Out of Resources" caused by other problems?

1)

If you can access

KeyPath   : HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers
KeyValue  : TdrDelay
ValueType : REG_DWORD
ValueData : Number of seconds to delay. 2 seconds is the default value.

from WMI to multiply it by

KeyPath   : HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers
KeyValue  : TdrLimitCount
ValueType : REG_DWORD
ValueData : Number of TDRs before crashing. The default value is 5.

again with WMI. Multiplying the defaults gives 10 seconds. You should also get

KeyPath   : HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers
KeyValue  : TdrLimitTime
ValueType : REG_DWORD
ValueData : Number of seconds before crashing. 60 seconds is the default value.

that should read 60 seconds from WMI.

For this example computer, that means up to 5 TDR events of 2 seconds each (plus one extra delay) within the final 60-second crash limit. Your application can then check whether the stopwatch time of the last call exceeded those limits; if it did, the failure was probably a TDR. There is also a thread-exit-from-driver time limit on top of these,

KeyPath   : HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers
KeyValue  : TdrDdiDelay
ValueType : REG_DWORD
ValueData : Number of seconds to leave the driver. 5 seconds is the default value.

which defaults to 5 seconds. Accessing an invalid memory segment should make the call exit much more quickly. You could also raise these TDR time limits (through WMI or the registry) to several minutes so the program can finish its computation without being crashed by preemption starvation. Changing the registry is dangerous, though: if you set the TDR time limit to 1 second, or a fraction of it, Windows may never boot without constant TDR crashes, so just reading those values is the safer option.
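For completeness, a minimal host-side sketch of that stopwatch check in C++ (not from the original answer): it reads TdrDelay straight from the registry with RegGetValueW instead of WMI, falls back to the documented 2-second default, and treats a CL_OUT_OF_RESOURCES that took at least that long as a likely TDR. The queue, buf, bytes and host arguments stand in for your own objects.

#include <windows.h>
#include <chrono>
#include <CL/cl.h>

// read TdrDelay in seconds; if the value is absent, the documented default of 2 s applies
DWORD readTdrDelaySeconds() {
    DWORD value = 0, size = sizeof(value);
    if (RegGetValueW(HKEY_LOCAL_MACHINE,
                     L"System\\CurrentControlSet\\Control\\GraphicsDrivers",
                     L"TdrDelay", RRF_RT_REG_DWORD, NULL, &value, &size) != ERROR_SUCCESS)
        value = 2;
    return value;
}

// time the blocking read; a CL_OUT_OF_RESOURCES that took at least TdrDelay seconds
// is probably a TDR, while an invalid-memory failure usually returns much faster
bool looksLikeTdr(cl_command_queue queue, cl_mem buf, size_t bytes, void* host) {
    auto t0 = std::chrono::steady_clock::now();
    cl_int err = clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, bytes, host, 0, NULL, NULL);
    double elapsed = std::chrono::duration<double>(
                         std::chrono::steady_clock::now() - t0).count();
    return err == CL_OUT_OF_RESOURCES && elapsed >= (double)readTdrDelaySeconds();
}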

2)

Split the total work into much smaller parts. If the data is not separable, copy it once, then enqueue the long-running kernel as very short-ranged kernels n times, with some waiting between any two launches (see the sketch after the diagrams below).

That way you can be sure TDR is eliminated. If this version runs but the long-running kernel doesn't, it is a TDR fault. If it is the opposite, it is a memory crash. It looks like this:

short running x 1024 times
long running
long running <---- fail? TDR! because memory would crash short ver. too!
long running

another try:

short running x 1024 times <---- fail? memory! because only 1ms per kernel
long running
long running 
long running
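A rough sketch of the splitting idea in host C++ (the kernel, queue, sizes and the 1 ms pause are placeholders; the kernel is assumed to index its data with get_global_id so that the global work offset is honoured):

#include <windows.h>
#include <CL/cl.h>

// launch one logical long-running 1D kernel as many short launches,
// each covering a slice of the global range via a global work offset
void runInChunks(cl_command_queue queue, cl_kernel kernel,
                 size_t totalSize, size_t nChunks) {
    size_t chunk = totalSize / nChunks;   // assumes totalSize is divisible by nChunks
    for (size_t i = 0; i < nChunks; ++i) {
        size_t offset = i * chunk;
        clEnqueueNDRangeKernel(queue, kernel, 1, &offset, &chunk, NULL, 0, NULL, NULL);
        clFinish(queue);   // wait for this slice so the driver can preempt for the display
        Sleep(1);          // small pause between launches to keep the GUI responsive
    }
}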

Alternately, can I at least reliably (in Java / through OpenCL API) determine that the GPU used for computation is also running the display?

1)

Use the OpenCL/OpenGL interoperability properties of the devices:

// taken from Intel's site (fixed up): first query the size, then the device list
size_t bytes = 0;
clGetGLContextInfoKHR(props, CL_DEVICES_FOR_GL_CONTEXT_KHR, 0, NULL, &bytes);
std::vector<cl_device_id> devs(bytes / sizeof(cl_device_id));
//reading the info
clGetGLContextInfoKHR(props, CL_DEVICES_FOR_GL_CONTEXT_KHR, bytes, devs.data(), NULL);

This gives the list of interoperable devices, i.e. the ones that can share the display's GL context. You can take the display device's ID from there and exclude it if you don't want to use it.
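As a hedged sketch of how this can be wired up on Windows (assuming an OpenGL context is current on the calling thread and a platform has already been chosen; error checking omitted): the extension function is loaded at run time, and CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR asks specifically for the device that is rendering the GL context, i.e. the display GPU.

#include <windows.h>
#include <CL/cl.h>
#include <CL/cl_gl.h>

// returns the device currently driving the GL context (the display GPU)
cl_device_id getDisplayDevice(cl_platform_id platform) {
    cl_context_properties props[] = {
        CL_GL_CONTEXT_KHR,   (cl_context_properties)wglGetCurrentContext(),
        CL_WGL_HDC_KHR,      (cl_context_properties)wglGetCurrentDC(),
        CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
        0
    };
    // the KHR extension function must be loaded at run time
    clGetGLContextInfoKHR_fn getGLContextInfo = (clGetGLContextInfoKHR_fn)
        clGetExtensionFunctionAddressForPlatform(platform, "clGetGLContextInfoKHR");
    cl_device_id displayDev = NULL;
    getGLContextInfo(props, CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
                     sizeof(displayDev), &displayDev, NULL);
    return displayDev;
}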

2)

Have another thread run some OpenGL or DirectX code that draws a static object, to keep one of the GPUs busy. Then test all GPUs simultaneously, using another thread, with some trivial OpenCL kernels. Test:

  • OpenGL starts drawing something with a high triangle count at 60 fps.
  • start the devices for OpenCL compute and measure the average kernel executions per second (keps)
  • device 1: 30 keps
  • device 2: 40 keps
  • after a while, stop OpenGL and close its window (if not already closed)
  • device 1: 75 keps -----> highest increase in percentage --> this is the display GPU!
  • device 2: 41 keps -----> much smaller increase, so it is not driving the display

You should not copy any data between devices while doing this, so that CPU/RAM does not become the bottleneck.
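Here is a rough sketch of the measurement itself (contexts, queues and the trivial kernel are assumed to exist already; only the counting loop per device is shown):

#include <chrono>
#include <CL/cl.h>

// run a trivial kernel back-to-back for `seconds` and return kernel executions per second
double kernelsPerSecond(cl_command_queue queue, cl_kernel kernel, double seconds) {
    size_t gsize = 1024;   // tiny global size so a single launch finishes quickly
    long count = 0;
    auto t0 = std::chrono::steady_clock::now();
    while (std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count() < seconds) {
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gsize, NULL, 0, NULL, NULL);
        clFinish(queue);   // block so finished launches are counted, not enqueues
        ++count;
    }
    return count / seconds;
}

Call it once per device while the OpenGL window is drawing and once after it is closed; the device whose rate rises the most is the one that was sharing time with the display.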

3)

If the data is separable, you can use a divide-and-conquer algorithm that gives each GPU its own piece of work only when that GPU is available, which leaves the display more flexibility (this is a performance-aware solution, similar to the short-running-kernel version, but with the scheduling done across multiple GPUs).

4)

I didn't check this because I sold my second GPU, but you could try

CL_DEVICE_TYPE_DEFAULT

in your multi-GPU system to see whether it returns the display GPU or not. Shut down the PC, plug the monitor cable into the other card, and try again. Shut down, swap the cards between slots, try again. Shut down, remove one of the cards so that only one GPU and the CPU are left, try again. If all of these return only the display GPU, the runtime is marking the display GPU as the default device.
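For reference, the query itself is short; a minimal sketch assuming a single platform and no error checking:

#include <stdio.h>
#include <CL/cl.h>

int main() {
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);
    cl_device_id defaultDev;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &defaultDev, NULL);
    char name[256];
    clGetDeviceInfo(defaultDev, CL_DEVICE_NAME, sizeof(name), name, NULL);
    printf("default device: %s\n", name);   // compare against the GPU driving the monitor
    return 0;
}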

answered Nov 07 '22 by huseyin tugrul buyukisik