This is on a Windows 10 computer with no monitor attached to the Nvidia card. I've included output from nvidia-smi showing > 5.04G was available.
Here is the TensorFlow code asking it to allocate just slightly more than I had seen previously (I want this to be as close as possible to a memory fraction of 1.0):
import tensorflow as tf

config = tf.ConfigProto()
#config.gpu_options.allow_growth=True
config.gpu_options.per_process_gpu_memory_fraction=0.84
config.log_device_placement=True
sess = tf.Session(config=config)
Just before running the above lines in a Jupyter notebook I ran nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 376.51 Driver Version: 376.51 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 106... WDDM | 0000:01:00.0 Off | N/A |
| 0% 27C P8 5W / 120W | 43MiB / 6144MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Output from TF, after it successfully allocates 5.01GB, shows "failed to allocate 5.04G (5411658752 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY" (you need to scroll to the right to see it below):
2017-12-17 03:53:13.959871: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.7845
pciBusID: 0000:01:00.0
totalMemory: 6.00GiB freeMemory: 5.01GiB
2017-12-17 03:53:13.960006: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1)
2017-12-17 03:53:13.961152: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\stream_executor\cuda\cuda_driver.cc:936] failed to allocate 5.04G (5411658752 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1
2017-12-17 03:53:14.151073: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\direct_session.cc:299] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1
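As a side note on where the 5.04G figure comes from: it appears to be 84% of the card's 6GiB total rather than 84% of the reported free memory, which would explain why the request exceeds the 5.01GiB that stream_exec reports as free. A quick back-of-the-envelope check (my own arithmetic, not taken from the logs):

total_bytes = 6 * 1024**3          # 6442450944, the 6.00GiB total reported above
requested = 0.84 * total_bytes     # per_process_gpu_memory_fraction * total
print(requested)                   # ~5411658793 bytes, i.e. the 5.04G in the error
                                   # (the exact 5411658752 looks like this value
                                   #  rounded down to an allocation boundary)
free_bytes = 5.01 * 1024**3        # roughly what TF reports as free
print(requested > free_bytes)      # True: a single 5.04G block cannot fit in 5.01GiB

So the immediate failure is just the 0.84 request exceeding what the driver says is free; the underlying question remains why only 5.01GiB of 6GiB is free with no processes running.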
My best guess is that some policy in an Nvidia user-level DLL is preventing use of all of the memory (perhaps to allow for attaching a monitor?).
If that theory is correct, I'm looking for any user-accessible knob to turn that off on Windows 10. If I'm on the wrong track, any help pointing me in the right direction is appreciated.
I realized I did not include this bit of research: the following code in TensorFlow indicates that stream_exec is 'telling' TensorFlow that only 5.01GB is free. This is the primary reason for my current theory that some Nvidia component is preventing the allocation. (However, I could be misunderstanding which component implements the instantiated stream_exec.)
auto stream_exec = executor.ValueOrDie();
int64 free_bytes;
int64 total_bytes;
if (!stream_exec->DeviceMemoryUsage(&free_bytes, &total_bytes)) {
// Logs internally on failure.
free_bytes = 0;
total_bytes = 0;
}
const auto& description = stream_exec->GetDeviceDescription();
int cc_major;
int cc_minor;
if (!description.cuda_compute_capability(&cc_major, &cc_minor)) {
// Logs internally on failure.
cc_major = 0;
cc_minor = 0;
}
LOG(INFO) << "Found device " << i << " with properties: "
<< "\nname: " << description.name() << " major: " << cc_major
<< " minor: " << cc_minor
<< " memoryClockRate(GHz): " << description.clock_rate_ghz()
<< "\npciBusID: " << description.pci_bus_id() << "\ntotalMemory: "
<< strings::HumanReadableNumBytes(total_bytes)
<< " freeMemory: " << strings::HumanReadableNumBytes(free_bytes);
}
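To cross-check what the driver itself reports, independently of TensorFlow, the same information can be queried directly from the CUDA driver API. A minimal sketch, assuming PyCUDA is installed (as far as I can tell, stream_exec's DeviceMemoryUsage ultimately wraps the same cuMemGetInfo call):

# Sketch: ask the CUDA driver directly how much VRAM is free, bypassing TensorFlow.
# Assumes the pycuda package is installed; run while nothing else is using the GPU.
import pycuda.driver as cuda

cuda.init()
ctx = cuda.Device(0).make_context()   # a context is required before mem_get_info()
try:
    free_bytes, total_bytes = cuda.mem_get_info()   # wraps cuMemGetInfo
    print("free : %.2f GiB" % (free_bytes / 2.0**30))
    print("total: %.2f GiB" % (total_bytes / 2.0**30))
finally:
    ctx.pop()

If this also reports only ~5.01GiB free on an otherwise idle card, the limit is coming from the driver/OS layer rather than from anything TensorFlow is doing.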
The thread below indicates that Windows 10 prevents full use of VRAM across the board on secondary video cards used for compute, by grabbing a percentage of the VRAM: https://social.technet.microsoft.com/Forums/windows/en-US/15b9654e-5da7-45b7-93de-e8b63faef064/windows-10-does-not-let-cuda-applications-to-use-all-vram-on-especially-secondary-graphics-cards?forum=win10itprohardware
This seems implausible, given it would mean every Windows 10 box is inherently worse than Windows 7 for any workload where VRAM on a compute-dedicated graphics card could plausibly be the bottleneck.
Update: I have changed the title to more clearly be a question. Feedback indicates this may be better filed as a bug with Microsoft or Nvidia, and I am pursuing other avenues to get this addressed. However, I don't want to assume it cannot be resolved directly.
Further experiments indicate that the issue I am hitting only occurs for a single large allocation from a single process; all of the VRAM can be used when another process comes into play.
The failure here is an allocation failure. According to the nvidia-smi output above, 43MiB is in use (perhaps by the system?), but not by any identifiable process. The failure I'm seeing is for a monolithic single allocation, which under a typical allocation model requires a contiguous address space. So the pertinent questions may be: what is causing that 43MiB to be used, and is it placed in the address space such that 5.01GB is the largest contiguous block available?
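One way to test the contiguous-block theory directly would be to binary-search the largest single cuMemAlloc that succeeds, again with PyCUDA (a sketch I have not run on this box, so treat it as an assumption rather than a result):

# Sketch: find the largest single device allocation that succeeds, to 1 MiB resolution.
# Assumes pycuda; a failed cuMemAlloc surfaces as pycuda.driver.MemoryError.
import pycuda.autoinit                 # creates a context on GPU 0
import pycuda.driver as cuda

lo, hi = 0, cuda.mem_get_info()[1]     # search between 0 bytes and total VRAM
while hi - lo > (1 << 20):
    mid = (lo + hi) // 2
    try:
        buf = cuda.mem_alloc(mid)      # one monolithic allocation, like TF's pre-allocation
        buf.free()
        lo = mid                       # mid bytes fit in a single block
    except cuda.MemoryError:
        hi = mid                       # mid bytes do not fit
print("largest single allocation: %.2f GiB" % (lo / 2.0**30))

If this tops out at roughly 5.01GiB even with nothing else running, the single-allocation limit is independent of TensorFlow.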
There is no way to preset your VRAM to a specific value; you can only limit the maximum memory it can take. The Graphics Processing Unit (GPU) does not have dedicated memory in that sense; it uses shared memory that is allocated automatically depending on various factors.
Any GPU can use system RAM when it runs out of its own VRAM. Texture data can be read from system RAM over the PCIe bus to make up for the lack of the faster VRAM.
Switching to the dedicated Nvidia GPU:
- Open the Program Settings tab and choose the game from the dropdown menu.
- Next, select the preferred graphics processor for this program from the second dropdown. Your Nvidia GPU should show as "High-performance Nvidia processor".
- Finally, save your changes.
It is clearly not possible for now, as the Windows Display Driver Model (WDDM) 2.x has a limit defined, and no process can (legally) override it.
Assuming you have already played with the "Prefer Maximum Performance" power setting, you can push it to about 92% at most.
If you would like to know more about WDDM 2.x, this explains it in detail:
https://docs.microsoft.com/en-us/windows-hardware/drivers/display/what-s-new-for-windows-threshold-display-drivers--wddm-2-0-