I have two GPUs. My program uses TensorRT and Tensorflow.
When I run only TensorRT part, it is fine. When I run together with Tensorflow part, I have error as
[TensorRT] ERROR: engine.cpp (370) - Cuda Error in ~ExecutionContext: 77 (an illegal memory access was encountered)
terminate called after throwing an instance of 'nvinfer1::CudaError'
what(): std::exception
The issue is when Tensorflow session starts as follow
self.graph = tf.get_default_graph()
self.persistent_sess = tf.Session(graph=self.graph, config=tf_config)
It loads two GPUs as
2019-06-06 14:15:04.420265: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6965 MB memory) -> physical GPU (device: 0, name: Quadro P4000, pci bus id: 0000:04:00.0, compute capability: 6.1)
2019-06-06 14:15:04.420713: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 7159 MB memory) -> physical GPU (device: 1, name: Quadro P4000, pci bus id: 0000:05:00.0, compute capability: 6.1)
I tried to load only one GPU as
(1)Putting on top of the python code
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
(2)
with tf.device('/device:GPU:0'):
self.graph = tf.get_default_graph()
self.persistent_sess = tf.Session(graph=self.graph, config=tf_config)
Both don't work.
How to solve the problem?
TensorFlow code, and tf.keras models will transparently run on a single GPU with no code changes required. Note: Use tf.config.list_physical_devices ('GPU') to confirm that TensorFlow is using the GPU. The simplest way to run on multiple GPUs, on one or many machines, is using Distribution Strategies.
If a TensorFlow operation has both CPU and GPU implementations, by default the GPU devices will be given priority when the operation is assigned to a device. For example, tf.matmul has both CPU and GPU kernels.
Tensorflow/Keras also allows to specify gpu to be used with session config. I can recommend it only if setting environment variable is not an options (i.e. an MPI run). Because it tend to be the least reliable of all methods, especially with keras.
tf.distribute.Strategy is a TensorFlow API to distribute training across multiple GPUs, multiple machines, or TPUs. Using this API, you can distribute your existing models and training code with minimal code changes. tf.distribute.Strategy has been designed with these key goals in mind:
I could manage to load only one GPU be placing the following lines at the first line of the python code.
import sys, os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With