Tensorflow on shared GPUs: how to automatically select the one that is unused

I have access through ssh to a cluster of n GPUs. Tensorflow automatically gave them names gpu:0,...,gpu:(n-1).

Others have access too, and sometimes they take GPUs at random. I did not place any tf.device() explicitly because that is cumbersome, and even if I selected GPU number j, it would be a problem if someone else were already using it.

I would like to go through the GPUs' usage, find the first one that is unused, and use only that one. I guess one could parse the output of nvidia-smi with bash to get a variable i, then feed that variable to the TensorFlow script as the number of the GPU to use.

I have never seen an example of this, though I imagine it is a pretty common problem. What would be the simplest way to do that? Is a pure TensorFlow solution available?

asked Jan 13 '17 12:01

jeandut

2 Answers

I'm not aware of a pure-TensorFlow solution. The problem is that the existing place for TensorFlow configuration is the Session config. However, the GPU memory pool is shared by all TensorFlow sessions within a process, so the Session config would be the wrong place to add this, and there's no mechanism for process-global config (but there should be, to also be able to configure the process-global Eigen threadpool). So you need to do it at the process level by using the CUDA_VISIBLE_DEVICES environment variable.

Something like this:

import subprocess, re

# Nvidia-smi GPU memory parsing.
# Tested on nvidia-smi 370.23

def run_command(cmd):
    """Run command, return output as string."""
    output = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True).communicate()[0]
    return output.decode("ascii")

def list_available_gpus():
    """Returns list of available GPU ids."""
    output = run_command("nvidia-smi -L")
    # lines of the form GPU 0: TITAN X
    gpu_regex = re.compile(r"GPU (?P<gpu_id>\d+):")
    result = []
    for line in output.strip().split("\n"):
        m = gpu_regex.match(line)
        assert m, "Couldn't parse " + line
        result.append(int(m.group("gpu_id")))
    return result

def gpu_memory_map():
    """Returns map of GPU id to memory allocated on that GPU."""

    output = run_command("nvidia-smi")
    gpu_output = output[output.find("GPU Memory"):]
    # lines of the form
    # |    0      8734    C   python                                       11705MiB |
    memory_regex = re.compile(r"[|]\s+?(?P<gpu_id>\d+)\D+?(?P<pid>\d+).+[ ](?P<gpu_memory>\d+)MiB")
    # Start every known GPU at 0 MiB so that idle GPUs appear in the result.
    result = {gpu_id: 0 for gpu_id in list_available_gpus()}
    for row in gpu_output.split("\n"):
        m = memory_regex.search(row)
        if not m:
            continue
        gpu_id = int(m.group("gpu_id"))
        gpu_memory = int(m.group("gpu_memory"))
        result[gpu_id] += gpu_memory
    return result

def pick_gpu_lowest_memory():
    """Returns GPU with the least allocated memory"""

    memory_gpu_map = [(memory, gpu_id) for (gpu_id, memory) in gpu_memory_map().items()]
    best_memory, best_gpu = sorted(memory_gpu_map)[0]
    return best_gpu

You can then put it in utils.py and set the GPU in your TensorFlow script before the first tensorflow import, i.e.:

import utils
import os
os.environ["CUDA_VISIBLE_DEVICES"] = str(utils.pick_gpu_lowest_memory())
import tensorflow
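
The table parsing above depends on the exact layout printed by a given nvidia-smi version. As a possible alternative, here is a minimal sketch that asks nvidia-smi for machine-readable CSV through its --query-gpu and --format options. The function names (parse_memory_csv, pick_gpu_from_csv, query_memory_csv) are made up for illustration, and the sketch assumes every GPU reports memory.used as a plain integer in MiB.

```python
import subprocess


def parse_memory_csv(csv_text):
    """Parse 'index, memory.used' CSV rows into {gpu_id: used_MiB}.

    Assumes each row looks like '0, 11705' (no header, no units).
    """
    usage = {}
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, used = (field.strip() for field in line.split(","))
        usage[int(index)] = int(used)
    return usage


def pick_gpu_from_csv(csv_text):
    """Return the id of the GPU with the least memory in use.

    Ties are broken in favour of the lowest GPU id.
    """
    usage = parse_memory_csv(csv_text)
    if not usage:
        raise RuntimeError("no GPUs reported")
    return min(usage, key=lambda gpu_id: (usage[gpu_id], gpu_id))


def query_memory_csv():
    """Run nvidia-smi and return its CSV output (needs nvidia-smi on PATH)."""
    cmd = ["nvidia-smi", "--query-gpu=index,memory.used",
           "--format=csv,noheader,nounits"]
    return subprocess.check_output(cmd).decode("ascii")


# Intended use, before the first tensorflow import:
#   os.environ["CUDA_VISIBLE_DEVICES"] = str(pick_gpu_from_csv(query_memory_csv()))
```

Keeping the parsing separate from the nvidia-smi call makes it possible to check the selection logic on a machine without a GPU. Note that either approach only takes a snapshot: if two people start jobs at the same moment, both can pick the same GPU.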

answered Oct 14 '22 03:10

Yaroslav Bulatov

An implementation along the lines of Yaroslav Bulatov's solution is available at https://github.com/bamos/setGPU.

answered Oct 14 '22 02:10

Trisoloriansunscreen


Tags: tensorflow, gpu, distributed-system
