I want to create a neural network in tensorflow 2.x that trains on a GPU, and I want to set up all the necessary infrastructure inside a docker-compose network (assuming that this is actually possible for now). As far as I know, in order to train a tensorflow model on a GPU, I need the CUDA toolkit and the NVIDIA driver. Installing these dependencies natively on my computer (OS: Ubuntu 18.04) is always quite a pain, as there are many version dependencies between tensorflow, CUDA and the NVIDIA driver. So, I was trying to find a way to create a docker-compose file that contains a service for tensorflow, CUDA and the NVIDIA driver, but I am getting the following error:
# Start the services
sudo docker-compose -f docker-compose-test.yml up --build
Starting vw_image_cls_nvidia-driver_1 ... done
Starting vw_image_cls_nvidia-cuda_1 ... done
Recreating vw_image_cls_tensorflow_1 ... error
ERROR: for vw_image_cls_tensorflow_1 Cannot start service tensorflow: OCI runtime create failed: container_linux.go:346: starting container process caused "exec: \"import\": executable file not found in $PATH": unknown
ERROR: for tensorflow Cannot start service tensorflow: OCI runtime create failed: container_linux.go:346: starting container process caused "exec: \"import\": executable file not found in $PATH": unknown
ERROR: Encountered errors while bringing up the project.
My docker-compose file looks as follows:
# version 2.3 is required for NVIDIA runtime
version: '2.3'

services:
  nvidia-driver:
    # NVIDIA GPU driver used by the CUDA Toolkit
    image: nvidia/driver:440.33.01-ubuntu18.04
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      # Do we need this volume to make the driver accessible by other containers in the network?
      - nvidia_driver:/usr/local/nvidai/:ro # Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
    networks:
      - net

  nvidia-cuda:
    depends_on:
      - nvidia-driver
    image: nvidia/cuda:10.1-base-ubuntu18.04
    volumes:
      # Do we need the driver volume here?
      - nvidia_driver:/usr/local/nvidai/:ro # Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      # Do we need to create an additional volume for this service to be accessible by the tensorflow service?
    devices:
      # Do we need to list the devices here, or only in the tensorflow service? Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      - /dev/nvidiactl
      - /dev/nvidia-uvm
      - /dev/nvidia0
    networks:
      - net

  tensorflow:
    image: tensorflow/tensorflow:2.0.1-gpu # Does this ship with cuda10.0 installed or do I need a separate container for it?
    runtime: nvidia
    restart: always
    privileged: true
    depends_on:
      - nvidia-cuda
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      # Volumes related to source code and config files
      - ./src:/src
      - ./configs:/configs
      # Do we need the driver volume here?
      - nvidia_driver:/usr/local/nvidai/:ro # Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      # Do we need an additional volume from the nvidia-cuda service?
    command: import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000]))); print("SUCCESS")
    devices:
      # Devices listed here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      - /dev/nvidiactl
      - /dev/nvidia-uvm
      - /dev/nvidia0
      - /dev/nvidia-uvm-tools
    networks:
      - net

volumes:
  nvidia_driver:

networks:
  net:
    driver: bridge
And my /etc/docker/daemon.json file looks as follows:
{"default-runtime":"nvidia",
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
So, it seems like the error is somehow related to configuring the NVIDIA runtime, but more importantly, I am almost certain that I didn't set up my docker-compose file correctly. So, my questions are: how do I set this up correctly, and do I even need separate services for the driver and CUDA, or can it all be handled from the tensorflow service (and the docker-compose.yml)? Thank you very much for your help, I highly appreciate it.
I agree that installing all tensorflow-gpu dependencies is rather painful. Fortunately, it's rather easy with Docker, as you only need the NVIDIA Driver and the NVIDIA Container Toolkit (a sort of a plugin). The rest (CUDA, cuDNN) ships inside the Tensorflow images, so you don't need them on the Docker host.
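As a side note (not part of the original setup), if you want to check which CUDA and cuDNN versions a particular Tensorflow image was built against, one quick way is the build-info API that TF 2.3+ exposes; no GPU is needed for this:

# Prints a dict that includes cuda_version and cudnn_version
docker run --rm tensorflow/tensorflow:2.3.0-gpu \
    python -c "import tensorflow as tf; print(tf.sysconfig.get_build_info())"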
The driver can be deployed as a container too, but I do not recommend that for a workstation. It is meant to be used on servers where there is no GUI (X-server, etc.). The containerized driver is covered at the end of this post; for now, let's see how to start tensorflow-gpu with docker-compose. The process is the same regardless of whether the driver runs in a container or not.
Prerequisites: Docker and docker-compose, the NVIDIA driver, and the NVIDIA Container Toolkit installed on the host.
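For reference, here is a minimal install sketch for the NVIDIA Container Toolkit on Ubuntu 18.04, based on the apt repository NVIDIA documented at the time (check the official installation guide for the current commands before copying this):

# Add NVIDIA's package repository for the container toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
    sudo tee /etc/apt/sources.list.d/nvidia-docker.list

# Install the toolkit and restart Docker so it picks up the new runtime
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker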
To enable GPU support for a container you need to create the container with the NVIDIA Container Toolkit. There are two ways you can do that:

1. Set nvidia as Docker's default container runtime. It is fine to do so, as it works just like the default runtime unless some NVIDIA-specific environment variables are present (more on that later). This is done by placing "default-runtime": "nvidia" into Docker's daemon.json, /etc/docker/daemon.json:
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
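After editing daemon.json, restart the Docker daemon and check that the runtime is registered (assuming Docker is managed by systemd):

# Reload Docker so the new default runtime takes effect
sudo systemctl restart docker

# Should list 'nvidia' among the runtimes (and as the default, if configured)
docker info | grep -i runtime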
2. Specify runtime: nvidia for a particular service in docker-compose; this is only possible with compose file format version 2.3.

Here is a sample docker-compose.yml to launch Tensorflow with GPU:
version: "2.3" # the only version where 'runtime' option is supported
services:
test:
image: tensorflow/tensorflow:2.3.0-gpu
# Make Docker create the container with NVIDIA Container Toolkit
# You don't need it if you set 'nvidia' as the default runtime in
# daemon.json.
runtime: nvidia
# the lines below are here just to test that TF can see GPUs
entrypoint:
- /usr/local/bin/python
- -c
command:
- "import tensorflow as tf; tf.test.is_gpu_available(cuda_only=False, min_cuda_compute_capability=None)"
By running this with docker-compose up you should see a line with the GPU specs in it. It appears at the end and looks like this:
test_1 | 2021-01-23 11:02:46.500189: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/device:GPU:0 with 1624 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)
And that is all you need to launch an official Tensorflow image with GPU.
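Note that tf.test.is_gpu_available is deprecated in later TF 2.x releases; an equivalent check (just a small sketch, not part of the original compose file) is:

# Minimal GPU visibility check for TF 2.x
import tensorflow as tf

# An empty list means TensorFlow cannot see any GPU
print("GPUs visible to TensorFlow:", tf.config.list_physical_devices("GPU"))

You can put that one-liner into the command of the compose file above instead of the deprecated call.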
As I mentioned, the nvidia runtime works just like the default runtime unless some NVIDIA-specific environment variables are present. These are listed and explained here. You only need to care about them if you build a custom image and want to enable GPU support in it. Official Tensorflow GPU images have them inherited from the CUDA images they use as a base, so you only need to start the image with the right runtime, as in the example above.
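For illustration only (the service name and values below are hypothetical examples, not something from the post), this is what setting those variables yourself in a compose service would look like:

version: "2.3"

services:
  my-gpu-app:          # hypothetical custom image
    build: .
    runtime: nvidia
    environment:
      # make all GPUs visible inside the container
      - NVIDIA_VISIBLE_DEVICES=all
      # enable the compute (CUDA) and utility (nvidia-smi) capabilities
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility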
If you are interested in customising a Tensorflow image, I wrote another post on that.
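Without going into the details of that post, the usual approach is simply to extend the official GPU image, so CUDA, cuDNN and the NVIDIA variables come along for free. A hypothetical Dockerfile sketch (package names and paths are placeholders):

# The official GPU image already contains CUDA, cuDNN and the NVIDIA env variables
FROM tensorflow/tensorflow:2.3.0-gpu

# Add your own dependencies and code (placeholders)
RUN pip install --no-cache-dir pandas scikit-learn
COPY src/ /src
WORKDIR /src

CMD ["python", "train.py"]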
Driver in a container:

As mentioned in the beginning, this is not something you want on a workstation. The process requires you to start the driver container when no other display driver is loaded (that is, via SSH, for example). Furthermore, at the moment of writing, only Ubuntu 16.04, Ubuntu 18.04 and CentOS 7 were supported.

There is an official guide, and below are extractions from it for Ubuntu 18.04.
Point the NVIDIA container runtime at the containerized driver by uncommenting the root option in its config:

sudo sed -i 's/^#root/root/' /etc/nvidia-container-runtime/config.toml
Load the ipmi_msghandler kernel module at boot and blacklist the nouveau driver:

sudo tee /etc/modules-load.d/ipmi.conf <<< "ipmi_msghandler" \
  && sudo tee /etc/modprobe.d/blacklist-nouveau.conf <<< "blacklist nouveau" \
  && sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf <<< "options nouveau modeset=0"
If you are using an AWS kernel, ensure that the i2c_core kernel module is enabled:
sudo tee /etc/modules-load.d/ipmi.conf <<< "i2c_core"
Update the initramfs:

sudo update-initramfs -u
Now it's time to reboot for the changes to take effect. After the reboot, check that no nouveau or nvidia modules are loaded. The commands below should return nothing:
lsmod | grep nouveau
lsmod | grep nvidia
The guide offers a plain docker command to run the driver, but I prefer docker-compose. Save the following as driver.yml:
version: "3.0"
services:
driver:
image: nvidia/driver:450.80.02-ubuntu18.04
privileged: true
restart: unless-stopped
volumes:
- /run/nvidia:/run/nvidia:shared
- /var/log:/var/log
pid: "host"
container_name: nvidia-driver
Use docker-compose -f driver.yml up -d to start the driver container. It will take a couple of minutes to compile the modules for your kernel. You can use docker logs nvidia-driver -f to follow the process; wait for the 'Done, now waiting for signal' line to appear. Alternatively, use lsmod | grep nvidia to see whether the driver modules are loaded. When it's ready, you should see something like this:
nvidia_modeset 1183744 0
nvidia_uvm 970752 0
nvidia 19722240 17 nvidia_uvm,nvidia_modeset
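Once the modules are loaded, you can verify that GPU containers work end to end, for example (the CUDA image tag here is just an example that matches the 450.xx driver):

# Should print the usual nvidia-smi table if the containerized driver works
sudo docker run --rm --runtime=nvidia \
    -e NVIDIA_VISIBLE_DEVICES=all \
    nvidia/cuda:11.0-base nvidia-smi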