I want to create a neural network in tensorflow 2.x that trains on a GPU, and I want to set up all the necessary infrastructure inside a docker-compose network (assuming that this is actually possible for now). As far as I know, in order to train a tensorflow model on a GPU, I need the CUDA toolkit and the NVIDIA driver. Installing these dependencies natively on my computer (OS: Ubuntu 18.04) is always quite a pain, as there are many version dependencies between tensorflow, CUDA and the NVIDIA driver. So, I was trying to find a way to create a docker-compose file that contains a service for tensorflow, CUDA and the NVIDIA driver, but I am getting the following error:
# Start the services
sudo docker-compose -f docker-compose-test.yml up --build
Starting vw_image_cls_nvidia-driver_1 ... done
Starting vw_image_cls_nvidia-cuda_1 ... done
Recreating vw_image_cls_tensorflow_1 ... error
ERROR: for vw_image_cls_tensorflow_1 Cannot start service tensorflow: OCI runtime create failed: container_linux.go:346: starting container process caused "exec: \"import\": executable file not found in $PATH": unknown
ERROR: for tensorflow Cannot start service tensorflow: OCI runtime create failed: container_linux.go:346: starting container process caused "exec: \"import\": executable file not found in $PATH": unknown
ERROR: Encountered errors while bringing up the project.
My docker-compose file looks as follows:
# version 2.3 is required for NVIDIA runtime
version: '2.3'

services:
  nvidia-driver:
    # NVIDIA GPU driver used by the CUDA Toolkit
    image: nvidia/driver:440.33.01-ubuntu18.04
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      # Do we need this volume to make the driver accessible by other containers in the network?
      - nvidia_driver:/usr/local/nvidai/:ro # Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
    networks:
      - net

  nvidia-cuda:
    depends_on:
      - nvidia-driver
    image: nvidia/cuda:10.1-base-ubuntu18.04
    volumes:
      # Do we need the driver volume here?
      - nvidia_driver:/usr/local/nvidai/:ro # Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      # Do we need to create an additional volume for this service to be accessible by the tensorflow service?
    devices:
      # Do we need to list the devices here, or only in the tensorflow service? Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      - /dev/nvidiactl
      - /dev/nvidia-uvm
      - /dev/nvidia0
    networks:
      - net

  tensorflow:
    image: tensorflow/tensorflow:2.0.1-gpu # Does this ship with cuda10.0 installed or do I need a separate container for it?
    runtime: nvidia
    restart: always
    privileged: true
    depends_on:
      - nvidia-cuda
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      # Volumes related to source code and config files
      - ./src:/src
      - ./configs:/configs
      # Do we need the driver volume here?
      - nvidia_driver:/usr/local/nvidai/:ro # Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      # Do we need an additional volume from the nvidia-cuda service?
    command: import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000]))); print("SUCCESS")
    devices:
      # Devices listed here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      - /dev/nvidiactl
      - /dev/nvidia-uvm
      - /dev/nvidia0
      - /dev/nvidia-uvm-tools
    networks:
      - net

volumes:
  nvidia_driver:

networks:
  net:
    driver: bridge
And my /etc/docker/daemon.json file looks as follows:
{"default-runtime":"nvidia",
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
So, it seems like the error is somehow related to configuring the NVIDIA runtime, but more importantly, I am almost certain that I didn't set up my docker-compose file correctly. So, my questions are: how do I set this up correctly, and do I even need separate services for the driver and CUDA, or can it all be handled from the tensorflow service (and the docker-compose.yml)? Thank you very much for your help, I highly appreciate it.
I agree that installing all tensorflow-gpu dependencies is rather painful. Fortunately, it's rather easy with Docker, as you only need the NVIDIA Driver and the NVIDIA Container Toolkit (a sort of a plugin). The rest (CUDA, cuDNN) ships inside the Tensorflow images, so you don't need them on the Docker host.
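As a side note (not part of the original setup), if you want to check which CUDA and cuDNN versions a particular Tensorflow image was built against, one quick way is the build-info API that TF 2.3+ exposes; no GPU is needed for this:

# Prints a dict that includes cuda_version and cudnn_version
docker run --rm tensorflow/tensorflow:2.3.0-gpu \
    python -c "import tensorflow as tf; print(tf.sysconfig.get_build_info())"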
The driver can be deployed as a container too, but I do not recommend that for a workstation. It is meant to be used on servers where there is no GUI (X-server, etc.). The containerized driver is covered at the end of this post; for now, let's see how to start tensorflow-gpu with docker-compose. The process is the same regardless of whether the driver runs in a container or not.
Prerequisites: Docker and docker-compose, the NVIDIA driver, and the NVIDIA Container Toolkit installed on the host.
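For reference, here is a minimal install sketch for the NVIDIA Container Toolkit on Ubuntu 18.04, based on the apt repository NVIDIA documented at the time (check the official installation guide for the current commands before copying this):

# Add NVIDIA's package repository for the container toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
    sudo tee /etc/apt/sources.list.d/nvidia-docker.list

# Install the toolkit and restart Docker so it picks up the new runtime
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker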
To enable GPU support for a container you need to create the container with the NVIDIA Container Toolkit. There are two ways you can do that:

1. Set nvidia as Docker's default container runtime. It is fine to do so, as it works just like the default runtime unless some NVIDIA-specific environment variables are present (more on that later). This is done by placing "default-runtime": "nvidia" into Docker's daemon.json, /etc/docker/daemon.json:
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
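After editing daemon.json, restart the Docker daemon and check that the runtime is registered (assuming Docker is managed by systemd):

# Reload Docker so the new default runtime takes effect
sudo systemctl restart docker

# Should list 'nvidia' among the runtimes (and as the default, if configured)
docker info | grep -i runtime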
2. Specify runtime: nvidia for a particular service in docker-compose; this is only possible with compose file format version 2.3.

Here is a sample docker-compose.yml to launch Tensorflow with GPU:
version: "2.3" # the only version where 'runtime' option is supported
services:
test:
image: tensorflow/tensorflow:2.3.0-gpu
# Make Docker create the container with NVIDIA Container Toolkit
# You don't need it if you set 'nvidia' as the default runtime in
# daemon.json.
runtime: nvidia
# the lines below are here just to test that TF can see GPUs
entrypoint:
- /usr/local/bin/python
- -c
command:
- "import tensorflow as tf; tf.test.is_gpu_available(cuda_only=False, min_cuda_compute_capability=None)"
By running this with docker-compose up you should see a line with the GPU specs in it. It appears at the end and looks like this:
test_1 | 2021-01-23 11:02:46.500189: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/device:GPU:0 with 1624 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)
And that is all you need to launch an official Tensorflow image with GPU.
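Note that tf.test.is_gpu_available is deprecated in later TF 2.x releases; an equivalent check (just a small sketch, not part of the original compose file) is:

# Minimal GPU visibility check for TF 2.x
import tensorflow as tf

# An empty list means TensorFlow cannot see any GPU
print("GPUs visible to TensorFlow:", tf.config.list_physical_devices("GPU"))

You can put that one-liner into the command of the compose file above instead of the deprecated call.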
As I mentioned, the nvidia runtime works just like the default runtime unless some NVIDIA-specific environment variables are present. These are listed and explained here. You only need to care about them if you build a custom image and want to enable GPU support in it. Official Tensorflow GPU images have them inherited from the CUDA images they use as a base, so you only need to start the image with the right runtime, as in the example above.
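For illustration only (the service name and values below are hypothetical examples, not something from the post), this is what setting those variables yourself in a compose service would look like:

version: "2.3"

services:
  my-gpu-app:          # hypothetical custom image
    build: .
    runtime: nvidia
    environment:
      # make all GPUs visible inside the container
      - NVIDIA_VISIBLE_DEVICES=all
      # enable the compute (CUDA) and utility (nvidia-smi) capabilities
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility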
If you are interested in customising a Tensorflow image, I wrote another post on that.
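Without going into the details of that post, the usual approach is simply to extend the official GPU image, so CUDA, cuDNN and the NVIDIA variables come along for free. A hypothetical Dockerfile sketch (package names and paths are placeholders):

# The official GPU image already contains CUDA, cuDNN and the NVIDIA env variables
FROM tensorflow/tensorflow:2.3.0-gpu

# Add your own dependencies and code (placeholders)
RUN pip install --no-cache-dir pandas scikit-learn
COPY src/ /src
WORKDIR /src

CMD ["python", "train.py"]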
Driver in a container:

As mentioned in the beginning, this is not something you want on a workstation. The process requires you to start the driver container when no other display driver is loaded (that is, via SSH, for example). Furthermore, at the moment of writing, only Ubuntu 16.04, Ubuntu 18.04 and CentOS 7 were supported.

There is an official guide, and below are extractions from it for Ubuntu 18.04.
Point the NVIDIA container runtime at the containerized driver by uncommenting the root option in its config:

sudo sed -i 's/^#root/root/' /etc/nvidia-container-runtime/config.toml
Load the ipmi_msghandler kernel module at boot and blacklist the nouveau driver:

sudo tee /etc/modules-load.d/ipmi.conf <<< "ipmi_msghandler" \
  && sudo tee /etc/modprobe.d/blacklist-nouveau.conf <<< "blacklist nouveau" \
  && sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf <<< "options nouveau modeset=0"
If you are using an AWS kernel, ensure that the i2c_core kernel module is enabled:
sudo tee /etc/modules-load.d/ipmi.conf <<< "i2c_core"
Update the initramfs:

sudo update-initramfs -u
Now it's time to reboot for the changes to take effect. After the reboot, check that no nouveau or nvidia modules are loaded. The commands below should return nothing:
lsmod | grep nouveau
lsmod | grep nvidia
The guide offers a plain docker command to run the driver, but I prefer docker-compose. Save the following as driver.yml:
version: "3.0"
services:
driver:
image: nvidia/driver:450.80.02-ubuntu18.04
privileged: true
restart: unless-stopped
volumes:
- /run/nvidia:/run/nvidia:shared
- /var/log:/var/log
pid: "host"
container_name: nvidia-driver
Use docker-compose -f driver.yml up -d to start the driver container. It will take a couple of minutes to compile the modules for your kernel. You can use docker logs nvidia-driver -f to follow the process; wait for the 'Done, now waiting for signal' line to appear. Alternatively, use lsmod | grep nvidia to see whether the driver modules are loaded. When it's ready, you should see something like this:
nvidia_modeset 1183744 0
nvidia_uvm 970752 0
nvidia 19722240 17 nvidia_uvm,nvidia_modeset
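Once the modules are loaded, you can verify that GPU containers work end to end, for example (the CUDA image tag here is just an example that matches the 450.xx driver):

# Should print the usual nvidia-smi table if the containerized driver works
sudo docker run --rm --runtime=nvidia \
    -e NVIDIA_VISIBLE_DEVICES=all \
    nvidia/cuda:11.0-base nvidia-smi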