AWS EC2 instance losing GPU support after reboot

Rebooting an instance on Tuesday, I first ran into the problem of losing GPU support on an AWS p2.xlarge machine with the Ubuntu Deep Learning AMI.

I have now tested it three times on two different days, and a colleague ran into the same problem, so I guess it is an AWS bug. Still, maybe someone has an idea of how to debug it better.

Basically, after a shutdown and reboot, the instance no longer has the nvidia module loaded in the kernel. Furthermore, according to dmesg, a different kernel is loaded. All of this happens without me actively causing it.
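
A quick way to see the mismatch is to compare the running kernel with the kernels the NVIDIA module was actually built for. This is only a minimal check, assuming the AMI manages the driver through DKMS:

ubuntu@...:~$ uname -r                                               # running kernel
ubuntu@...:~$ dkms status | grep -i nvidia                           # kernels the nvidia module was built for
ubuntu@...:~$ ls /lib/modules/$(uname -r)/updates/dkms/ 2>/dev/null  # is nvidia.ko present for this kernel?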

Here are the steps to reproduce the problem using a fresh instance and no custom code. I am working in Ireland (eu-west-1); the instance was launched in Availability Zone eu-west-1a:

  • Launched an instance with the "Deep Learning AMI (Ubuntu) Version 21.2" (ami-0e9085a8d461c2d01)
  • Instance type: p2.xlarge, all defaults
  • Logged into the instance and ran only the following four commands:
ubuntu@...:~$ lsmod | grep nvidia
nvidia              16592896  0
ipmi_msghandler        49152  1 nvidia
ubuntu@...:~$ dmesg | less
...
[    0.000000] Linux version 4.4.0-1075-aws (buildd@lgw01-amd64-035) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10) ) #85-Ubuntu SMP Thu Jan 17 17:15:12 UTC 2019 (Ubuntu 4.4.0-1075.85-aws 4.4.167)
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.4.0-1075-aws root=UUID=96950bba-70e8-4a4b-9d78-d2bc1c767e04 ro console=tty1 console=ttyS0 nvme.io_timeout=4294967295
...
ubuntu@...:~$ nvidia-smi
Tue Mar 19 16:41:53 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   42C    P8    32W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
ubuntu@...:~$ sudo shutdown now
  • The instance does not shut down right away; perhaps it is installing updates, which, however, I have NOT actively triggered (see the log check after these steps).
  • After the state showed "stopped", I started the instance again via the AWS Management Console
  • Ran the first three commands:
ubuntu@...:~$ lsmod | grep nvidia
(no output)
ubuntu@...:~$ dmesg | less
...
[    0.000000] Linux version 4.4.0-1077-aws (buildd@lcy01-amd64-021) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10) ) #87-Ubuntu SMP Wed Mar 6 00:03:05 UTC 2019 (Ubuntu 4.4.0-1077.87-aws 4.4.170)
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.4.0-1077-aws root=UUID=96950bba-70e8-4a4b-9d78-d2bc1c767e04 ro console=tty1 console=ttyS0 nvme.io_timeout=4294967295
...
ubuntu@...:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
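
To check whether the delayed shutdown was really caused by unattended upgrades pulling in a new kernel (this is only my assumption), the apt and unattended-upgrades logs can be inspected after the reboot:

ubuntu@...:~$ grep linux-image /var/log/apt/history.log                  # did apt install a new kernel package?
ubuntu@...:~$ less /var/log/unattended-upgrades/unattended-upgrades.log
ubuntu@...:~$ dpkg --list 'linux-image*' | grep ^ii                      # kernels installed now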

How can I force the instance to boot with the 4.4.0-1075-aws kernel? Since it uses HVM virtualization, I cannot choose a kernel directly in the launch dialog.
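
I assume the answer involves pinning GRUB to the old kernel, roughly like the following (untested; the exact menu entry name is only a guess and would need to be checked in /boot/grub/grub.cfg):

# /etc/default/grub
GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 4.4.0-1075-aws"

ubuntu@...:~$ sudo update-grub
ubuntu@...:~$ sudo reboot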

Simon asked Mar 20 '19 12:03

People also ask

What happens when an EC2 instance is rebooted?

When you reboot an instance, it keeps its public DNS name (IPv4), private and public IPv4 address, IPv6 address (if applicable), and any data on its instance store volumes. Rebooting an instance doesn't start a new instance billing period (with a minimum one-minute charge), unlike stopping and starting your instance.

Do EC2 instances have GPU?

Amazon EC2 G4 instances feature NVIDIA T4 Tensor Core GPUs, providing access to one GPU or multiple GPUs, with different amounts of vCPU and memory.

What is a GPU instance in EC2?

Amazon EC2 GPU instances are virtual machines (VMs), also known as compute instances, in the Amazon Web Services public cloud with added graphics acceleration capabilities.

How do I reboot my AWS EC2 instance?

If you use the Amazon EC2 console, a command line tool, or the Amazon EC2 API to reboot your instance, we perform a hard reboot if the instance does not cleanly shut down within a few minutes. If you use AWS CloudTrail, then using Amazon EC2 to reboot your instance also creates an API record of when your instance was rebooted.
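
For example, a reboot via the AWS CLI looks like this (the instance ID below is a placeholder):

aws ec2 reboot-instances --instance-ids i-0123456789abcdef0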

How do I set up GPU instances on Amazon ECS?

To use these instance types, you must use the Amazon EC2 console, the AWS CLI, or the API and manually register the instances to your cluster. Before you begin working with GPUs on Amazon ECS, be aware of the following considerations: your clusters can contain a mix of GPU and non-GPU container instances.
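
As a minimal sketch, on an ECS-optimized AMI the container agent joins a cluster based on /etc/ecs/ecs.config (the cluster name below is a placeholder):

# /etc/ecs/ecs.config on the container instance
ECS_CLUSTER=my-gpu-cluster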

Can I use an NVIDIA GPU with Amazon EC2?

Amazon EC2 GPU-based container instances using the p2, p3, g3, and g4 instance types provide access to NVIDIA GPUs. For more information, see Linux Accelerated Computing Instances in the Amazon EC2 User Guide for Linux Instances .

Why is my EC2 instance terminated when I stop it?

If your instance is part of an Amazon EC2 Auto Scaling group, then stopping the instance might terminate it. Instances launched with Amazon EMR, AWS CloudFormation, or AWS Elastic Beanstalk might be part of an AWS Auto Scaling group.



2 Answers

There seems to be a problem with building older NVIDIA drivers on 4.4.0-107x-aws kernels. You can install newer NVIDIA drivers, which should work fine with the current kernel:

wget http://us.download.nvidia.com/tesla/410.104/NVIDIA-Linux-x86_64-410.104.run
sudo sh ./NVIDIA-Linux-x86_64-410.104.run --no-drm --disable-nouveau --dkms --silent --install-libglvnd 
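
Because the installer is run with --dkms, the module should also be rebuilt automatically for future kernel updates. A quick check after installation (just a verification suggestion):

dkms status | grep -i nvidia    # the 410.104 module should be listed for the running kernel
nvidia-smi                      # should find the GPU again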

According to an AWS representative, the drivers were updated in the Deep Learning AMI on 21/03/2019 [AWS forums].

alkamid answered Oct 18 '22 01:10


I experienced the same issue, and running the following fixed it for me:

sudo apt-get install nvidia-cuda-toolkit
sudo reboot
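
After the reboot, the same checks from the question can confirm that the module is back (just a verification suggestion):

lsmod | grep nvidia
nvidia-smi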

Good luck!

melaanya answered Oct 18 '22 03:10