Rebooting an instance on Tuesday, I ran into the problem of losing GPU support on an AWS p2.xlarge machine with the Ubuntu Deep Learning AMI.
I have now tested this three times on two different days, and a colleague ran into the same problem, so I suspect it is an AWS bug. Still, maybe someone has an idea how to debug it further.
Basically, after a shutdown and restart, the instance no longer has the nvidia module loaded in the kernel, and according to dmesg a different kernel is booted. All of this happens without me actively causing it.
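For reference, the kernel switch can also be seen directly by checking the running kernel release before and after the reboot (the versions here are taken from the dmesg output below):

ubuntu@...:~$ uname -r
4.4.0-1075-aws
(shutdown, start the instance again, reconnect)
ubuntu@...:~$ uname -r
4.4.0-1077-aws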
Here are the steps to reproduce the problem using a fresh instance and no custom code. I am working in Ireland (eu-west-1); the instance was launched in Availability Zone eu-west-1a:
ubuntu@...:~$ lsmod | grep nvidia
nvidia 16592896 0
ipmi_msghandler 49152 1 nvidia
ubuntu@...:~$ dmesg | less
...
[ 0.000000] Linux version 4.4.0-1075-aws (buildd@lgw01-amd64-035) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10) ) #85-Ubuntu SMP Thu Jan 17 17:15:12 UTC 2019 (Ubuntu 4.4.0-1075.85-aws 4.4.167)
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.4.0-1075-aws root=UUID=96950bba-70e8-4a4b-9d78-d2bc1c767e04 ro console=tty1 console=ttyS0 nvme.io_timeout=4294967295
...
ubuntu@...:~$ nvidia-smi
Tue Mar 19 16:41:53 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000000:00:1E.0 Off | 0 |
| N/A 42C P8 32W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
ubuntu@...:~$ sudo shutdown now
(start the instance again from the EC2 console and reconnect)
ubuntu@...:~$ lsmod | grep nvidia
(no output)
ubuntu@...:~$ dmesg | less
...
[ 0.000000] Linux version 4.4.0-1077-aws (buildd@lcy01-amd64-021) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10) ) #87-Ubuntu SMP Wed Mar 6 00:03:05 UTC 2019 (Ubuntu 4.4.0-1077.87-aws 4.4.170)
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.4.0-1077-aws root=UUID=96950bba-70e8-4a4b-9d78-d2bc1c767e04 ro console=tty1 console=ttyS0 nvme.io_timeout=4294967295
...
ubuntu@...:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
How can I force the instance to boot with kernel 4.4.0-1075-aws? Since it uses HVM virtualization, I cannot choose a kernel directly in the launch dialog.
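The only workaround I can think of is pinning the kernel from inside the instance, roughly along the following lines (just a sketch, assuming the 4.4.0-1075 image is still installed under /boot; the exact GRUB menu entry title may differ, see /boot/grub/grub.cfg):

# keep apt (and unattended upgrades) from replacing the working kernel
sudo apt-mark hold linux-image-4.4.0-1075-aws linux-aws
# make GRUB boot the older kernel by default, then regenerate its config
sudo sed -i 's/^GRUB_DEFAULT=.*/GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 4.4.0-1075-aws"/' /etc/default/grub
sudo update-grub

Holding the kernel packages should keep unattended upgrades from pulling in a newer linux-aws kernel again, but I would still be interested in a cleaner solution.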
There seems to be a problem with building older NVIDIA drivers on 4.4.0-107x-aws kernels. You can install newer NVIDIA drivers, which should work fine with the current kernel:
wget http://us.download.nvidia.com/tesla/410.104/NVIDIA-Linux-x86_64-410.104.run
sudo sh ./NVIDIA-Linux-x86_64-410.104.run --no-drm --disable-nouveau --dkms --silent --install-libglvnd
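After the installer finishes, a quick sanity check (exact output will differ) is to confirm that the module was built for and loaded into the running kernel:

dkms status          # the nvidia module should be listed for the running 4.4.0-1077-aws kernel
lsmod | grep nvidia  # the module should be loaded again
nvidia-smi           # should show the Tesla K80 as before the reboot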
According to an AWS representative, the drivers were updated in the Deep Learning AMI on 21/03/2019 [AWS forums].
I experienced the same issue, and the following fixed it for me:
sudo apt-get install nvidia-cuda-toolkit
sudo reboot
Good luck!