
NVidia drivers not running on AWS after restarting the AMI


Hi everybody, I have the following problem:

I started a P2 instance with this AMI. I installed some tools like screen, torch, etc. Then I successfully ran some experiments using the GPU and created an image of the instance, so that I could terminate it and start it again later.

Later I started a new instance from the AMI I had created before. Everything looked fine - screen, torch, and my experiments were present on the system - but I couldn't run the same experiments as before:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

To me it looks like the drivers are installed (because all the other tools from before are still there), but they are not running. Is that a correct assumption? How can I start them?
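One way to check whether the driver is installed but simply not loaded - a minimal sketch, assuming an Ubuntu-based AMI with the driver shipped as a kernel module:

# Is the nvidia kernel module loaded right now?
lsmod | grep -i nvidia

# Is a driver package installed at all? (Debian/Ubuntu)
dpkg -l | grep -i nvidia

# Try loading the module for the running kernel, then re-check
sudo modprobe nvidia && nvidia-smi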

Peter Hroššo asked Oct 23 '16 08:10




1 Answer

We had this problem recently. In our case, the default kernel on the AWS instance had been upgraded (from 4.4.0-1049-aws to 4.4.0-1061-aws), but the new kernel did not have the NVIDIA modules installed:

ubuntu@ip-XXX-XXX-XXX-XXX:~$ ls -laR /lib/modules/4.4.0-1061-aws | grep -i nvidia
ubuntu@ip-XXX-XXX-XXX-XXX:~$ ls -laR /lib/modules/4.4.0-1049-aws | grep -i nvidia
-rw-r--r--  1 root root    87368 Jun 27 10:21 nvidia-drm.ko
-rw-r--r--  1 root root  1155304 Jun 27 10:21 nvidia-modeset.ko
-rw-r--r--  1 root root  1163016 Jun 27 10:21 nvidia-uvm.ko
-rw-r--r--  1 root root 18014088 Jun 27 10:21 nvidia.ko
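For comparison, a quick way to confirm that the module simply isn't available for the kernel you are currently running (a minimal sketch; module names may vary with the driver packaging):

# Is any nvidia module currently loaded?
lsmod | grep -i nvidia

# Does the running kernel have an nvidia module tree at all?
ls -laR /lib/modules/$(uname -r) | grep -i nvidia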

Check your kernel version (uname -a) to see if this is the case for you. The GRUB configuration still allowed booting the old kernel image (1049), but by default it was loading the new one (1061). The relevant portion of /boot/grub/grub.cfg:

ubuntu@ip-XXX-XXX-XXX-XXX:~$ grep -i -e "ubuntu, with linux" /boot/grub/grub.cfg
    menuentry 'Ubuntu, with Linux 4.4.0-1061-aws' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-4.4.0-1061-aws-advanced-XXXX' {
    menuentry 'Ubuntu, with Linux 4.4.0-1061-aws (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-4.4.0-1061-aws-recovery-XXXX' {
    menuentry 'Ubuntu, with Linux 4.4.0-1049-aws' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-4.4.0-1049-aws-advanced-XXXX' {
    menuentry 'Ubuntu, with Linux 4.4.0-1049-aws (recovery mode)' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-4.4.0-1049-aws-recovery-XXXX' {

You can force the instance to load the old kernel on the next reboot by using grub-reboot:

sudo /usr/sbin/grub-reboot "Advanced options for Ubuntu>Ubuntu, with Linux 4.4.0-1049-aws"
sudo reboot

This will boot the instance with the old kernel, for which the NVIDIA modules are installed.
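Note that grub-reboot only applies to the next boot. To keep the old kernel as the default across reboots, one option is to pin it in /etc/default/grub and regenerate the GRUB configuration (a sketch, assuming the same menu entry names as in the grub.cfg above; the longer-term fix is to get the NVIDIA modules built for the new kernel, e.g. via DKMS):

# In /etc/default/grub, set the default entry to the old kernel:
#   GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 4.4.0-1049-aws"
# then regenerate the GRUB config:
sudo update-grub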

user4760 answered Oct 11 '22 00:10