Everyday Ops Pitfalls: the Nvidia Driver

Driver Update

CYQ, August 6, 2019

Background

After upgrading the server's Nvidia driver from 384 to 418, the nvidia-smi command failed with: Failed to initialize NVML: Driver/library version mismatch

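Stack Overflow has several workarounds for this problem. I tried rebooting, which did not help. Reloading the kernel modules did work, but the fix was lost again after every reboot.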

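The workarounds did not stick, but they do tell us something: after the driver update, the kernel modules belonging to the driver were not replaced, and other processes were still referencing the old driver's modules, so those modules have to be removed by hand.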
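The mismatch can be observed directly by reading the version of the kernel module that is actually loaded and comparing it with the driver version that was just installed. The following is a minimal sketch, assuming the driver exposes /proc/driver/nvidia/version with an NVRM line of the usual "Kernel Module  <version>" form; the function name loaded_nvidia_version is made up for this example.

```shell
# Print the driver version reported by the currently loaded kernel module.
# Reads the given file, or /proc/driver/nvidia/version when no argument is passed.
# Assumption: the NVRM line looks like
#   NVRM version: NVIDIA UNIX x86_64 Kernel Module  384.130  <build date>
loaded_nvidia_version() {
    version_file="${1:-/proc/driver/nvidia/version}"
    # Keep only the number that follows "Kernel Module"; other lines print nothing.
    sed -n 's/.*Kernel Module[[:space:]]*\([0-9][0-9.]*\).*/\1/p' "$version_file"
}
```

If loaded_nvidia_version still prints a 384.x version after the upgrade to 418, the old module is the one in memory, which matches the error that nvidia-smi reports.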

Solution

First make sure the old driver version has been completely removed, then check the kernel version

$ uname -r      
4.13.0-32-generic

Check the versions of the installed kernel modules

$ find /lib/modules/ -name "*nvidia*"
/lib/modules/4.10.0-40-generic/updates/dkms/nvidia_384_drm.ko
/lib/modules/4.10.0-40-generic/updates/dkms/nvidia_384_uvm.ko
/lib/modules/4.10.0-40-generic/updates/dkms/nvidia_384.ko
/lib/modules/4.10.0-40-generic/updates/dkms/nvidia_384_modeset.ko
/lib/modules/4.10.0-38-generic/updates/dkms/nvidia_384_drm.ko
/lib/modules/4.10.0-38-generic/updates/dkms/nvidia_384_uvm.ko
/lib/modules/4.10.0-38-generic/updates/dkms/nvidia_384.ko
/lib/modules/4.10.0-38-generic/updates/dkms/nvidia_384_modeset.ko
/lib/modules/4.13.0-31-generic/kernel/drivers/video/fbdev/nvidia
/lib/modules/4.13.0-31-generic/kernel/drivers/video/fbdev/nvidia/nvidiafb.ko
/lib/modules/4.13.0-31-generic/kernel/drivers/net/ethernet/nvidia
/lib/modules/4.13.0-31-generic/updates/dkms/nvidia_384_drm.ko
/lib/modules/4.13.0-31-generic/updates/dkms/nvidia_384_uvm.ko
/lib/modules/4.13.0-31-generic/updates/dkms/nvidia_384.ko
/lib/modules/4.13.0-31-generic/updates/dkms/nvidia_384_modeset.ko
/lib/modules/4.13.0-32-generic/kernel/drivers/video/fbdev/nvidia
/lib/modules/4.13.0-32-generic/kernel/drivers/video/fbdev/nvidia/nvidiafb.ko
/lib/modules/4.13.0-32-generic/kernel/drivers/net/ethernet/nvidia
/lib/modules/4.13.0-32-generic/updates/dkms/nvidia_418.ko
/lib/modules/4.13.0-32-generic/updates/dkms/nvidia_418_drm.ko
/lib/modules/4.13.0-32-generic/updates/dkms/nvidia_418_uvm.ko
/lib/modules/4.13.0-32-generic/updates/dkms/nvidia_418_modeset.ko
/lib/modules/4.13.0-32-generic/updates/dkms/nvidia_384_drm.ko
/lib/modules/4.13.0-32-generic/updates/dkms/nvidia_384_uvm.ko
/lib/modules/4.13.0-32-generic/updates/dkms/nvidia_384.ko
/lib/modules/4.13.0-32-generic/updates/dkms/nvidia_384_modeset.ko
/lib/modules/4.13.0-26-generic/kernel/drivers/video/fbdev/nvidia
/lib/modules/4.13.0-26-generic/kernel/drivers/video/fbdev/nvidia/nvidiafb.ko
/lib/modules/4.13.0-26-generic/kernel/drivers/net/ethernet/nvidia
/lib/modules/4.13.0-26-generic/updates/dkms/nvidia_384_drm.ko
/lib/modules/4.13.0-26-generic/updates/dkms/nvidia_384_uvm.ko
/lib/modules/4.13.0-26-generic/updates/dkms/nvidia_384.ko
/lib/modules/4.13.0-26-generic/updates/dkms/nvidia_384_modeset.ko
/lib/modules/4.10.0-42-generic/updates/dkms/nvidia_384_drm.ko
/lib/modules/4.10.0-42-generic/updates/dkms/nvidia_384_uvm.ko
/lib/modules/4.10.0-42-generic/updates/dkms/nvidia_384.ko
/lib/modules/4.10.0-42-generic/updates/dkms/nvidia_384_modeset.ko
/lib/modules/4.10.0-37-generic/updates/dkms/nvidia_384_drm.ko
/lib/modules/4.10.0-37-generic/updates/dkms/nvidia_384_uvm.ko
/lib/modules/4.10.0-37-generic/updates/dkms/nvidia_384.ko
/lib/modules/4.10.0-37-generic/updates/dkms/nvidia_384_modeset.ko

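Note that kernel 4.13.0-32-generic, the running kernel, has modules from two driver versions (384 and 418) side by side. That is the root of the problem.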
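Spotting the duplicate by eye is easy to get wrong in a long listing, so the check can be scripted. This is a rough sketch that assumes the modules follow the nvidia_<series>.ko naming seen in the listing above; the function name kernels_with_mixed_drivers is hypothetical.

```shell
# Read module paths on stdin (one per line) and print every kernel release
# whose updates/dkms directory holds core nvidia modules from more than one
# driver series.
kernels_with_mixed_drivers() {
    # Keep only the core module (nvidia_<series>.ko, not _drm/_uvm/_modeset)
    # and reduce each path to "<kernel> <series>".
    sed -n 's|^/lib/modules/\([^/]*\)/updates/dkms/nvidia_\([0-9][0-9]*\)\.ko$|\1 \2|p' |
        sort -u |
        awk '{ count[$1]++ } END { for (k in count) if (count[k] > 1) print k }' |
        sort
}
```

It can be fed straight from the same find command, for example: find /lib/modules/ -name "*nvidia*" | kernels_with_mixed_drivers. With the listing above, the only kernel that qualifies is 4.13.0-32-generic.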

Delete the nvidia modules in that directory, or back them up, which is the safer option

$ mkdir -p ~/dkms-backup
$ sudo mv /lib/modules/4.13.0-32-generic/updates/dkms/nvidia* ~/dkms-backup/

Rebuild the kernel modules for the new driver version

$ sudo dpkg-reconfigure nvidia-418                                          
Removing all DKMS Modules
Done.
update-initramfs: deferring update (trigger activated)

A modprobe blacklist file has been created at /etc/modprobe.d to prevent Nouveau from loading. This can be re/nvidia-graphics-drivers.conf.
A new initrd image has also been created. To revert, please replace /boot/initrd-4.13.0-32-generic with /boot

*****************************************************************************
*** Reboot your computer and verify that the NVIDIA graphics driver can   ***
*** be loaded.                                                            ***
*****************************************************************************

INFO:Enable nvidia-418
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/dell_latitude
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/put_your_quirks_here
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/lenovo_thinkpad
Loading new nvidia-418-418.40.04 DKMS files...
Building only for 4.13.0-32-generic
Building for architecture x86_64
Building initial module for 4.13.0-32-generic
Done.

nvidia_418:
Running module version sanity check.
 - Original module
   - This kernel never originally had a module by this name
 - Installation
   - Installing to /lib/modules/4.13.0-32-generic/updates/dkms/

nvidia_418_modeset.ko:
Running module version sanity check.
 - Original module
   - This kernel never originally had a module by this name
 - Installation
   - Installing to /lib/modules/4.13.0-32-generic/updates/dkms/

nvidia_418_drm.ko:
Running module version sanity check.
 - Original module
   - This kernel never originally had a module by this name
 - Installation
   - Installing to /lib/modules/4.13.0-32-generic/updates/dkms/

nvidia_418_uvm.ko:
Running module version sanity check.
 - Original module
   - This kernel never originally had a module by this name
 - Installation
   - Installing to /lib/modules/4.13.0-32-generic/updates/dkms/

depmod....

DKMS: install completed.
Processing triggers for initramfs-tools (0.122ubuntu8.8) ...
update-initramfs: Generating /boot/initrd.img-4.13.0-32-generic
W: Possible missing firmware /lib/firmware/ast_dp501_fw.bin for module ast

At this point the directory should contain only the new version's modules

$ ls /lib/modules/4.13.0-32-generic/updates/dkms                            
bbswitch.ko  nvidia_418_drm.ko  nvidia_418.ko  nvidia_418_modeset.ko  nvidia_418_uvm.ko
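The same check can be made mechanical before rebooting. Below is a small sketch, again assuming the nvidia_<series> file naming shown in the ls output; stale_nvidia_modules is an illustrative name, not an existing tool.

```shell
# List nvidia module files in a DKMS directory that do not belong to the
# expected driver series (for example 418). No output means the directory
# holds only the expected modules.
stale_nvidia_modules() {
    dkms_dir="$1"
    series="$2"
    for module in "$dkms_dir"/nvidia*.ko; do
        [ -e "$module" ] || continue    # the glob matched nothing
        case "$(basename "$module")" in
            "nvidia_${series}.ko" | "nvidia_${series}_"*.ko) ;;   # expected series
            *) echo "$module" ;;                                  # leftover from another series
        esac
    done
}
```

A call such as stale_nvidia_modules "/lib/modules/$(uname -r)/updates/dkms" 418 should print nothing once the rebuild has succeeded; unrelated modules like bbswitch.ko are ignored.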

Then reboot, and that's it.