起因
服务器的Nvidia驱动从384更新到418之后,nvidia-smi
命令报错:Failed to initialize NVML: Driver/library version mismatch
Stackoverflow上有关于这个问题的几个 work around,重启试过,没用。重新加载内核模块是有用的。但每次重启之后就会失效。
虽然没有用,但是可以得到的信息是:驱动更新后 linux 内核对应驱动的 kernel module 并没有重置, 外部相关进程引用了旧版本驱动相关的 module, 需要手动卸载。
解决方案
首先确认旧版本driver已经删除完毕,之后查看内核版本号
1
2
$ uname -r
4.13.0-32-generic
检查内核模块的版本
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
$ find /lib/modules/ -name "*nvidia*"
/lib/modules/4.10.0-40-generic/updates/dkms/nvidia_384_drm.ko
/lib/modules/4.10.0-40-generic/updates/dkms/nvidia_384_uvm.ko
/lib/modules/4.10.0-40-generic/updates/dkms/nvidia_384.ko
/lib/modules/4.10.0-40-generic/updates/dkms/nvidia_384_modeset.ko
/lib/modules/4.10.0-38-generic/updates/dkms/nvidia_384_drm.ko
/lib/modules/4.10.0-38-generic/updates/dkms/nvidia_384_uvm.ko
/lib/modules/4.10.0-38-generic/updates/dkms/nvidia_384.ko
/lib/modules/4.10.0-38-generic/updates/dkms/nvidia_384_modeset.ko
/lib/modules/4.13.0-31-generic/kernel/drivers/video/fbdev/nvidia
/lib/modules/4.13.0-31-generic/kernel/drivers/video/fbdev/nvidia/nvidiafb.ko
/lib/modules/4.13.0-31-generic/kernel/drivers/net/ethernet/nvidia
/lib/modules/4.13.0-31-generic/updates/dkms/nvidia_384_drm.ko
/lib/modules/4.13.0-31-generic/updates/dkms/nvidia_384_uvm.ko
/lib/modules/4.13.0-31-generic/updates/dkms/nvidia_384.ko
/lib/modules/4.13.0-31-generic/updates/dkms/nvidia_384_modeset.ko
/lib/modules/4.13.0-32-generic/kernel/drivers/video/fbdev/nvidia
/lib/modules/4.13.0-32-generic/kernel/drivers/video/fbdev/nvidia/nvidiafb.ko
/lib/modules/4.13.0-32-generic/kernel/drivers/net/ethernet/nvidia
/lib/modules/4.13.0-32-generic/updates/dkms/nvidia_418.ko
/lib/modules/4.13.0-32-generic/updates/dkms/nvidia_418_drm.ko
/lib/modules/4.13.0-32-generic/updates/dkms/nvidia_418_uvm.ko
/lib/modules/4.13.0-32-generic/updates/dkms/nvidia_418_modeset.ko
/lib/modules/4.13.0-32-generic/updates/dkms/nvidia_384_drm.ko
/lib/modules/4.13.0-32-generic/updates/dkms/nvidia_384_uvm.ko
/lib/modules/4.13.0-32-generic/updates/dkms/nvidia_384.ko
/lib/modules/4.13.0-32-generic/updates/dkms/nvidia_384_modeset.ko
/lib/modules/4.13.0-26-generic/kernel/drivers/video/fbdev/nvidia
/lib/modules/4.13.0-26-generic/kernel/drivers/video/fbdev/nvidia/nvidiafb.ko
/lib/modules/4.13.0-26-generic/kernel/drivers/net/ethernet/nvidia
/lib/modules/4.13.0-26-generic/updates/dkms/nvidia_384_drm.ko
/lib/modules/4.13.0-26-generic/updates/dkms/nvidia_384_uvm.ko
/lib/modules/4.13.0-26-generic/updates/dkms/nvidia_384.ko
/lib/modules/4.13.0-26-generic/updates/dkms/nvidia_384_modeset.ko
/lib/modules/4.10.0-42-generic/updates/dkms/nvidia_384_drm.ko
/lib/modules/4.10.0-42-generic/updates/dkms/nvidia_384_uvm.ko
/lib/modules/4.10.0-42-generic/updates/dkms/nvidia_384.ko
/lib/modules/4.10.0-42-generic/updates/dkms/nvidia_384_modeset.ko
/lib/modules/4.10.0-37-generic/updates/dkms/nvidia_384_drm.ko
/lib/modules/4.10.0-37-generic/updates/dkms/nvidia_384_uvm.ko
/lib/modules/4.10.0-37-generic/updates/dkms/nvidia_384.ko
/lib/modules/4.10.0-37-generic/updates/dkms/nvidia_384_modeset.ko
注意内核版本4.13.0-32-generic
下出现了两个版本的内核模块,这就是问题所在了。
删除或者备份(比较保险)该目录下的nvidia模块
1
$ sudo mv /lib/modules/4.13.0-32-generic/updates/dkms/nvidia* ~/dkms-backup/
重新生成新版本驱动对应的内核模块
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
$ sudo dpkg-reconfigure nvidia-418
Removing all DKMS Modules
Done.
update-initramfs: deferring update (trigger activated)
A modprobe blacklist file has been created at /etc/modprobe.d to prevent Nouveau from loading. This can be re/nvidia-graphics-drivers.conf.
A new initrd image has also been created. To revert, please replace /boot/initrd-4.13.0-32-generic with /boot
*****************************************************************************
*** Reboot your computer and verify that the NVIDIA graphics driver can ***
*** be loaded. ***
*****************************************************************************
INFO:Enable nvidia-418
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/dell_latitude
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/put_your_quirks_here
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/lenovo_thinkpad
Loading new nvidia-418-418.40.04 DKMS files...
Building only for 4.13.0-32-generic
Building for architecture x86_64
Building initial module for 4.13.0-32-generic
Done.
nvidia_418:
Running module version sanity check.
- Original module
- This kernel never originally had a module by this name
- Installation
- Installing to /lib/modules/4.13.0-32-generic/updates/dkms/
nvidia_418_modeset.ko:
Running module version sanity check.
- Original module
- This kernel never originally had a module by this name
- Installation
- Installing to /lib/modules/4.13.0-32-generic/updates/dkms/
nvidia_418_drm.ko:
Running module version sanity check.
- Original module
- This kernel never originally had a module by this name
- Installation
- Installing to /lib/modules/4.13.0-32-generic/updates/dkms/
nvidia_418_uvm.ko:
Running module version sanity check.
- Original module
- This kernel never originally had a module by this name
- Installation
- Installing to /lib/modules/4.13.0-32-generic/updates/dkms/
depmod....
DKMS: install completed.
Processing triggers for initramfs-tools (0.122ubuntu8.8) ...
update-initramfs: Generating /boot/initrd.img-4.13.0-32-generic
W: Possible missing firmware /lib/firmware/ast_dp501_fw.bin for module ast
此时应当已经只有新版本的模块在目录中了
1
2
$ ls /lib/modules/4.13.0-32-generic/updates/dkms
bbswitch.ko nvidia_418_drm.ko nvidia_418.ko nvidia_418_modeset.ko nvidia_418_uvm.ko
然后重启就完事儿了。
This work is licensed under a
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.