On systems with an NVIDIA GPU, a simple apt upgrade
can leave you in a dreaded situation where
the GPU still works, but NVIDIA tooling like nvidia-smi
no longer does. It just prints
an error message like
$ nvidia-smi
NVML: Driver/library version mismatch
This happens when apt upgrades the user-space tooling while the nvidia kernel modules are already loaded in memory, so the running modules are still the previous version. The usual approach is to reboot your machine. Sometimes this is not acceptable. In these cases, you can instead remove the now outdated kernel modules and load the updated version.
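To confirm this diagnosis before changing anything, one option is to compare the version of the loaded kernel module with the version of the module file installed on disk. The sketch below assumes that /sys/module/nvidia/version reports the loaded version and that modinfo -F version nvidia reports the on-disk one. The helper names are made up for illustration.

```shell
# Hypothetical helper: succeeds when both versions are known and differ.
# $1: version of the loaded module, $2: version of the module on disk
nvidia_versions_differ() {
    [ -n "$1" ] && [ -n "$2" ] && [ "$1" != "$2" ]
}

# Hypothetical helper: prints a short human-readable verdict.
nvidia_mismatch_report() {
    if nvidia_versions_differ "$1" "$2"; then
        echo "mismatch: loaded $1, installed $2"
    else
        echo "no mismatch detected"
    fi
}
```

With the helpers defined, a check could look like nvidia_mismatch_report "$(cat /sys/module/nvidia/version)" "$(modinfo -F version nvidia)".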
The solution
In most cases, running this sequence of commands is sufficient:
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia_uvm
sudo rmmod nvidia
nvidia-smi
The call to nvidia-smi reloads the required kernel modules automatically.
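If you need this more than once, the sequence can be wrapped in a small function. This is only a sketch: it skips modules that are not loaded (assuming /sys/module/<name> exists exactly while a loadable module is loaded) and stops at the first removal that fails. RMMOD and SYS_MODULE_DIR are hypothetical override hooks added here so the function can be dry-run. They are not part of any NVIDIA tooling.

```shell
# Removes the nvidia modules in the order used above: dependents first,
# nvidia itself last.
unload_nvidia_modules() {
    for mod in nvidia_drm nvidia_modeset nvidia_uvm nvidia; do
        # Skip modules that are not currently loaded.
        [ -d "${SYS_MODULE_DIR:-/sys/module}/$mod" ] || continue
        # Stop at the first failure: removing later modules would fail
        # as well while a dependent module is still loaded.
        ${RMMOD:-sudo rmmod} "$mod" || {
            echo "failed to remove $mod" >&2
            return 1
        }
    done
}
```

Running RMMOD=echo unload_nvidia_modules prints the modules that would be removed without touching them. Running unload_nvidia_modules && nvidia-smi performs the removal and then triggers the reload.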
Troubleshooting
We cannot remove a kernel module if another module depends on it. We first need to remove all dependent modules. That’s why we remove nvidia_drm, nvidia_modeset, and nvidia_uvm first. If a module has additional dependents not considered in this article, its removal will fail. For example, if we tried to remove nvidia first, rmmod would print an error message. To find additional dependent modules of nvidia, run
lsmod | grep nvidia
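The dependents of a module appear in the last column of the lsmod output ("Used by"). As a rough sketch, a small awk filter can pull them out for a single module. The function name module_dependents is made up for this example, and it assumes the usual four-column lsmod layout.

```shell
# Reads lsmod-formatted text from stdin and prints the modules listed in
# the "Used by" column for module $1, one per line.
module_dependents() {
    awk -v mod="$1" '
        $1 == mod && NF >= 4 {
            n = split($4, deps, ",")
            for (i = 1; i <= n; i++)
                if (deps[i] != "") print deps[i]
        }'
}
```

For example, lsmod | module_dependents nvidia lists the modules that have to be removed before nvidia itself.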
Removing a module can also fail if a process is still using a device. In these cases, use lsof to get a list of these processes. For example, if the nvidia device plugin for Kubernetes prevents you from removing nvidia_uvm, you’ll find this out with
sudo lsof /dev/nvidia-uvm
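A process can hold several NVIDIA device files open at once, so the raw lsof listing tends to repeat itself. The sketch below condenses it to one line per process. It assumes the default lsof column layout (COMMAND first, PID second), and the function name lsof_holders is invented for this example.

```shell
# Reads lsof output from stdin and prints each "PID COMMAND" pair once,
# skipping the header line.
lsof_holders() {
    awk 'NR > 1 && !seen[$2]++ { print $2, $1 }'
}
```

Assuming the device nodes follow the usual /dev/nvidia* naming, sudo lsof /dev/nvidia* | lsof_holders shows every process that has to be stopped before the modules can be removed.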