SUMMARY
I have a Lenovo Ideapad 5 Pro laptop with an AMD iGPU and an Nvidia dGPU. I mainly use the AMD iGPU to drive the display, and the Nvidia dGPU is reserved for offline GPU computations (Darktable etc.). That means that most of the time the Nvidia GPU should stay in its D3cold power state (whenever I'm not running any application that uses the dGPU). This itself works fine with current Nvidia drivers and dynamic power management. However, when I log in, ksystemstatsd always starts an nvidia-smi process that polls continuously in the background, and that polling keeps the dGPU awake (in its D0 power state). My laptop's normal idle power consumption is ~5 W, but keeping the Nvidia GPU awake adds another 2 W to it, i.e. 40%! It would be nice if KDE System Monitor could be configured to ignore certain GPUs. Currently my only workaround is to rename /usr/lib64/qt5/plugins/ksystemstats/ksystemstats_plugin_gpu.so so that it doesn't load on startup.

STEPS TO REPRODUCE
1. Log in to KDE
2. Notice that nvidia-smi is running in the background and /sys/bus/pci/devices/0000:01:00.0/power_state always shows D0
3. Kill nvidia-smi manually
4. After a couple of seconds, /sys/bus/pci/devices/0000:01:00.0/power_state shows D3cold and the laptop's power consumption is reduced

OBSERVED RESULT

EXPECTED RESULT

SOFTWARE/OS VERSIONS
Linux/KDE Plasma: (available in About System)
KDE Plasma Version:
KDE Frameworks Version:
Qt Version:

ADDITIONAL INFORMATION
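For anyone reproducing this, a tiny standalone watcher along these lines (just an illustration, not part of any KDE component) makes the D0/D3cold transitions in steps 2-4 easy to observe. The PCI address is the dGPU address from this report and will differ on other machines.

```cpp
// power_state_watch.cpp -- illustration only; build with: g++ -std=c++17 power_state_watch.cpp
// Polls the dGPU's runtime power state once per second.
#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

int main()
{
    // Address taken from this report; adjust to your own dGPU.
    const std::string path = "/sys/bus/pci/devices/0000:01:00.0/power_state";

    while (true) {
        std::ifstream file(path);
        std::string state;
        if (file >> state) {
            std::cout << state << std::endl; // D0 while something polls the GPU, D3cold when idle
        } else {
            std::cerr << "cannot read " << path << std::endl;
            return 1;
        }
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}
```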
I made a small mistake in my testing, so a clarification is in order: If I don't use *any* GPU sensors (in System Monitor or the System Monitor widget), then nvidia-smi is not started. But if I include any sensors from the AMD iGPU (like its temperature), it seems that nvidia-smi is started even though no Nvidia sensors are used. (I tried this by adding the AMD GPU temperature to the System Monitor widget, then logged out and in again. That started nvidia-smi in the background.)
The GPU plugin uses two instances of nvidia-smi: one to query the hardware for static information like the amount of memory, and one to read the current sensor values. The first is intended to run only once and then quit; the second should only be active if something on your system is making use of one of the sensors related to the NVidia GPU. Maybe the first process lingers instead of quitting? You can check with `nvidia-smi --query`.
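If it helps with that check, here is a small illustrative sketch (not part of the plugin) that scans /proc and lists any running nvidia-smi instances, so one can see whether the one-shot query process lingers or a continuous poller is active.

```cpp
// find_nvidia_smi.cpp -- illustration only; build with: g++ -std=c++17 find_nvidia_smi.cpp
// Lists PIDs whose /proc/<pid>/comm is "nvidia-smi".
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    namespace fs = std::filesystem;
    for (const auto &entry : fs::directory_iterator("/proc")) {
        const std::string pid = entry.path().filename();
        if (pid.find_first_not_of("0123456789") != std::string::npos) {
            continue; // not a process directory
        }
        std::ifstream comm(entry.path() / "comm");
        std::string name;
        if (std::getline(comm, name) && name == "nvidia-smi") {
            std::cout << "nvidia-smi running as PID " << pid << '\n';
        }
    }
}
```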
Ok, now I think I know what happens: if I have just the AMD GPU active (using prime-select amd to configure X to use only the AMD GPU) and configure System Monitor to show the AMD GPU's temperature, and I then also enable the NVIDIA GPU for offload use (prime-select offload), then after I log out and in again, nvidia-smi is started in the background and System Monitor seems to start showing the Nvidia GPU's temperature instead. So the problem seems to be in how the GPUs are recognized.
The GPU plugin queries udev for the available GPUs. If udev's ordering changes, then the order of the GPUs changes, and where the AMD GPU used to be gpu0, that slot may then be taken by the NVidia GPU. Maybe you can look at `/sys/class/drm/card*` and see if those change as well? If so, it seems like an upstream bug that we might need to work around somehow.
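For comparing the two states, a small sketch like this (illustrative only, assuming /sys/class/drm exists) lists each whole card node with its PCI vendor ID (0x1002 = AMD, 0x10de = NVIDIA), which makes it easy to see whether the DRM numbering stays stable across prime-select changes.

```cpp
// drm_cards.cpp -- illustration only; build with: g++ -std=c++17 drm_cards.cpp
// Prints each DRM card node and its PCI vendor ID.
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    namespace fs = std::filesystem;
    for (const auto &entry : fs::directory_iterator("/sys/class/drm")) {
        const std::string name = entry.path().filename();
        // Keep whole cards ("card0", "card1"), skip connector nodes ("card0-eDP-1").
        if (name.rfind("card", 0) != 0 || name.find('-') != std::string::npos) {
            continue;
        }
        std::ifstream vendorFile(entry.path() / "device/vendor");
        std::string vendor;
        std::getline(vendorFile, vendor);
        std::cout << name << " -> vendor " << (vendor.empty() ? "unknown" : vendor) << '\n';
    }
}
```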
Created attachment 148673 [details] Snippet of System monitor when only AMD iGPU is enabled
Created attachment 148674 [details] The same System monitor snippet when Nvidia dGPU is enabled for offload use
Created attachment 148675 [details] Yet another System monitor snippet, now showing both GPUs
Here's what /sys/class/drm/card* shows when only the AMD iGPU is enabled (prime-select amd):

/sys/class/drm/card0: symbolic link to ../../devices/pci0000:00/0000:00:08.1/0000:05:00.0/drm/card0
/sys/class/drm/card0-DP-1: symbolic link to ../../devices/pci0000:00/0000:00:08.1/0000:05:00.0/drm/card0/card0-DP-1
/sys/class/drm/card0-eDP-1: symbolic link to ../../devices/pci0000:00/0000:00:08.1/0000:05:00.0/drm/card0/card0-eDP-1
/sys/class/drm/card0-HDMI-A-1: symbolic link to ../../devices/pci0000:00/0000:00:08.1/0000:05:00.0/drm/card0/card0-HDMI-A-1

(/sys/class/drm/card0/device/vendor is 0x1002)

Here's the same when the Nvidia dGPU is enabled for offload use (prime-select offload):

/sys/class/drm/card0: symbolic link to ../../devices/pci0000:00/0000:00:08.1/0000:05:00.0/drm/card0
/sys/class/drm/card0-DP-1: symbolic link to ../../devices/pci0000:00/0000:00:08.1/0000:05:00.0/drm/card0/card0-DP-1
/sys/class/drm/card0-eDP-1: symbolic link to ../../devices/pci0000:00/0000:00:08.1/0000:05:00.0/drm/card0/card0-eDP-1
/sys/class/drm/card0-HDMI-A-1: symbolic link to ../../devices/pci0000:00/0000:00:08.1/0000:05:00.0/drm/card0/card0-HDMI-A-1
/sys/class/drm/card1: symbolic link to ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/drm/card1

(/sys/class/drm/card0/device/vendor is still 0x1002, /sys/class/drm/card1/device/vendor is 0x10de)

So only /sys/class/drm/card1 was added. However, System-monitor-amd.png is a screenshot snippet showing System Monitor when only the AMD iGPU is enabled (I've added it to show GPU1 max RAM, which I've limited to 512 MB for the iGPU). Then, when I just enable the Nvidia dGPU (run "prime-select offload") and log out and in again, the System-monitor-offload1.png snippet shows that System Monitor's GPU1 max RAM is now 4 GB (which is what the Nvidia dGPU has). And adding GPU2, System-monitor-offload2.png shows that GPU2 max RAM is the iGPU's 512 MB.
Ok, so DRM has stable numbering but udev does not.

https://invent.kde.org/plasma/ksystemstats/-/merge_requests/35

This should fix things, though it may end up swapping devices for some people.
It seems that the mentioned merge request has been blocked; is there any rough guess as to when the fix might end up in production?
Git commit 5eed0d51c0830ce1099e308e0326a5ff9b0ec82d by Arjen Hiemstra.
Committed on 14/06/2022 at 11:36.
Pushed by ahiemstra into branch 'master'.

GPU: Query for DRM devices and use DRM number as card number

The order in which PCI devices are enumerated can apparently change with
some driver changes. This means that GPU 1 suddenly becomes GPU 2 and
the other way around. The DRM subsystem does seem to have a consistent
numbering for these devices, so query the DRM subsystem for devices and
use its numbering for GPU indexing so that it remains stable.

M  +17   -13   plugins/gpu/LinuxBackend.cpp

https://invent.kde.org/plasma/ksystemstats/commit/5eed0d51c0830ce1099e308e0326a5ff9b0ec82d
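For reference, the following is not the actual LinuxBackend.cpp change, only a minimal standalone sketch of the approach the commit message describes: enumerate the drm subsystem through libudev and take a stable index from the cardN sysname rather than from PCI enumeration order. The file name and output format are made up for the example.

```cpp
// drm_index.cpp -- illustration only, not the ksystemstats patch.
// Build with: g++ -std=c++17 drm_index.cpp -ludev
#include <libudev.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main()
{
    udev *ctx = udev_new();
    if (!ctx) {
        return 1;
    }

    // Enumerate the "drm" subsystem instead of relying on PCI enumeration order.
    udev_enumerate *enumerate = udev_enumerate_new(ctx);
    udev_enumerate_add_match_subsystem(enumerate, "drm");
    udev_enumerate_scan_devices(enumerate);

    udev_list_entry *entry = nullptr;
    udev_list_entry_foreach(entry, udev_enumerate_get_list_entry(enumerate)) {
        const char *syspath = udev_list_entry_get_name(entry);
        udev_device *device = udev_device_new_from_syspath(ctx, syspath);
        if (!device) {
            continue;
        }

        // Keep whole cards ("card0", "card1"), skip connectors ("card0-eDP-1") and render nodes.
        const char *sysname = udev_device_get_sysname(device);
        if (sysname && std::strncmp(sysname, "card", 4) == 0 && !std::strchr(sysname, '-')) {
            const int drmIndex = std::atoi(sysname + 4); // stable DRM card number
            udev_device *pci = udev_device_get_parent_with_subsystem_devtype(device, "pci", nullptr);
            const char *vendor = pci ? udev_device_get_sysattr_value(pci, "vendor") : nullptr;
            std::printf("drm card %d: %s (vendor %s)\n", drmIndex, syspath, vendor ? vendor : "unknown");
        }
        udev_device_unref(device);
    }

    udev_enumerate_unref(enumerate);
    udev_unref(ctx);
    return 0;
}
```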
(In reply to Arjen Hiemstra from comment #11)
> Git commit 5eed0d51c0830ce1099e308e0326a5ff9b0ec82d by Arjen Hiemstra.
> Committed on 14/06/2022 at 11:36.
> Pushed by ahiemstra into branch 'master'.
>
> GPU: Query for DRM devices and use DRM number as card number
>
> The order in which PCI devices are enumerated can apparently change with
> some driver changes. This means that GPU 1 suddenly becomes GPU 2 and
> the other way around. The DRM subsystem does seem to have a consistent
> numbering for these devices, so query the DRM subsystem for devices and
> use its numbering for GPU indexing so that it remains stable.
>
> M +17 -13 plugins/gpu/LinuxBackend.cpp
>
> https://invent.kde.org/plasma/ksystemstats/commit/5eed0d51c0830ce1099e308e0326a5ff9b0ec82d

This fails here, as I have DRM disabled in the kernel, so there is no /sys/class/drm.
Forgot to add: kernel 6.08, nvidia-drivers 525.53, RTX 3060. Just reverting the changes used to work with previous versions of the nvidia drivers. With the current one, all the values are 0, but I guess that is another bug.
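Regarding the missing /sys/class/drm mentioned above: a trivial guard like the sketch below (not the plugin code) shows the check a DRM-based enumeration would need before relying on DRM card numbering, falling back to some other source when the directory is absent.

```cpp
// drm_available.cpp -- sketch only; build with: g++ -std=c++17 drm_available.cpp
// On kernels built without DRM, /sys/class/drm does not exist at all.
#include <filesystem>
#include <iostream>

int main()
{
    if (std::filesystem::exists("/sys/class/drm")) {
        std::cout << "DRM sysfs present, card numbering can be used\n";
    } else {
        std::cout << "no /sys/class/drm, a PCI/udev fallback is needed\n";
    }
}
```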
*** Bug 465559 has been marked as a duplicate of this bug. ***