SUMMARY
I have a Lenovo Ideapad 5 Pro laptop with an AMD iGPU and an Nvidia dGPU. I mainly use the AMD iGPU to drive the display, and the Nvidia dGPU is reserved for offline GPU computations (Darktable etc.). That means that most of the time the Nvidia GPU should stay in its D3cold power state (whenever I'm not running any application that uses the dGPU). This itself works fine with current Nvidia drivers and dynamic power management. However, when I log in, ksystemstatsd always starts an nvidia-smi process that polls continuously in the background, and that polling keeps the dGPU awake (in its D0 power state). My laptop's normal idle power consumption is ~5 W, but keeping the Nvidia GPU awake adds another 2 W to it, i.e. 40%! It would be nice if KDE System Monitor could be configured to ignore certain GPUs. Currently my only workaround is to rename /usr/lib64/qt5/plugins/ksystemstats/ksystemstats_plugin_gpu.so so that it doesn't load on startup.

STEPS TO REPRODUCE
1. Log in to KDE
2. Notice that nvidia-smi is running in the background and /sys/bus/pci/devices/0000:01:00.0/power_state always shows D0
3. Kill nvidia-smi manually
4. After a couple of seconds, /sys/bus/pci/devices/0000:01:00.0/power_state shows D3cold and the laptop's power consumption is reduced

OBSERVED RESULT

EXPECTED RESULT

SOFTWARE/OS VERSIONS
Linux/KDE Plasma: (available in About System)
KDE Plasma Version:
KDE Frameworks Version:
Qt Version:

ADDITIONAL INFORMATION
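For anyone reproducing this, a tiny standalone watcher along these lines (just an illustration, not part of any KDE component) makes the D0/D3cold transitions in steps 2-4 easy to observe. The PCI address is the dGPU address from this report and will differ on other machines.

```cpp
// power_state_watch.cpp -- illustration only; build with: g++ -std=c++17 power_state_watch.cpp
// Polls the dGPU's runtime power state once per second.
#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

int main()
{
    // Address taken from this report; adjust to your own dGPU.
    const std::string path = "/sys/bus/pci/devices/0000:01:00.0/power_state";

    while (true) {
        std::ifstream file(path);
        std::string state;
        if (file >> state) {
            std::cout << state << std::endl; // D0 while something polls the GPU, D3cold when idle
        } else {
            std::cerr << "cannot read " << path << std::endl;
            return 1;
        }
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}
```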
I made a small mistake in my testing, so a clarification is in order: If I don't use *any* GPU sensors (in System Monitor or the System Monitor widget), then nvidia-smi is not started. But if I include any sensors from the AMD iGPU (like its temperature), it seems that nvidia-smi is started even though no Nvidia sensors are used. (I tried this by adding the AMD GPU temperature to the System Monitor widget, then logged out and in again. That started nvidia-smi in the background.)
The GPU plugin uses two instances of nvidia-smi: one to query the hardware for static information like the amount of memory, and one to read the current sensor values. The first is intended to run only once and then quit; the second should only be active if something on your system is making use of one of the sensors related to the NVidia GPU. Maybe the first process lingers instead of quitting? You can check with `nvidia-smi --query`.
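If it helps with that check, here is a small illustrative sketch (not part of the plugin) that scans /proc and lists any running nvidia-smi instances, so one can see whether the one-shot query process lingers or a continuous poller is active.

```cpp
// find_nvidia_smi.cpp -- illustration only; build with: g++ -std=c++17 find_nvidia_smi.cpp
// Lists PIDs whose /proc/<pid>/comm is "nvidia-smi".
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    namespace fs = std::filesystem;
    for (const auto &entry : fs::directory_iterator("/proc")) {
        const std::string pid = entry.path().filename();
        if (pid.find_first_not_of("0123456789") != std::string::npos) {
            continue; // not a process directory
        }
        std::ifstream comm(entry.path() / "comm");
        std::string name;
        if (std::getline(comm, name) && name == "nvidia-smi") {
            std::cout << "nvidia-smi running as PID " << pid << '\n';
        }
    }
}
```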
Ok, now I think I know what happens: if I have just the AMD GPU active (using prime-select amd to configure X to use only the AMD GPU) and configure System Monitor to show the AMD GPU's temperature, and I then also enable the NVIDIA GPU for offload use (prime-select offload), then after I log out and in again, nvidia-smi is started in the background and System Monitor seems to start showing the Nvidia GPU's temperature instead. So the problem seems to be in how the GPUs are recognized.
The GPU plugin queries udev for the available GPUs. If udev's ordering changes, then the order of the GPUs changes, and where the AMD GPU used to be gpu0, that slot may then be taken by the NVidia GPU. Maybe you can look at `/sys/class/drm/card*` and see if those change as well? If so, it seems like an upstream bug that we might need to work around somehow.
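For comparing the two states, a small sketch like this (illustrative only, assuming /sys/class/drm exists) lists each whole card node with its PCI vendor ID (0x1002 = AMD, 0x10de = NVIDIA), which makes it easy to see whether the DRM numbering stays stable across prime-select changes.

```cpp
// drm_cards.cpp -- illustration only; build with: g++ -std=c++17 drm_cards.cpp
// Prints each DRM card node and its PCI vendor ID.
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    namespace fs = std::filesystem;
    for (const auto &entry : fs::directory_iterator("/sys/class/drm")) {
        const std::string name = entry.path().filename();
        // Keep whole cards ("card0", "card1"), skip connector nodes ("card0-eDP-1").
        if (name.rfind("card", 0) != 0 || name.find('-') != std::string::npos) {
            continue;
        }
        std::ifstream vendorFile(entry.path() / "device/vendor");
        std::string vendor;
        std::getline(vendorFile, vendor);
        std::cout << name << " -> vendor " << (vendor.empty() ? "unknown" : vendor) << '\n';
    }
}
```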
Created attachment 148673 [details] Snippet of System monitor when only AMD iGPU is enabled
Created attachment 148674 [details] The same System monitor snippet when Nvidia dGPU is enabled for offload use
Created attachment 148675 [details] Yet another System monitor snippet, now showing both GPUs
Here's what /sys/class/drm/card* shows when only the AMD iGPU is enabled (prime-select amd):

/sys/class/drm/card0: symbolic link to ../../devices/pci0000:00/0000:00:08.1/0000:05:00.0/drm/card0
/sys/class/drm/card0-DP-1: symbolic link to ../../devices/pci0000:00/0000:00:08.1/0000:05:00.0/drm/card0/card0-DP-1
/sys/class/drm/card0-eDP-1: symbolic link to ../../devices/pci0000:00/0000:00:08.1/0000:05:00.0/drm/card0/card0-eDP-1
/sys/class/drm/card0-HDMI-A-1: symbolic link to ../../devices/pci0000:00/0000:00:08.1/0000:05:00.0/drm/card0/card0-HDMI-A-1

(/sys/class/drm/card0/device/vendor is 0x1002)

Here's the same when the Nvidia dGPU is enabled for offload use (prime-select offload):

/sys/class/drm/card0: symbolic link to ../../devices/pci0000:00/0000:00:08.1/0000:05:00.0/drm/card0
/sys/class/drm/card0-DP-1: symbolic link to ../../devices/pci0000:00/0000:00:08.1/0000:05:00.0/drm/card0/card0-DP-1
/sys/class/drm/card0-eDP-1: symbolic link to ../../devices/pci0000:00/0000:00:08.1/0000:05:00.0/drm/card0/card0-eDP-1
/sys/class/drm/card0-HDMI-A-1: symbolic link to ../../devices/pci0000:00/0000:00:08.1/0000:05:00.0/drm/card0/card0-HDMI-A-1
/sys/class/drm/card1: symbolic link to ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/drm/card1

(/sys/class/drm/card0/device/vendor is still 0x1002, /sys/class/drm/card1/device/vendor is 0x10de)

So only /sys/class/drm/card1 was added. However, System-monitor-amd.png is a screenshot snippet showing System Monitor when only the AMD iGPU is enabled (I've added it to show GPU1 max RAM, which I've limited to 512 MB for the iGPU). Then, when I just enable the Nvidia dGPU (run "prime-select offload") and log out and in again, the System-monitor-offload1.png snippet shows that System Monitor's GPU1 max RAM is now 4 GB (which is what the Nvidia dGPU has). And adding GPU2, System-monitor-offload2.png shows that GPU2 max RAM is the iGPU's 512 MB.
Ok, so DRM has stable numbering but udev does not.

https://invent.kde.org/plasma/ksystemstats/-/merge_requests/35

This should fix things, though it may end up swapping devices for some people.
It seems that the mentioned merge request has been blocked; is there any rough guess as to when the fix might end up in production?
Git commit 5eed0d51c0830ce1099e308e0326a5ff9b0ec82d by Arjen Hiemstra.
Committed on 14/06/2022 at 11:36.
Pushed by ahiemstra into branch 'master'.

GPU: Query for DRM devices and use DRM number as card number

The order in which PCI devices are enumerated can apparently change with
some driver changes. This means that GPU 1 suddenly becomes GPU 2 and
the other way around. The DRM subsystem does seem to have a consistent
numbering for these devices, so query the DRM subsystem for devices and
use its numbering for GPU indexing so that it remains stable.

M  +17   -13   plugins/gpu/LinuxBackend.cpp

https://invent.kde.org/plasma/ksystemstats/commit/5eed0d51c0830ce1099e308e0326a5ff9b0ec82d
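For reference, the following is not the actual LinuxBackend.cpp change, only a minimal standalone sketch of the approach the commit message describes: enumerate the drm subsystem through libudev and take a stable index from the cardN sysname rather than from PCI enumeration order. The file name and output format are made up for the example.

```cpp
// drm_index.cpp -- illustration only, not the ksystemstats patch.
// Build with: g++ -std=c++17 drm_index.cpp -ludev
#include <libudev.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main()
{
    udev *ctx = udev_new();
    if (!ctx) {
        return 1;
    }

    // Enumerate the "drm" subsystem instead of relying on PCI enumeration order.
    udev_enumerate *enumerate = udev_enumerate_new(ctx);
    udev_enumerate_add_match_subsystem(enumerate, "drm");
    udev_enumerate_scan_devices(enumerate);

    udev_list_entry *entry = nullptr;
    udev_list_entry_foreach(entry, udev_enumerate_get_list_entry(enumerate)) {
        const char *syspath = udev_list_entry_get_name(entry);
        udev_device *device = udev_device_new_from_syspath(ctx, syspath);
        if (!device) {
            continue;
        }

        // Keep whole cards ("card0", "card1"), skip connectors ("card0-eDP-1") and render nodes.
        const char *sysname = udev_device_get_sysname(device);
        if (sysname && std::strncmp(sysname, "card", 4) == 0 && !std::strchr(sysname, '-')) {
            const int drmIndex = std::atoi(sysname + 4); // stable DRM card number
            udev_device *pci = udev_device_get_parent_with_subsystem_devtype(device, "pci", nullptr);
            const char *vendor = pci ? udev_device_get_sysattr_value(pci, "vendor") : nullptr;
            std::printf("drm card %d: %s (vendor %s)\n", drmIndex, syspath, vendor ? vendor : "unknown");
        }
        udev_device_unref(device);
    }

    udev_enumerate_unref(enumerate);
    udev_unref(ctx);
    return 0;
}
```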
(In reply to Arjen Hiemstra from comment #11)
> Git commit 5eed0d51c0830ce1099e308e0326a5ff9b0ec82d by Arjen Hiemstra.
> Committed on 14/06/2022 at 11:36.
> Pushed by ahiemstra into branch 'master'.
>
> GPU: Query for DRM devices and use DRM number as card number
>
> The order in which PCI devices are enumerated can apparently change with
> some driver changes. This means that GPU 1 suddenly becomes GPU 2 and
> the other way around. The DRM subsystem does seem to have a consistent
> numbering for these devices, so query the DRM subsystem for devices and
> use its numbering for GPU indexing so that it remains stable.
>
> M +17 -13 plugins/gpu/LinuxBackend.cpp
>
> https://invent.kde.org/plasma/ksystemstats/commit/5eed0d51c0830ce1099e308e0326a5ff9b0ec82d

This fails here, as I have DRM disabled in the kernel, so there is no /sys/class/drm.
Forgot to add: kernel 6.08, nvidia-drivers 525.53, RTX 3060. Just reverting the changes used to work with previous versions of the nvidia drivers. With the current one, all the values are 0, but I guess that is another bug.
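Regarding the missing /sys/class/drm mentioned above: a trivial guard like the sketch below (not the plugin code) shows the check a DRM-based enumeration would need before relying on DRM card numbering, falling back to some other source when the directory is absent.

```cpp
// drm_available.cpp -- sketch only; build with: g++ -std=c++17 drm_available.cpp
// On kernels built without DRM, /sys/class/drm does not exist at all.
#include <filesystem>
#include <iostream>

int main()
{
    if (std::filesystem::exists("/sys/class/drm")) {
        std::cout << "DRM sysfs present, card numbering can be used\n";
    } else {
        std::cout << "no /sys/class/drm, a PCI/udev fallback is needed\n";
    }
}
```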
*** Bug 465559 has been marked as a duplicate of this bug. ***