| Summary: | ksystemstatsd starts nvidia-smi, which prevents GPU from entering powersave on offload use | | |
|---|---|---|---|
| Product: | [Applications] plasma-systemmonitor | Reporter: | Matti Rintala <mrintala43> |
| Component: | general | Assignee: | KSysGuard Developers <ksysguard-bugs> |
| Status: | RESOLVED FIXED | | |
| Severity: | minor | CC: | ahiemstra, dvs-1111, emilianh, nate, plasma-bugs-null |
| Priority: | NOR | | |
| Version First Reported In: | 5.24.4 | | |
| Target Milestone: | --- | | |
| Platform: | openSUSE | | |
| OS: | Linux | | |
| Latest Commit: | https://invent.kde.org/plasma/ksystemstats/commit/5eed0d51c0830ce1099e308e0326a5ff9b0ec82d | Version Fixed/Implemented In: | 5.26 |
| Sentry Crash Report: | | | |
| Attachments: | Snippet of System monitor when only AMD iGPU is enabled; The same System monitor snippet when Nvidia dGPU is enabled for offload use; Yet another System monitor snippet, now showing both GPUs | | |
Description
Matti Rintala
2022-05-02 07:14:18 UTC
I made a small mistake in my testing, so a clarification is in order: If I don't use *any* GPU sensors (in System Monitor or the System Monitor widget), then nvidia-smi is not started. But if I include any sensors from the AMD iGPU (like its temperature), it seems that nvidia-smi is started even though no Nvidia sensors are used. (I tried this by adding the AMD GPU temperature to the System Monitor widget, then logged out and in again. That started nvidia-smi in the background.)

Two instances of nvidia-smi are used by the GPU plugin: one to query the hardware for things like the amount of memory, and one that is used to read current sensor values. The first is intended to run only once and then quit; the second should only be active if something on your system is making use of one of the sensors related to the NVidia GPU. Maybe the first process lingers instead of quitting? You can check with `nvidia-smi --query`. (A sketch of this two-process pattern follows the first attachment below.)

Ok, now I think I know what happens: If I have just the AMD GPU active (using prime-select amd to configure X to use only the AMD GPU) and configure System Monitor to show the AMD GPU's temperature, and I then also enable the NVIDIA GPU for offload use (prime-select offload), then after I log out and in again, nvidia-smi is started in the background and System Monitor seems to start showing the Nvidia GPU's temperature instead. So the problem seems to be in how the GPUs are recognized.

The GPU plugin queries udev for the available GPUs. If udev's order changes, then the order of the GPUs changes, and where the AMD GPU was gpu0 it may then become the NVidia GPU. (A sketch of this kind of udev enumeration follows the second attachment below.) Maybe you can look at `/sys/class/drm/card*` and see if those change as well? If so, it seems like an upstream bug that we might need to work around somehow.

Created attachment 148673 [details]
Snippet of System monitor when only AMD iGPU is enabled
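As an illustration of the two-process nvidia-smi pattern described above, here is a minimal sketch using plain POSIX popen(). It is a stand-in under stated assumptions, not the plugin's actual process handling; the exact nvidia-smi flags ksystemstats uses may differ.

```cpp
// Sketch of the two-process pattern: a one-shot hardware query that
// should exit immediately, and (separately) a long-running monitor that
// should only be spawned while an NVidia sensor is actually subscribed.
#include <cstdio>
#include <iostream>
#include <string>

// One-shot query: run nvidia-smi once, read its output, and wait for it
// to exit. If this child process lingered, it would keep the dGPU awake.
std::string queryHardwareOnce()
{
    std::string output;
    FILE *pipe = popen("nvidia-smi --query", "r");
    if (!pipe) {
        return output;
    }
    char buffer[4096];
    while (fgets(buffer, sizeof(buffer), pipe)) {
        output += buffer;
    }
    pclose(pipe); // reap the child so nothing lingers
    return output;
}

int main()
{
    // Print the start of the report; empty if nvidia-smi is unavailable.
    std::cout << queryHardwareOnce().substr(0, 200) << '\n';
}
```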
Created attachment 148674 [details]
The same System monitor snippet when Nvidia dGPU is enabled for offload use
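For context on the udev-ordering issue described above, here is a minimal sketch of how a backend might enumerate display-class PCI devices with libudev. It is illustrative only, not the actual ksystemstats code: the match on PCI class `0x03*` and the running `index` variable are assumptions.

```cpp
// Sketch: list display-class PCI devices the way a udev-based backend
// might discover GPUs. The iteration order is whatever udev reports;
// nothing here guarantees that the AMD iGPU keeps index 0 across
// driver changes, which is the instability described above.
// Build with: g++ udev_gpus.cpp -ludev
#include <libudev.h>
#include <cstdio>

int main()
{
    struct udev *udev = udev_new();
    if (!udev) {
        return 1;
    }

    struct udev_enumerate *enumerate = udev_enumerate_new(udev);
    udev_enumerate_add_match_subsystem(enumerate, "pci");
    // PCI class 0x03xxxx covers VGA (0x0300) and 3D (0x0302) controllers.
    udev_enumerate_add_match_sysattr(enumerate, "class", "0x03*");
    udev_enumerate_scan_devices(enumerate);

    int index = 0;
    struct udev_list_entry *entry;
    udev_list_entry_foreach(entry, udev_enumerate_get_list_entry(enumerate)) {
        struct udev_device *device =
            udev_device_new_from_syspath(udev, udev_list_entry_get_name(entry));
        if (!device) {
            continue;
        }
        const char *vendor = udev_device_get_sysattr_value(device, "vendor");
        printf("gpu%d: %s (vendor %s)\n", index++,
               udev_device_get_sysname(device), vendor ? vendor : "?");
        udev_device_unref(device);
    }

    udev_enumerate_unref(enumerate);
    udev_unref(udev);
    return 0;
}
```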
Created attachment 148675 [details]
Yet another System monitor snippet, now showing both GPUs
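The manual inspection shown in the next comment can also be scripted. Here is a small C++17 sketch, assuming std::filesystem is available; the filtering rule and output format are illustrative, not taken from the plugin.

```cpp
// Sketch: walk /sys/class/drm and print each plain "cardN" entry's
// symlink target and PCI vendor ID (0x1002 = AMD, 0x10de = NVIDIA),
// mirroring the manual inspection in the comment that follows.
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>

namespace fs = std::filesystem;

int main()
{
    const fs::path drm{"/sys/class/drm"};
    if (!fs::exists(drm)) {
        std::cerr << "no /sys/class/drm on this kernel\n";
        return 1;
    }
    for (const auto &entry : fs::directory_iterator(drm)) {
        const std::string name = entry.path().filename().string();
        // Skip connectors ("card0-eDP-1") and render nodes ("renderD128").
        if (name.rfind("card", 0) != 0 || name.find('-') != std::string::npos) {
            continue;
        }
        std::string vendor;
        std::ifstream(entry.path() / "device/vendor") >> vendor;
        std::cout << name << " -> " << fs::read_symlink(entry.path()).string()
                  << " (vendor " << vendor << ")\n";
    }
}
```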
Here's what /sys/class/drm/card* shows when only the AMD iGPU is enabled (prime-select amd):

/sys/class/drm/card0: symbolic link to ../../devices/pci0000:00/0000:00:08.1/0000:05:00.0/drm/card0
/sys/class/drm/card0-DP-1: symbolic link to ../../devices/pci0000:00/0000:00:08.1/0000:05:00.0/drm/card0/card0-DP-1
/sys/class/drm/card0-eDP-1: symbolic link to ../../devices/pci0000:00/0000:00:08.1/0000:05:00.0/drm/card0/card0-eDP-1
/sys/class/drm/card0-HDMI-A-1: symbolic link to ../../devices/pci0000:00/0000:00:08.1/0000:05:00.0/drm/card0/card0-HDMI-A-1

(/sys/class/drm/card0/device/vendor is 0x1002)

Here's the same when the Nvidia dGPU is enabled for offload use (prime-select offload):

/sys/class/drm/card0: symbolic link to ../../devices/pci0000:00/0000:00:08.1/0000:05:00.0/drm/card0
/sys/class/drm/card0-DP-1: symbolic link to ../../devices/pci0000:00/0000:00:08.1/0000:05:00.0/drm/card0/card0-DP-1
/sys/class/drm/card0-eDP-1: symbolic link to ../../devices/pci0000:00/0000:00:08.1/0000:05:00.0/drm/card0/card0-eDP-1
/sys/class/drm/card0-HDMI-A-1: symbolic link to ../../devices/pci0000:00/0000:00:08.1/0000:05:00.0/drm/card0/card0-HDMI-A-1
/sys/class/drm/card1: symbolic link to ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/drm/card1

(/sys/class/drm/card0/device/vendor is still 0x1002, /sys/class/drm/card1/device/vendor is 0x10de)

So, only /sys/class/drm/card1 was added. However, System-monitor-amd.png is a screenshot snippet showing System Monitor when only the AMD iGPU is enabled (I've added it to show GPU1's max RAM, which I've limited to 512 MB for the iGPU). Then, when I just enable the Nvidia dGPU (run "prime-select offload") and log out and in again, the System-monitor-offload1.png snippet shows that System Monitor's GPU1 max RAM is now 4 GB (which is what the Nvidia dGPU has). And adding GPU2, System-monitor-offload2.png shows that GPU2's max RAM is the iGPU's 512 MB.

Ok, so DRM has stable numbering but UDev does not.

https://invent.kde.org/plasma/ksystemstats/-/merge_requests/35

This should fix things, though it may end up swapping devices for some people.

It seems that the mentioned merge request has been blocked; is there any kind of rough guess when the fix might end up in production?

Git commit 5eed0d51c0830ce1099e308e0326a5ff9b0ec82d by Arjen Hiemstra.
Committed on 14/06/2022 at 11:36.
Pushed by ahiemstra into branch 'master'.

GPU: Query for DRM devices and use DRM number as card number

The order in which PCI devices are enumerated can apparently change with some driver changes. This means that GPU 1 suddenly becomes GPU 2 and the other way around. The DRM subsystem does seem to have a consistent numbering for these devices, so query the DRM subsystem for devices and use its numbering for GPU indexing so that it remains stable. (A sketch of this indexing approach follows at the end of the report.)

M +17 -13 plugins/gpu/LinuxBackend.cpp

https://invent.kde.org/plasma/ksystemstats/commit/5eed0d51c0830ce1099e308e0326a5ff9b0ec82d

(In reply to Arjen Hiemstra from comment #11)
> Git commit 5eed0d51c0830ce1099e308e0326a5ff9b0ec82d by Arjen Hiemstra.
> Committed on 14/06/2022 at 11:36.
> Pushed by ahiemstra into branch 'master'.
>
> GPU: Query for DRM devices and use DRM number as card number
>
> The order in which PCI devices are enumerated can apparently change with
> some driver changes. This means that GPU 1 suddenly becomes GPU 2 and
> the other way around. The DRM subsystem does seem to have a consistent
> numbering for these devices, so query the DRM subsystem for devices and
> use its numbering for GPU indexing so that it remains stable.
>
> M +17 -13 plugins/gpu/LinuxBackend.cpp
>
> https://invent.kde.org/plasma/ksystemstats/commit/5eed0d51c0830ce1099e308e0326a5ff9b0ec82d

This fails here, as I have DRM disabled in the kernel, so there is no /sys/class/drm.

Forgot to add: kernel 6.0.8, nvidia-drivers 525.53, RTX 3060.

Just reverting the changes used to work with previous versions of the nvidia drivers. With the current one all the values are 0, but I guess this is another bug.

*** Bug 465559 has been marked as a duplicate of this bug. ***
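For context, here is a rough sketch of the approach the commit above describes: derive the GPU index from the DRM cardN name rather than from the udev enumeration position. It is illustrative only; the `drmCardNumber` helper and the "gpu N+1" labeling are assumptions, not the actual code in plugins/gpu/LinuxBackend.cpp. It also shows where a fallback is needed on kernels built without DRM, as in the preceding comment.

```cpp
// Sketch of the commit's idea: use the number in the DRM "cardN" name
// as a stable GPU index instead of the udev enumeration position.
#include <filesystem>
#include <iostream>
#include <optional>
#include <string>

namespace fs = std::filesystem;

// Extract N from "cardN"; returns std::nullopt for connectors
// ("card0-eDP-1") and non-card entries ("renderD128").
std::optional<int> drmCardNumber(const std::string &name)
{
    if (name.rfind("card", 0) != 0) {
        return std::nullopt;
    }
    const std::string digits = name.substr(4);
    if (digits.empty() || digits.find_first_not_of("0123456789") != std::string::npos) {
        return std::nullopt;
    }
    return std::stoi(digits);
}

int main()
{
    const fs::path drm{"/sys/class/drm"};
    if (!fs::exists(drm)) {
        // Kernels built without DRM have no /sys/class/drm at all, so a
        // backend relying on it needs a fallback (the failure reported above).
        std::cerr << "no DRM subsystem; need a fallback ordering\n";
        return 1;
    }
    for (const auto &entry : fs::directory_iterator(drm)) {
        if (auto n = drmCardNumber(entry.path().filename().string())) {
            // Label card0 as "gpu1", card1 as "gpu2", and so on.
            std::cout << "gpu" << (*n + 1) << " <- "
                      << entry.path().filename().string() << '\n';
        }
    }
}
```

Note the early return: on a kernel without DRM support there is nothing under /sys/class/drm to key off, so a backend taking this approach has to fall back to some other ordering, which is exactly the situation the last commenter hit.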