SUMMARY
Using a System Monitor widget set to display NVIDIA GPU activity/stats causes nvidia-smi to be run repeatedly, and this appears to be the cause of hitching in the Plasma Wayland session with either the 555 or 560 NVIDIA drivers.

STEPS TO REPRODUCE
1. On an NVIDIA GPU without `nvidia.NVreg_EnableGpuFirmware=0` set, add a System Monitor sensor widget to a Plasma panel and set it to monitor that GPU.
2. Use the Wayland session and observe window dragging or desktop effects: performance is slower and noticeably choppy.

OBSERVED RESULT
Plasma desktop performance is choppy, almost as if it were running at half the display refresh rate.

EXPECTED RESULT
The Plasma desktop should be smooth.

SOFTWARE/OS VERSIONS
Linux/KDE Plasma: kernel 6.9
KDE Plasma Version: 6.1.5
KDE Frameworks Version: 6.4.0
Qt Version: 6.7.2

ADDITIONAL INFORMATION
Might be related to bug #487728, which also only affects the Wayland session, though it is hard to say whether this is also a contributing cause of the Wayland plasmashell crashes. This cause was suggested by an NVIDIA contributor at https://github.com/NVIDIA/open-gpu-kernel-modules/issues/538#issuecomment-2251021404
The problem with using the suggested library is that its headers are part of a proprietary SDK that cannot be freely distributed, which would make the NVIDIA GPU integration practically unbuildable on most machines. Even if we were to include the header in ksystemstats (which its license doesn't actually allow, though I see some projects do), we'd still be stuck, since the library itself is bundled with the driver and that is generally also not installed on build machines. So ultimately, running `nvidia-smi` is pretty much the only way we can support this without introducing a nasty build system issue. And frankly, it seems to me that it's an upstream issue anyway? Running `nvidia-smi` shouldn't have such an impact in the first place?
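For illustration, this is roughly what the nvidia-smi-based approach boils down to. This is a simplified sketch rather than the actual ksystemstats code; the nvidia-smi query flags are standard options, everything else here is hypothetical:

```cpp
// Hypothetical sketch: poll GPU stats by running nvidia-smi on a timer.
// Not the actual ksystemstats implementation; it only illustrates the approach.
#include <QCoreApplication>
#include <QProcess>
#include <QTimer>
#include <QDebug>

static void pollGpu()
{
    QProcess smi;
    // Standard nvidia-smi flags: query GPU utilization and memory use as plain CSV.
    smi.start(QStringLiteral("nvidia-smi"),
              {QStringLiteral("--query-gpu=utilization.gpu,memory.used"),
               QStringLiteral("--format=csv,noheader,nounits")});
    // Blocking wait is fine for a standalone sketch; a real sensor backend
    // would read the output asynchronously instead.
    if (!smi.waitForFinished(1000)) {
        qWarning() << "nvidia-smi did not finish in time";
        return;
    }
    // One line per GPU, e.g. "42, 1234"
    const auto lines = smi.readAllStandardOutput().split('\n');
    for (const QByteArray &line : lines) {
        const auto fields = line.trimmed().split(',');
        if (fields.size() == 2)
            qDebug() << "GPU util:" << fields[0].trimmed() << "%, mem:" << fields[1].trimmed() << "MiB";
    }
}

int main(int argc, char **argv)
{
    QCoreApplication app(argc, argv);
    QTimer timer;
    QObject::connect(&timer, &QTimer::timeout, &pollGpu);
    timer.start(2000); // poll every 2 seconds
    return app.exec();
}
```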
Hey there. Sorry, I misread the relevant code and thought you were constantly spawning and killing the nvidia-smi process. IIUC that is not the case; you're just running `nvidia-smi pmon` in the background and parsing its output, which sidesteps the common problem other monitoring tools had, where the setup/teardown was causing the issue. But I now see how even just `nvidia-smi pmon` can be a source of stutter, because it fetches a lot of data out of the GSP via NV2080_CTRL_CMD_PERF_GET_GPUMON_PERFMON_UTIL_SAMPLES_V2. Switching to NVML would not fix this. I'll have to look a bit deeper and see whether there's a better way to get the needed info, and/or whether this can be fixed in NVML or the driver itself.
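(For reference, the pattern being described is sketched very roughly below. This is not the actual ksystemstats code, and the exact pmon column layout varies between driver versions.)

```cpp
// Rough sketch of the "long-running nvidia-smi pmon" pattern described above.
// NOT the actual ksystemstats code; the column layout is driver-version dependent.
#include <QCoreApplication>
#include <QProcess>
#include <QDebug>

int main(int argc, char **argv)
{
    QCoreApplication app(argc, argv);

    QProcess pmon;
    // Start one persistent process instead of spawning nvidia-smi per sample;
    // "-d 2" asks pmon to emit a sample every 2 seconds.
    pmon.start(QStringLiteral("nvidia-smi"),
               {QStringLiteral("pmon"), QStringLiteral("-d"), QStringLiteral("2")});

    QObject::connect(&pmon, &QProcess::readyReadStandardOutput, [&pmon]() {
        while (pmon.canReadLine()) {
            const QByteArray line = pmon.readLine().trimmed();
            if (line.isEmpty() || line.startsWith('#'))
                continue; // skip the header lines pmon prints
            // Typical (version-dependent) columns: gpu pid type sm mem enc dec command
            const auto fields = line.simplified().split(' ');
            if (fields.size() >= 5)
                qDebug() << "pid" << fields[1] << "sm%" << fields[3] << "mem%" << fields[4];
        }
    });

    return app.exec();
}
```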
Thanks Milos, looking forward to that!
Just an update: the 565.xx versions of the NVIDIA driver contain a change that should make this much, much less noticeable. There is still some suboptimal processing happening that we will address in a future release (though probably not 570), but hopefully this suffices to make the monitor widget usable again. (For the particularly curious: the issue was due to architectural differences between x86 and RISC-V. Code that was "fast enough" on x86 ended up, once ported to RISC-V, in a particularly slow path chasing function pointers, so the NVML API call was notably slower overall.)