Please note that I used Claude AI to analyze this problem, but I did so while thinking along with it and asking critical questions over the course of several days. I do not have the extensive Linux knowledge required for this kind of analysis (although I understand the basic functionality of a Linux system). Its final conclusion made enough sense to me to post it here. What follows is a Claude-generated summary of the issue:

SYSTEM INFORMATION

Distro: Fedora 43
Plasma: 6.5
KWin: 6.5.5 (Wayland)
Kernel: 6.18.9-200.fc43.x86_64
GPU: NVIDIA RTX 4080 (proprietary driver 580.119.02, open kernel modules)
CPU/iGPU: AMD Ryzen 7000 (Raphael iGPU, no displays connected)
Displays: DP-1 + HDMI-A-1, both on NVIDIA GPU
DRM devices: card1 = NVIDIA (pci-0000:01:00.0), card2 = amdgpu (pci-0000:14:00.0)
Sleep mode: S3 deep sleep ([deep] in /sys/power/mem_sleep)
Initramfs: NVIDIA modules loaded so that the LUKS password prompt displays correctly

STEPS TO REPRODUCE

1. Boot normally, log into Plasma Wayland session
2. Suspend to RAM (S3 deep sleep). This was not a manually forced sleep; the system suspended by itself after the configured idle time
3. Wake the system (press power button or keyboard)

EXPECTED BEHAVIOR

Displays turn on, lock screen appears, session resumes normally.

ACTUAL BEHAVIOR

System wakes (fans, disks spin up), but both displays remain permanently black. Ctrl+Alt+F3 (VT switch) also produces no output. The system is otherwise alive (accessible via SSH). Without intervention, a hard reboot is required.

KEY FINDING: GPU IS FUNCTIONAL AFTER RESUME

Pressing Alt+SysRq+REISUB after a failed resume brings the display back at the S (sync) step. By that point, E (SIGTERM) and I (SIGKILL) have killed all userspace processes including kwin_wayland. The kernel reclaims DRM master and fbcon takes over the display successfully. This proves the GPU hardware and NVIDIA kernel modules are fully functional after resume.
The failure is in kwin_wayland, not in the kernel driver.

JOURNAL EVIDENCE

Resume timeline (journalctl -b -1):

14:09:18 nvidia-suspend.service runs successfully
14:09:19 System enters S3 deep sleep
14:31:05 System wakes: kernel resumes, CPUs come back online
14:31:05 amdgpu resumes normally (no displays connected, expected)
14:31:06 session-2.scope thawed, kwin_wayland is unfrozen
14:31:06 kwin_wayland: Failed to open drm node: "/dev/dri/card0" (card0 doesn't exist, harmless)
14:31:06 nvidia-resume.service starts
14:31:06 kwin_wayland: Atomic modeset test failed! Permission denied <-- FIRST FAILURE
14:31:06 kwin_wayland: Applying output configuration failed!
14:31:06 nvidia-resume.service finishes successfully
14:31:06 kwin_wayland: Setting dpms mode failed!
14:31:15 Hundreds of "Atomic modeset test failed! Permission denied", never recovers
14:32:20 Still spamming errors, kwin is permanently stuck

Relevant kwin_wayland messages:

kwin_wayland[2972]: Failed to open drm node: "/dev/dri/card0"
kwin_wayland[2972]: Failed to open drm node: "/dev/dri/card0"
kwin_wayland[2972]: Atomic modeset test failed! Permission denied
kwin_wayland[2972]: Applying output configuration failed!
kwin_wayland[2972]: Atomic modeset test failed! Permission denied
kwin_wayland[2972]: Setting dpms mode failed!
(repeats hundreds of times, never recovers)

logind only logs "Operation 'suspend' finished." There is no evidence of DRM master being re-granted to the session. nvidia-resume.service ran and completed successfully. The NVIDIA kernel driver resumed without errors.

ANALYSIS

The "Permission denied" error from drmModeAtomicCommit() indicates kwin has lost DRM master status during S3 suspend. Two problems prevent recovery:

1. DRM master is not re-granted after resume. logind does not appear to re-issue DRM master to the active session's kwin instance after S3 resume completes.
2. kwin has no recovery mechanism.
Once the first atomic modeset fails, kwin enters an infinite error loop, retrying the same failing operation without ever attempting to re-acquire DRM master. A fresh kwin instance (started after the old one is killed) acquires DRM master from logind without issues.

There is also a possible race condition: kwin is unfrozen and attempts modesetting at the same moment nvidia-resume.service is still running. However, the errors persist long after nvidia-resume.service completes, so the race is at most a trigger; the lack of DRM master recovery is the root cause.

WHY THIS IS A KWIN BUG (NOT NVIDIA)

- The SysRq test proves the GPU and nvidia-drm kernel module are fully operational after resume: fbcon can drive the displays via the same hardware.
- A freshly started kwin_wayland (after killing the stuck one) acquires DRM master and works perfectly.
- The failure is kwin not recovering from a lost DRM master state, regardless of why the DRM master was lost.

Bug 477738 was closed as RESOLVED DOWNSTREAM, attributing this to NVIDIA. The SysRq evidence contradicts that conclusion: the kernel driver works, but kwin does not attempt to re-acquire DRM master when it loses it during suspend.

RELATED BUGS

Bug 477738: Same error signature ("Atomic commit failed! Permission denied" after resume). Closed DOWNSTREAM. The SysRq evidence shows the issue is in kwin's lack of DRM master recovery.
Bug 509439: Fixed in KWin 6.5.0 (EGL context handling on resume). We run 6.5.5; this fix is present but insufficient.
Bug 478090: Fixed in Plasma 6.3.1 (lock screen black screen). Present in our version, not our issue.

WORKAROUND

Pressing Alt+SysRq+E kills all userspace. SDDM restarts, a fresh kwin acquires DRM master, and the session can be restored (unsaved work is lost). Mostly a technical workaround, not a functional one.
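
To put a number on the error storm described under JOURNAL EVIDENCE, the kwin_wayland messages can be tallied per distinct message. The following is only a minimal sketch: the function name count_kwin_messages and the regular expression are illustrative and not part of any KWin or systemd tooling, and it assumes journal lines contain the "kwin_wayland[PID]: message" fragment exactly as quoted in this report.

```python
import re
from collections import Counter

# Matches the "kwin_wayland[PID]: message" part of a journal line.
# Assumption: the fragment appears after the timestamp and hostname,
# as in the messages quoted in this report.
KWIN_LINE = re.compile(r"kwin_wayland\[\d+\]: (.*)$")


def count_kwin_messages(journal_lines):
    """Count how often each distinct kwin_wayland message occurs."""
    counts = Counter()
    for line in journal_lines:
        match = KWIN_LINE.search(line.rstrip("\n"))
        if match:
            counts[match.group(1).strip()] += 1
    return counts


# Sample input: the six messages quoted in this report.
SAMPLE = [
    'kwin_wayland[2972]: Failed to open drm node: "/dev/dri/card0"',
    'kwin_wayland[2972]: Failed to open drm node: "/dev/dri/card0"',
    "kwin_wayland[2972]: Atomic modeset test failed! Permission denied",
    "kwin_wayland[2972]: Applying output configuration failed!",
    "kwin_wayland[2972]: Atomic modeset test failed! Permission denied",
    "kwin_wayland[2972]: Setting dpms mode failed!",
]

# Print the most frequent messages first.
for message, count in count_kwin_messages(SAMPLE).most_common():
    print(f"{count:5d}  {message}")
```

On an affected system, the input lines could instead come from something like journalctl -b -1 -t kwin_wayland. A resume that recovers should show only a handful of "Permission denied" entries, while a stuck one shows hundreds or more.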
UPDATE: POSSIBLE WORKAROUND AND ADDITIONAL DATA

Setting KWIN_DRM_DEVICES=/dev/dri/card1 in /etc/environment (restricting kwin to only the NVIDIA GPU) appears to resolve the issue: suspend/resume worked on the next attempt. However, the bug may be intermittent, so this needs more testing.

With this variable set, kwin still hits the same "Permission denied" race with nvidia-resume.service, but recovers on its own within milliseconds:

19:08:53.100 session-2.scope thawed
19:08:53.103 nvidia-resume.service starts
19:08:53.109 kwin: Atomic modeset test failed! Permission denied
19:08:53.109 kwin: Setting dpms mode failed!
19:08:53.122 nvidia-resume.service finishes
(kwin recovers silently, session resumes normally)

Compare with the failed resume (without KWIN_DRM_DEVICES):

14:31:06.826 session-2.scope thawed
14:31:06.828 nvidia-resume.service starts
14:31:06.831 kwin: Failed to open drm node: "/dev/dri/card0"
14:31:06.835 kwin: Failed to open drm node: "/dev/dri/card0"
14:31:06.849 kwin: Atomic modeset test failed! Permission denied
14:31:06.849 kwin: Applying output configuration failed!
14:31:06.851 kwin: Atomic modeset test failed! Permission denied
14:31:06.856 nvidia-resume.service finishes
14:31:15 Error storm begins: 2766+ errors, never recovers

One visible difference is kwin trying to open /dev/dri/card0 during the failed resume, which doesn't exist (only card1=NVIDIA and card2=amdgpu are present). This might be what pushes kwin into the "Applying output configuration failed!" code path, which might in turn trigger the unrecoverable retry loop 9 seconds later. That said, a previous boot without KWIN_DRM_DEVICES also hit "Applying output configuration failed!" (from a failed card2 open) and recovered fine, so the card0 probe failure alone doesn't guarantee the loop. The catastrophic failure might require a specific combination of conditions.
What does seem clear is that the "Permission denied" modeset error by itself is recoverable — every boot has it briefly during the nvidia-resume race, and kwin handles it. Something additional has to go wrong to trigger the permanent loop.
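
To make the two timelines easier to compare, the timestamps can be converted into millisecond offsets from the moment session-2.scope is thawed. This is only a sketch: offsets_ms is a hypothetical helper name, the event labels are shortened from the log lines above, and the final "error storm" entry is left out because it has no millisecond timestamp.

```python
from datetime import datetime, timedelta


def offsets_ms(events):
    """Return (milliseconds since the first event, label) for each event.

    events: list of ("HH:MM:SS.mmm", label) tuples in chronological order.
    Assumption: all events fall on the same day (no midnight rollover).
    """
    parsed = [(datetime.strptime(ts, "%H:%M:%S.%f"), label) for ts, label in events]
    start = parsed[0][0]
    one_ms = timedelta(milliseconds=1)
    # Integer division of two timedeltas yields a whole number of milliseconds.
    return [((when - start) // one_ms, label) for when, label in parsed]


# Resume with KWIN_DRM_DEVICES set (recovered).
WORKING = [
    ("19:08:53.100", "session-2.scope thawed"),
    ("19:08:53.103", "nvidia-resume.service starts"),
    ("19:08:53.109", "Atomic modeset test failed! Permission denied"),
    ("19:08:53.122", "nvidia-resume.service finishes"),
]

# Resume without KWIN_DRM_DEVICES (never recovered).
FAILED = [
    ("14:31:06.826", "session-2.scope thawed"),
    ("14:31:06.828", "nvidia-resume.service starts"),
    ("14:31:06.831", "Failed to open drm node: /dev/dri/card0"),
    ("14:31:06.835", "Failed to open drm node: /dev/dri/card0"),
    ("14:31:06.849", "Atomic modeset test failed! Permission denied"),
    ("14:31:06.849", "Applying output configuration failed!"),
    ("14:31:06.851", "Atomic modeset test failed! Permission denied"),
    ("14:31:06.856", "nvidia-resume.service finishes"),
]

for name, events in (("working", WORKING), ("failed", FAILED)):
    print(name)
    for delta, label in offsets_ms(events):
        print(f"  +{delta:3d} ms  {label}")
```

In both runs the first "Permission denied" lands before nvidia-resume.service finishes (+9 ms versus +22 ms in the working run, +23 ms versus +30 ms in the failed run), which fits the observation that the race by itself does not distinguish a recoverable resume from a stuck one.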
Thanks for the bug report. This looks like bug 515550, which has a fix in progress, so I'll merge this report in with that one.

*** This bug has been marked as a duplicate of bug 515550 ***