Bug 492428 - the kernel crashes and hard-locks the system
Summary: the kernel crashes and hard-locks the system
Status: RESOLVED DUPLICATE of bug 442846
Alias: None
Product: kwin
Classification: Plasma
Component: wayland-generic
Version: unspecified
Platform: Other Linux
Importance: NOR major
Target Milestone: ---
Assignee: KWin default assignee
URL:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2024-08-30 22:35 UTC by Nathaniel Graham
Modified: 2024-09-24 13:19 UTC
CC List: 2 users

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:


Attachments
journalctl -b -1 (903.57 KB, text/plain)
2024-08-30 22:35 UTC, Nathaniel Graham

Description Nathaniel Graham 2024-08-30 22:35:59 UTC
Created attachment 173145
journalctl -b -1

SUMMARY
The system fully crashes, and I cannot use Ctrl+Alt+F3 to switch to another TTY.

STEPS TO REPRODUCE
1. Start Timberborn on Steam
2. Exit Timberborn

OBSERVED RESULT
The entire system freezes completely, and I cannot switch TTYs to attempt a clean recovery. The power button was the only way I was able to shut down or reboot.

EXPECTED RESULT
The game should exit gracefully

SOFTWARE/OS VERSIONS
Linux/KDE Plasma: Nobara 40
KDE Plasma Version: 6.1.3
KDE Frameworks Version: 6.4.0
Qt Version: 6.7.2

ADDITIONAL INFORMATION
Graphics Platform: Wayland

When downgrading these packages, the crashes stop happening and the system is stable:
- kwin
- plasma-desktop
- plasma-workspace

The versions that are stable are
kwin 6.1.1-1.fc40
plasma-desktop 6.1.1-2.fc40
plasma-workspace 6.1.3-4.fc40

HARDWARE
Framework Laptop 16 (AMD Ryzen 7040 Series)
Processors: 16 x AMD Ryzen 9 7940HS w/ Radeon 780M Graphics
Memory 62.0 GiB of RAM
Graphics Processor: AMD Radeon 780M + AMD Radeon 7700S Dedicated GPU
Comment 1 Nathaniel Graham 2024-08-30 22:37:45 UTC
I apologize if `core` was the wrong product; I don't actually know what part of KDE handles that interaction. I've uploaded the `journalctl -b -1` log to hopefully clear up what's actually happening.
Comment 2 Nathaniel Graham 2024-09-01 07:05:31 UTC
I have narrowed it down to one of the kwin packages. I upgraded plasma-desktop and plasma-workspace without issue. Then, when I upgraded kwin (kwin, kwin-common, kwin-libs, kwin-wayland, kwin-x11), the issues returned. This is the transaction that causes the issue:

❯ sudo dnf5 upgrade kwin
Place your finger on the fingerprint reader
Updating and loading repositories:
Repositories loaded.
Package                                                                  Arch             Version                                                                  Repository                                    Size
Upgrading:                                                                                                                                                                                                           
 kwin                                                                    x86_64           6.1.3-2.fc40                                                             nobara-baseos-40                          12.0   B
  replacing kwin                                                         x86_64           6.1.1-1.fc40                                                             nobara-baseos-40                          12.0   B
 kwin-common                                                             x86_64           6.1.3-2.fc40                                                             nobara-baseos-40                          13.1 MiB
  replacing kwin-common                                                  x86_64           6.1.1-1.fc40                                                             nobara-baseos-40                          13.2 MiB
 kwin-libs                                                               x86_64           6.1.3-2.fc40                                                             nobara-baseos-40                           7.9 MiB
  replacing kwin-libs                                                    x86_64           6.1.1-1.fc40                                                             nobara-baseos-40                           7.9 MiB
 kwin-wayland                                                            x86_64           6.1.3-2.fc40                                                             nobara-baseos-40                           1.5 MiB
  replacing kwin-wayland                                                 x86_64           6.1.1-1.fc40                                                             nobara-baseos-40                           1.5 MiB
 kwin-x11                                                                x86_64           6.1.3-2.fc40                                                             nobara-baseos-40                           1.4 MiB
   replacing kwin-x11                                                    x86_64           6.1.1-1.fc40                                                             nobara-baseos-40                           1.4 MiB

Transaction Summary:
 Upgrading:         5 packages
 Replacing:         5 packages
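
To return to the last known-good build, the five packages from the transaction above can be downgraded together. This is a minimal sketch, assuming the 6.1.1-1.fc40 builds are still available in the repository and that `dnf5 downgrade` accepts `name-version-release` specs the way `dnf` does. The helper names `kwin_specs` and `kwin_downgrade_cmd` are made up for illustration. The script only prints the command so it can be reviewed before running it:

```shell
# Print one "name-version-release" spec per kwin subpackage.
# The package list is taken from the transaction above; $1 is the
# version-release to go back to (an assumption: it must still be in the repo).
kwin_specs() {
    for pkg in kwin kwin-common kwin-libs kwin-wayland kwin-x11; do
        printf '%s-%s\n' "$pkg" "$1"
    done
}

# Print (do not run) the downgrade command so it can be checked first.
kwin_downgrade_cmd() {
    printf 'sudo dnf5 downgrade'
    kwin_specs "$1" | while read -r spec; do
        printf ' %s' "$spec"
    done
    printf '\n'
}

kwin_downgrade_cmd 6.1.1-1.fc40
```

Downgrading all five packages in one transaction avoids mixing kwin subpackages from different builds.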
Comment 3 Nathaniel Graham 2024-09-01 07:20:45 UTC
Additionally, now that I've narrowed down the package, I'm more confident that a second bug I've been having is related and relevant.

In Guild Wars 2, if I am using my external monitor (set as primary), and the game is in fullscreen, moving the camera causes the screen to freeze so I cannot see what is happening. It's as if the game is frozen. However, when I release the camera, the game snaps to whatever point I moved the camera to. When I tried to get a recording of this, I was surprised to find that OBS cannot actually record this issue, because OBS records the game footage correctly, with the camera moving properly. It is only my display that freezes, not the game itself.

This is a much less problematic issue, of course. However, since it occurs with the exact same package update as the main issue, I hope this might still be helpful information.
Comment 4 Nate Graham 2024-09-07 07:57:32 UTC
Aug 13 20:26:33 framework16 kwin_wayland_wrapper[3223]: amdgpu: The CS has been rejected, see dmesg for more information (-22).
Aug 13 20:26:33 framework16 kwin_wayland_wrapper[3223]: (EE)
Aug 13 20:26:33 framework16 kwin_wayland_wrapper[3223]: (EE) Backtrace:
Aug 13 20:26:33 framework16 kwin_wayland_wrapper[3223]: (EE) 0: /usr/bin/Xwayland (0x555a34643000+0x166122) [0x555a347a9122]
Aug 13 20:26:33 framework16 kwin_wayland_wrapper[3223]: (EE) 1: /usr/bin/Xwayland (0x555a34643000+0x166225) [0x555a347a9225]
Aug 13 20:26:33 framework16 kwin_wayland_wrapper[3223]: (EE) 2: /lib64/libc.so.6 (0x7faeb520f000+0x40d00) [0x7faeb524fd00]
Aug 13 20:26:33 framework16 kwin_wayland_wrapper[3223]: (EE) 3: /lib64/libc.so.6 (0x7faeb520f000+0x99664) [0x7faeb52a8664]
Aug 13 20:26:33 framework16 kwin_wayland_wrapper[3223]: (EE) 4: /lib64/libc.so.6 (gsignal+0x1e) [0x7faeb524fc4e]
Aug 13 20:26:33 framework16 kwin_wayland_wrapper[3223]: (EE) 5: /lib64/libc.so.6 (abort+0xdf) [0x7faeb5237902]
Aug 13 20:26:33 framework16 kwin_wayland_wrapper[3223]: (EE) 6: /usr/lib64/dri/radeonsi_dri.so (0x7faeb1e00000+0x945a40) [0x7faeb2745a40]
Aug 13 20:26:33 framework16 kwin_wayland_wrapper[3223]: (EE) 7: /usr/lib64/dri/radeonsi_dri.so (0x7faeb1e00000+0x948993) [0x7faeb2748993]
Aug 13 20:26:33 framework16 kwin_wayland_wrapper[3223]: (EE) 8: /usr/lib64/dri/radeonsi_dri.so (0x7faeb1e00000+0x84b21) [0x7faeb1e84b21]
Aug 13 20:26:33 framework16 kwin_wayland_wrapper[3223]: (EE) 9: /usr/lib64/dri/radeonsi_dri.so (0x7faeb1e00000+0xa7bac) [0x7faeb1ea7bac]
Aug 13 20:26:33 framework16 kwin_wayland_wrapper[3223]: (EE) 10: /lib64/libc.so.6 (0x7faeb520f000+0x976d7) [0x7faeb52a66d7]
Aug 13 20:26:33 framework16 audit[3223]: ANOM_ABEND auid=1000 uid=1000 gid=1000 ses=3 subj=unconfined pid=3223 comm="Xwayland:cs0" exe="/usr/bin/Xwayland" sig=6 res=1
Aug 13 20:26:33 framework16 kwin_wayland_wrapper[3223]: (EE) 11: /lib64/libc.so.6 (0x7faeb520f000+0x11b60c) [0x7faeb532a60c]
Aug 13 20:26:33 framework16 kwin_wayland_wrapper[3223]: (EE)
Aug 13 20:26:33 framework16 kwin_wayland_wrapper[3223]: (EE)
Aug 13 20:26:33 framework16 kwin_wayland_wrapper[3223]: Fatal server error:
Aug 13 20:26:33 framework16 kwin_wayland_wrapper[3223]: (EE) Caught signal 6 (Aborted). Server aborting
Aug 13 20:26:33 framework16 kwin_wayland_wrapper[3223]: (EE)
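
For anyone checking their own logs for the same signature, the relevant lines can be pulled out of a large journal dump with a small filter. This is a rough sketch: the patterns are copied from the excerpt above, and the function name `gpu_crash_lines` is made up for illustration. Other drivers or Xwayland versions may word these messages differently.

```shell
# Read journal text on stdin and keep only the lines that point at a
# rejected amdgpu command submission or at Xwayland aborting.
# The patterns come from the log excerpt above; "|| true" keeps the
# function from reporting failure when nothing matches.
gpu_crash_lines() {
    grep -E 'amdgpu: The CS has been rejected|Fatal server error|Caught signal [0-9]+|ANOM_ABEND' || true
}

# Example (needs the previous boot's journal to still be available):
#   journalctl -b -1 --no-pager | gpu_crash_lines
```

The same filter works on the attached text file, for example `gpu_crash_lines < journal.txt`, where `journal.txt` stands in for wherever the attachment was saved.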
Comment 5 Nathaniel Graham 2024-09-11 19:45:20 UTC
I've discovered that DP-alt mode seems to be causing this. When I switched from using the USB-C port on my monitor (which uses DP-alt mode) to HDMI, the issues completely disappeared. Guild Wars 2 stopped glitching, and Timberborn stopped crashing.

I also tried a USB-C dock with an HDMI output. That didn't cause the same crashes, but it did cause a related issue: Timberborn gets capped at 30fps in fullscreen mode, while in windowed mode it does not.
Comment 6 Zamundaaa 2024-09-11 22:12:11 UTC
The problem here is on the kernel driver side, mostly. When a GPU reset happens, Xwayland goes down, and KWin has some blocking calls to Xwayland, so it can hang.
Getting rid of those blocking calls is something we can hopefully do at some point, see bug 442846 for that, but the GPU resets should be fixed as well, or the game will still crash. You can report that at https://gitlab.freedesktop.org/mesa/mesa/-/issues

(In reply to Nathaniel Graham from comment #5)
> Specifically, Timberborn, when in fullscreen mode, gets capped at 30fps,
> while in windowed mode, it does not.
Do you have amdvlk installed? If so, uninstalling it will fix that.
Comment 7 Nathaniel Graham 2024-09-12 00:50:16 UTC
(In reply to Zamundaaa from comment #6)
> Do you have amdvlk installed? If so, uninstalling it will fix that.

I do not. It does appear that the 30fps thing wasn't very reproducible, even for me, so I'm not sure what was going on there.
Comment 8 Nathaniel Graham 2024-09-12 23:10:47 UTC
(In reply to Zamundaaa from comment #6)
> but the GPU resets should be fixed as well, or the game will still crash.

One thing that's interesting though is that using gamescope makes this issue go away entirely. The game does not crash, it works perfectly, and quitting the game does not freeze the system. Is it possible that KDE or xwayland or something is actually causing the GPU reset in the first place?
Comment 9 Vlad Zahorodnii 2024-09-17 11:07:08 UTC

*** This bug has been marked as a duplicate of bug 475322 ***
Comment 10 Nathaniel Graham 2024-09-17 16:41:42 UTC
(In reply to Vlad Zahorodnii from comment #9)
> 
> *** This bug has been marked as a duplicate of bug 475322 ***

I cannot reproduce the bug via the method described in bug 475322. I don't believe this is a duplicate.

STEPS TO REPRODUCE (as given in bug 475322)
1. Reset the GPU with `cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover`
2. Attempt to launch an X app
or, alternatively:
1. Have an X app already open
2. Reset the GPU with `cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover`


I DID:
1. run `xeyes`
2. Reset the GPU with `cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover`

My system recovered successfully without issue. Xeyes and Xwayland did not crash.
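
The `1` in the debugfs path above is specific to this machine; on other systems, or on the other GPU of a dual-GPU laptop, the index may differ. A small sketch for listing which nodes expose the reset file before picking one is shown below. The function name `amdgpu_recover_nodes` is made up for illustration, and reading the real debugfs tree normally requires root.

```shell
# List debugfs DRI nodes that expose amdgpu_gpu_recover.
# $1 is the debugfs DRI directory; it defaults to the path used above.
# This only lists the files. Reading one of them (e.g. with cat) is what
# triggers the GPU reset, so that step is deliberately left out.
amdgpu_recover_nodes() {
    base="${1:-/sys/kernel/debug/dri}"
    for f in "$base"/*/amdgpu_gpu_recover; do
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

# Example (as root):
#   amdgpu_recover_nodes
```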
Comment 11 Nathaniel Graham 2024-09-17 16:53:24 UTC
I did just try it with Timberborn (without gamescope), and it did indeed hang *briefly*. However, after about 10 seconds, it successfully reset, and Timberborn was still running. This is not what occurs when I exit the game.

However, when attempting the gpu reset when the game was running in gamescope, it did cause gamescope to become unresponsive. The application was still able to be closed from Steam itself.
Comment 12 Nathaniel Graham 2024-09-17 17:06:25 UTC
Apologies for the spam; I forget that I cannot update previous comments.

I also tried manually resetting the GPU while running Guild Wars 2. It is also an Xwayland+Proton game, just like Timberborn, but it does not hang. It simply crashes, referencing an "unrecoverable graphics driver error" (the GPU reset). Using different versions of Proton does *slightly* change the outcome.

Proton 8.0 lets me click "ok", allowing the game to exit as gracefully as it can.
Proton 9.0 does not let me click "ok", and instead, when I close the window, the game closes and Steam crashes (and then recovers).
GE-Proton-9-13 does not let me click "ok", and Steam does not crash.
Comment 13 Nathaniel Graham 2024-09-23 02:04:08 UTC
I have just discovered that the Steam Overlay is most likely the source of this bug, or is at least the bit of software triggering it. Turning off the Steam Overlay also makes this bug go away, even if I don't use gamescope.
Comment 14 Zamundaaa 2024-09-24 13:19:52 UTC
There are multiple ways to trigger KWin hanging on Xwayland, but ultimately it's the same problem - Xwayland hanging, or in some cases crashing, can make KWin hang in xcb functions.

*** This bug has been marked as a duplicate of bug 442846 ***