Bug 475322 - A GPU reset (amdgpu) causes Xwayland to hang kwin if an app interacts with X11
Status: RESOLVED DUPLICATE of bug 442846
Alias: None
Product: kwin
Classification: Plasma
Component: wayland-generic
Version: 5.27.8
Platform: Arch Linux (OS: Linux)
Importance: NOR normal
Target Milestone: ---
Assignee: KWin default assignee
Reported: 2023-10-07 14:08 UTC by fililip
Modified: 2024-09-17 11:07 UTC


Description fililip 2023-10-07 14:08:38 UTC
SUMMARY
When a GPU reset is triggered (on amdgpu), the Wayland session successfully recovers (and so do all robust Wayland apps) if there are no X windows in the session.
However, if an X app is launched after a GPU reset without restarting Xwayland first (by killing it and having it promptly restored by KWin), or if there are any X apps active when the reset occurs, the entire session hangs to the point where even input doesn't work (can't switch to another tty) and the only way to interact with the computer is to use SSH and kill KWin to restart it from there.

STEPS TO REPRODUCE
1. Reset the GPU with `cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover`
2. Attempt to launch an X app
or, alternatively:
1. Have an X app already open
2. Reset the GPU with `cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover`

OBSERVED RESULT
The desktop session is frozen

EXPECTED RESULT
The desktop session recovers and all X apps are at worst killed and at best recovered if robust

SOFTWARE/OS VERSIONS
Linux: Arch 6.5.5-zen1-1-zen
KDE Plasma Version: 5.27.8
KDE Frameworks Version: 5.110.0
Qt Version: 5.15.10

ADDITIONAL INFORMATION
If no X apps are open during the reset, Xwayland can be restarted, and it works without issue afterwards.
Comment 1 David Edmundson 2023-10-11 07:15:14 UTC
Backtrace of kwin:

#0  0x00007fdfcad20ebf in poll () at /usr/lib/libc.so.6
#1  0x00007fdfcfb5820b in  () at /usr/lib/libxcb.so.1
#2  0x00007fdfcfb58910 in  () at /usr/lib/libxcb.so.1
#3  0x00007fdfcfb59b46 in xcb_wait_for_reply () at /usr/lib/libxcb.so.1
#4  0x00007fdfcec1ab7c in KWin::Xcb::AbstractWrapper<KWin::Xcb::PropertyData>::getReply() (this=0x7ffd1ddc18b0)
    at /home/david/projects/kde6/src/kde/workspace/kwin/src/utils/xcbutils.h:349
#5  0x00007fdfced3d4e9 in KWin::Xcb::AbstractWrapper<KWin::Xcb::PropertyData>::isNull() (this=0x7ffd1ddc18b0)
    at /home/david/projects/kde6/src/kde/workspace/kwin/src/utils/xcbutils.h:271
#6  0x00007fdfced1dc37 in KWin::readWindowProperty(unsigned int, unsigned int, unsigned int, int) (win=12582922, atom=388, type=6, format=32)
    at /home/david/projects/kde6/src/kde/workspace/kwin/src/effects.cpp:80
#7  0x00007fdfced23f6a in KWin::EffectWindowImpl::readProperty(long, long, int) const (this=0x5573dfd9ec20, atom=388, type=6, format=32)
    at /home/david/projects/kde6/src/kde/workspace/kwin/src/effects.cpp:2030
#8  0x00005573de5c010a in KWin::BlurEffect::updateBlurRegion(KWin::EffectWindow*) (this=0x5573e045a070, w=0x5573dfd9ec20)
    at /home/david/projects/kde6/src/kde/workspace/kwin/src/plugins/blur/blur.cpp:218
#9  0x00005573de5bf9e2 in KWin::BlurEffect::slotWindowAdded(KWin::EffectWindow*) (this=0x5573e045a070, w=0x5573dfd9ec20)
    at /home/david/projects/kde6/src/kde/workspace/kwin/src/plugins/blur/blur.cpp:273
#10 0x00005573de5bf47c in KWin::BlurEffect::BlurEffect() (this=0x5573e045a070) at /home/david/projects/kde6/src/kde/workspace/kwin/src/plugins/blur/blur.cpp:126
#11 0x00005573de5be582 in KWin::blur_factory::createEffect() const (this=0x5573e03fb930) at /home/david/projects/kde6/src/kde/workspace/kwin/src/plugins/blur/main.cpp:12
#12 0x00007fdfced0ac3b in KWin::PluginEffectLoader::loadEffect(KPluginMetaData const&, QFlags<KWin::LoadEffectFlag>) (this=0x5573dfe4d720, info=..., flags=...)
    at /home/david/projects/kde6/src/kde/workspace/kwin/src/effectloader.cpp:310
#13 0x00007fdfced0b0e4 in KWin::PluginEffectLoader::queryAndLoadAll() (this=0x5573dfe4d720) at /home/david/projects/kde6/src/kde/workspace/kwin/src/effectloader.cpp:331
#14 0x00007fdfced0b73e in KWin::EffectLoader::queryAndLoadAll() (this=0x5573e0f98420) at /home/david/projects/kde6/src/kde/workspace/kwin/src/effectloader.cpp:400
#15 0x00007fdfced1bc50 in KWin::EffectsHandlerImpl::reconfigure() (this=0x5573e13c49d0) at /home/david/projects/kde6/src/kde/workspace/kwin/src/effects.cpp:341
#16 0x00007fdfced18d96 in KWin::EffectsHandlerImpl::EffectsHandlerImpl(KWin::Compositor*, KWin::WorkspaceScene*)
    (this=0x5573e13c49d0, compositor=0x5573dff3d1b0, scene=0x5573e1493850) at /home/david/projects/kde6/src/kde/workspace/kwin/src/effects.cpp:241
#17 0x00007fdfcee39eee in KWin::Application::createEffectsHandler(KWin::Compositor*, KWin::WorkspaceScene*)
    (this=0x7ffd1ddc4e48, compositor=0x5573dff3d1b0, scene=0x5573e1493850) at /home/david/projects/kde6/src/kde/workspace/kwin/src/main.cpp:353
#18 0x00007fdfcec37f1c in KWin::Compositor::startupWithWorkspace() (this=0x5573dff3d1b0) at /home/david/projects/kde6/src/kde/workspace/kwin/src/compositor.cpp:344
#19 0x00007fdfcec4e09b in KWin::WaylandCompositor::start() (this=0x5573dff3d1b0) at /home/david/projects/kde6/src/kde/workspace/kwin/src/compositor_wayland.cpp:51
#20 0x00007fdfcec39686 in KWin::Compositor::reinitialize() (this=0x5573dff3d1b0) at /home/david/projects/kde6/src/kde/workspace/kwin/src/compositor.cpp:605
#21 0x00007fdfcec39906 in KWin::Compositor::composite(KWin::RenderLoop*) (this=0x5573dff3d1b0, renderLoop=0x5573e003f3b0)
    at /home/david/projects/kde6/src/kde/workspace/kwin/src/compositor.cpp:630

----

Xwayland's state: warning: process 1112 is a zombie - the process has already terminated


1) We shouldn't have blocking X11 calls
2) This particular problem might go away when we move xwayland out
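
As a rough illustration of point 1, the sketch below bounds a wait on a connection's file descriptor with poll() instead of blocking indefinitely, which is what happens under xcb_wait_for_reply in frame #3. This is not KWin or libxcb code: readReplyWithTimeout is a hypothetical helper, and any readable file descriptor stands in for the X11 connection here.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper (not part of KWin or libxcb): wait at most timeoutMs
// for data on fd. Returns the bytes read, or std::nullopt if the peer stayed
// silent, hung up, or an error occurred.
std::optional<std::string> readReplyWithTimeout(int fd, int timeoutMs)
{
    assert(timeoutMs >= 0); // a negative timeout would make poll() block forever

    pollfd pfd{};
    pfd.fd = fd;
    pfd.events = POLLIN;

    const int ready = poll(&pfd, 1, timeoutMs);
    if (ready <= 0) {
        return std::nullopt; // 0: timed out, -1: poll() failed
    }
    if (!(pfd.revents & POLLIN)) {
        return std::nullopt; // only POLLHUP/POLLERR: peer is gone
    }

    char buf[256];
    const ssize_t n = read(fd, buf, sizeof(buf));
    if (n <= 0) {
        return std::nullopt; // EOF (peer closed) or read error
    }
    return std::string(buf, static_cast<std::size_t>(n));
}
```

With a real X11 connection the descriptor would presumably come from xcb_get_file_descriptor(); how such a timeout would fit into KWin's Xcb wrappers is not something this sketch tries to answer.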
Comment 2 Peter Eszlari 2023-11-01 09:23:20 UTC
I don't know if this makes a difference, but note that Mesa 23.2.1 lacks an AMD GPU reset fix:
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25023
Comment 3 fililip 2023-11-07 14:49:29 UTC
I don't think that's relevant (though it might well be); it works with native Wayland apps, just not with Xwayland (Xwayland blocks KWin).
Then again, resetting the GPU sometimes does make KWin crash/hang even without Xwayland, so that merge request may be a fix for a different problem.
Comment 4 fililip 2023-11-10 00:52:51 UTC
I tested Mesa 23.3, as well as 24. The reset mechanism is now extremely buggy; the second reset always resets kwin, killing all apps.
Is this a Mesa issue or a Plasma 5 issue (that's fixed in 6)?
Comment 5 fililip 2023-11-10 01:00:51 UTC
It starts with
kwin_wayland[60753]: kwin_scene_opengl: Waiting for glGetGraphicsResetStatus to return GL_NO_ERROR timed out!

and ends with a long stack trace:
#0  0x00007f870c43f73d syscall (libc.so.6 + 0x10e73d)
#1  0x00007f87045126ce n/a (radeonsi_dri.so + 0x1126ce)
#2  0x00007f8704c9b201 n/a (radeonsi_dri.so + 0x89b201)
#3  0x00007f8704c59d08 n/a (radeonsi_dri.so + 0x859d08)
#4  0x00007f8704c5b6f3 n/a (radeonsi_dri.so + 0x85b6f3)
#5  0x00007f870456dde1 n/a (radeonsi_dri.so + 0x16dde1)
#6  0x00007f87059043ae n/a (radeonsi_dri.so + 0x15043ae)
#7  0x00007f8704561c3f n/a (radeonsi_dri.so + 0x161c3f)
#8  0x00007f8704561d08 n/a (radeonsi_dri.so + 0x161d08)
#9  0x00007f870eb99ce4 _ZN4KWin9GLTextureC2Ejiiib (libkwinglutils.so.14 + 0x11ce4)
#10 0x0000557a93745f7f n/a (kwin_wayland + 0xbcf7f)
#11 0x0000557a93746738 n/a (kwin_wayland + 0xbd738)
#12 0x0000557a9374b36d n/a (kwin_wayland + 0xc236d)
#13 0x0000557a9374c5eb n/a (kwin_wayland + 0xc35eb)
#14 0x00007f870edf9633 n/a (libkwin.so.5 + 0x1f9633)
#15 0x00007f870edf9c22 n/a (libkwin.so.5 + 0x1f9c22)
#16 0x00007f870edf1710 _ZN4KWin12EffectLoader15queryAndLoadAllEv (libkwin.so.5 + 0x1f1710)
#17 0x00007f870ee04146 _ZN4KWin18EffectsHandlerImplC1EPNS_10CompositorEPNS_14WorkspaceSceneE (libkwin.so.5 + 0x204146)
#18 0x00007f870edc1302 _ZN4KWin10Compositor20startupWithWorkspaceEv (libkwin.so.5 + 0x1c1302)
#19 0x00007f870edb9114 _ZN4KWin10Compositor12reinitializeEv (libkwin.so.5 + 0x1b9114)
#20 0x00007f870d4d1097 n/a (libQt5Core.so.5 + 0x2d1097)
#21 0x00007f870ed716d7 _ZN4KWin10RenderLoop14frameRequestedEPS0_ (libkwin.so.5 + 0x1716d7)
#22 0x00007f870edc8598 _ZN4KWin17RenderLoopPrivate8dispatchEv (libkwin.so.5 + 0x1c8598)
#23 0x00007f870d4d1097 n/a (libQt5Core.so.5 + 0x2d1097)
#24 0x00007f870d4d2bcf _ZN6QTimer7timeoutENS_14QPrivateSignalE (libQt5Core.so.5 + 0x2d2bcf)
#25 0x00007f870d4c3b4e _ZN7QObject5eventEP6QEvent (libQt5Core.so.5 + 0x2c3b4e)
#26 0x00007f870cb788ff _ZN19QApplicationPrivate13notify_helperEP7QObjectP6QEvent (libQt5Widgets.so.5 + 0x1788ff)
#27 0x00007f870d49c168 _ZN16QCoreApplication15notifyInternal2EP7QObjectP6QEvent (libQt5Core.so.5 + 0x29c168)
#28 0x00007f870d4ea7cb _ZN14QTimerInfoList14activateTimersEv (libQt5Core.so.5 + 0x2ea7cb)
#29 0x00007f870d4eacb1 _ZN20QEventDispatcherUNIX13processEventsE6QFlagsIN10QEventLoop17ProcessEventsFlagEE (libQt5Core.so.5 + 0x2ea>
#30 0x0000557a937c0ce2 n/a (kwin_wayland + 0x137ce2)
#31 0x00007f870d49ae74 _ZN10QEventLoop4execE6QFlagsINS_17ProcessEventsFlagEE (libQt5Core.so.5 + 0x29ae74)
#32 0x00007f870d49c313 _ZN16QCoreApplication4execEv (libQt5Core.so.5 + 0x29c313)
#33 0x0000557a936dc40b n/a (kwin_wayland + 0x5340b)
#34 0x00007f870c358cd0 n/a (libc.so.6 + 0x27cd0)
#35 0x00007f870c358d8a __libc_start_main (libc.so.6 + 0x27d8a)
#36 0x0000557a936de015 n/a (kwin_wayland + 0x55015)

Looks like a Mesa problem, but unsure.
Comment 6 fililip 2023-11-10 01:11:31 UTC
Also, the first GPU reset is *extremely* slow with newer Mesa.
Comment 7 Zamundaaa 2023-11-23 18:20:41 UTC
I've just seen KWin hang on a GPU reset as well, and also in the blur effect. registerSupportProperty in effecthandler.cpp should probably not block on xcb, or at least have a timeout of some sort.
There's also the question of what Xwayland is doing that makes us hang in xcb; it didn't actually crash for me (or at least there are no coredumps). Sadly, I didn't check its backtrace before rebooting the PC, and I haven't been able to reproduce it (with intentional GPU resets) since then.
Comment 8 fililip 2024-01-02 16:10:50 UTC
As funny as that sounds, you can try using HIP with Blender to trigger a real reset (https://projects.blender.org/blender/blender/issues/100353, this is still not fixed...):
1. Open Blender
2. Select the Cycles HIP backend and your AMD GPU
3. Hit F12
4. Use the viewport renderer at the same time
5. This should almost instantly trigger a reset
Comment 9 fililip 2024-01-19 16:06:55 UTC
I've just had an idea: would it be possible to kill/restart Xwayland (togglable somehow with an env var or something similar) on a reset? That would be much better than a frozen desktop session.
Comment 10 fililip 2024-03-21 22:03:25 UTC
This is no longer an issue as of 6.0.2.
Comment 11 fililip 2024-03-24 13:34:41 UTC
Oops, unfortunately not, I ran into this issue again after I wrote a fragment shader that times out the GPU:

float t = mix(0.0, 1.0, sdf);
for (float i = 0.0; i < 700000000.0; i += 0.01) {
    t += distance(gl_FragCoord.xy, aspect);
}

I had VSCode (Xwayland) open and while the reset succeeded, KWin was completely stuck and I had to kill it.
Comment 12 Christopher Snowhill 2024-07-20 02:34:08 UTC
Having this problem on Plasma 6.
Comment 13 fililip 2024-07-22 17:01:48 UTC
To anyone who does shader dev and is worried about AMD GPU resets, I recommend applying the relevant patches for your GPUs from here: https://patchwork.freedesktop.org/series/136246/

I did just that, now I cannot get the card to reset at all, it just kicks the faulty job out of the scheduler and everything else continues to work fine.
Comment 14 Christopher Snowhill 2024-07-22 22:54:24 UTC
Which kernel series does that patch apply cleanly to? I can't get it to apply to 6.10. Does it require drm-next? linux-next?
Comment 15 Christopher Snowhill 2024-07-23 07:09:46 UTC
(In reply to fililip from comment #13)
> To anyone who does shader dev and is worried about AMD GPU resets, I
> recommend applying the relevant patches for your GPUs from here:
> https://patchwork.freedesktop.org/series/136246/
> 
> I did just that, now I cannot get the card to reset at all, it just kicks
> the faulty job out of the scheduler and everything else continues to work
> fine.

Never mind. This patch set is broken and causes a double add on GPU resets. I needed to locate the Git repository that holds the up-to-date version and pull from that.
Comment 16 fililip 2024-08-09 16:16:52 UTC
Oh, it looks like this issue is already tracked on Xorg's GitLab: https://gitlab.freedesktop.org/xorg/xserver/-/issues/1612

For this reason maybe it's better to mark this issue as upstream?
Comment 17 Zamundaaa 2024-08-09 17:47:58 UTC
(In reply to fililip from comment #16)
> Oh, it looks like this issue is already tracked on Xorg's GitLab:
> https://gitlab.freedesktop.org/xorg/xserver/-/issues/1612
> 
> For this reason maybe it's better to mark this issue as upstream?

No, the upstream issue is about Xwayland recovering; KWin still shouldn't hang just because Xwayland crashes.
Comment 18 fililip 2024-08-12 22:04:13 UTC
Yeah, you're right, that makes sense.

I came back to this issue, played with resets a bit more and noticed something odd with this hang:

1) I started a new, clean session,
2) I launched vkcube,
3) I triggered a reset with debugfs, the desktop recovered properly, the vkcube window became stuck, but no hang yet (that's why you might have gotten no coredump before),
4) I then attempted to move the vkcube window (with the Super + left mouse button combo) and immediately (a few frames later) got a hang. I presume this is when Xwayland crashed and became a zombie process.

This feels like KWin is trying to process an event for Xwayland but fails and waits indefinitely (maybe for the event sockets that don't get a chance to unregister in time?).

I'm also able to trigger this hang without a GPU reset by simply running `killall -9 Xwayland` a few times in rapid succession after starting a new session, which suggests this is not just a graphics reset issue. This method only works sometimes, though; it doesn't hang every time. (The hang itself also seems non-deterministic, since I've also managed to crash Xwayland with a graphics reset without it blocking anything.)

Some time ago you mentioned timing the blur effect out when it takes too long to execute. Would it be possible to do something similar for Xwayland, so that when it crashes and enters the zombie state, KWin can continue to function? Or would that break functionality in some X11 apps (like games that freeze for a bit while loading/processing shaders and might be killed unnecessarily by such a mechanism)?
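
The zombie state mentioned above can be detected without blocking. A minimal sketch, assuming Xwayland is a direct child of the compositor process and that merely peeking at its exit state is acceptable: waitid() with WNOHANG | WNOWAIT reports whether a child has terminated without reaping it. childHasExited is a hypothetical name, and this is not a description of how KWin actually tracks Xwayland.

```cpp
#include <cassert>
#include <sys/types.h>
#include <sys/wait.h>

// Hypothetical helper: returns true if the child process `pid` has already
// terminated (i.e. it is a zombie waiting to be reaped) or is no longer a
// child of this process. It never blocks and does not reap the child.
bool childHasExited(pid_t pid)
{
    assert(pid > 0);

    siginfo_t info{}; // zeroed, so si_pid == 0 means "no state change yet"
    const int rc = waitid(P_PID, static_cast<id_t>(pid), &info,
                          WEXITED | WNOHANG | WNOWAIT);
    if (rc != 0) {
        return true; // e.g. ECHILD: not our child, or already reaped elsewhere
    }
    return info.si_pid == pid; // set only once the child has terminated
}
```

A check like this only answers "is the process gone?"; it cannot tell a crashed Xwayland from one that is merely slow, which is the trade-off raised in the question above.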
Comment 19 Vlad Zahorodnii 2024-09-17 11:07:08 UTC
*** Bug 492428 has been marked as a duplicate of this bug. ***
Comment 20 Zamundaaa 2024-09-17 11:07:18 UTC

*** This bug has been marked as a duplicate of bug 442846 ***