SUMMARY
When a GPU reset is triggered (on amdgpu), the Wayland session successfully recovers (and so do all robust Wayland apps) if there are no X windows in the session. However, if an X app is launched after a GPU reset without restarting Xwayland first (by killing it and having it promptly restored by KWin), or if there are any X apps active when the reset occurs, the entire session hangs to the point where even input doesn't work (can't switch to another tty) and the only way to interact with the computer is to use SSH and kill KWin to restart it from there.

STEPS TO REPRODUCE
1. Reset the GPU with `cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover`
2. Attempt to launch an X app

or, alternatively:
1. Have an X app already open
2. Reset the GPU with `cat /sys/kernel/debug/dri/1/amdgpu_gpu_recover`

OBSERVED RESULT
The desktop session is frozen

EXPECTED RESULT
The desktop session recovers and all X apps are at worst killed and at best recovered if robust

SOFTWARE/OS VERSIONS
Linux: Arch 6.5.5-zen1-1-zen
KDE Plasma Version: 5.27.8
KDE Frameworks Version: 5.110.0
Qt Version: 5.15.10

ADDITIONAL INFORMATION
Having no X apps open during the reset allows restarting Xwayland which works without issue afterwards
Backtrace of kwin:

#0 0x00007fdfcad20ebf in poll () at /usr/lib/libc.so.6
#1 0x00007fdfcfb5820b in () at /usr/lib/libxcb.so.1
#2 0x00007fdfcfb58910 in () at /usr/lib/libxcb.so.1
#3 0x00007fdfcfb59b46 in xcb_wait_for_reply () at /usr/lib/libxcb.so.1
#4 0x00007fdfcec1ab7c in KWin::Xcb::AbstractWrapper<KWin::Xcb::PropertyData>::getReply() (this=0x7ffd1ddc18b0) at /home/david/projects/kde6/src/kde/workspace/kwin/src/utils/xcbutils.h:349
#5 0x00007fdfced3d4e9 in KWin::Xcb::AbstractWrapper<KWin::Xcb::PropertyData>::isNull() (this=0x7ffd1ddc18b0) at /home/david/projects/kde6/src/kde/workspace/kwin/src/utils/xcbutils.h:271
#6 0x00007fdfced1dc37 in KWin::readWindowProperty(unsigned int, unsigned int, unsigned int, int) (win=12582922, atom=388, type=6, format=32) at /home/david/projects/kde6/src/kde/workspace/kwin/src/effects.cpp:80
#7 0x00007fdfced23f6a in KWin::EffectWindowImpl::readProperty(long, long, int) const (this=0x5573dfd9ec20, atom=388, type=6, format=32) at /home/david/projects/kde6/src/kde/workspace/kwin/src/effects.cpp:2030
#8 0x00005573de5c010a in KWin::BlurEffect::updateBlurRegion(KWin::EffectWindow*) (this=0x5573e045a070, w=0x5573dfd9ec20) at /home/david/projects/kde6/src/kde/workspace/kwin/src/plugins/blur/blur.cpp:218
#9 0x00005573de5bf9e2 in KWin::BlurEffect::slotWindowAdded(KWin::EffectWindow*) (this=0x5573e045a070, w=0x5573dfd9ec20) at /home/david/projects/kde6/src/kde/workspace/kwin/src/plugins/blur/blur.cpp:273
#10 0x00005573de5bf47c in KWin::BlurEffect::BlurEffect() (this=0x5573e045a070) at /home/david/projects/kde6/src/kde/workspace/kwin/src/plugins/blur/blur.cpp:126
#11 0x00005573de5be582 in KWin::blur_factory::createEffect() const (this=0x5573e03fb930) at /home/david/projects/kde6/src/kde/workspace/kwin/src/plugins/blur/main.cpp:12
#12 0x00007fdfced0ac3b in KWin::PluginEffectLoader::loadEffect(KPluginMetaData const&, QFlags<KWin::LoadEffectFlag>) (this=0x5573dfe4d720, info=..., flags=...) at /home/david/projects/kde6/src/kde/workspace/kwin/src/effectloader.cpp:310
#13 0x00007fdfced0b0e4 in KWin::PluginEffectLoader::queryAndLoadAll() (this=0x5573dfe4d720) at /home/david/projects/kde6/src/kde/workspace/kwin/src/effectloader.cpp:331
#14 0x00007fdfced0b73e in KWin::EffectLoader::queryAndLoadAll() (this=0x5573e0f98420) at /home/david/projects/kde6/src/kde/workspace/kwin/src/effectloader.cpp:400
#15 0x00007fdfced1bc50 in KWin::EffectsHandlerImpl::reconfigure() (this=0x5573e13c49d0) at /home/david/projects/kde6/src/kde/workspace/kwin/src/effects.cpp:341
#16 0x00007fdfced18d96 in KWin::EffectsHandlerImpl::EffectsHandlerImpl(KWin::Compositor*, KWin::WorkspaceScene*) (this=0x5573e13c49d0, compositor=0x5573dff3d1b0, scene=0x5573e1493850) at /home/david/projects/kde6/src/kde/workspace/kwin/src/effects.cpp:241
#17 0x00007fdfcee39eee in KWin::Application::createEffectsHandler(KWin::Compositor*, KWin::WorkspaceScene*) (this=0x7ffd1ddc4e48, compositor=0x5573dff3d1b0, scene=0x5573e1493850) at /home/david/projects/kde6/src/kde/workspace/kwin/src/main.cpp:353
#18 0x00007fdfcec37f1c in KWin::Compositor::startupWithWorkspace() (this=0x5573dff3d1b0) at /home/david/projects/kde6/src/kde/workspace/kwin/src/compositor.cpp:344
#19 0x00007fdfcec4e09b in KWin::WaylandCompositor::start() (this=0x5573dff3d1b0) at /home/david/projects/kde6/src/kde/workspace/kwin/src/compositor_wayland.cpp:51
#20 0x00007fdfcec39686 in KWin::Compositor::reinitialize() (this=0x5573dff3d1b0) at /home/david/projects/kde6/src/kde/workspace/kwin/src/compositor.cpp:605
#21 0x00007fdfcec39906 in KWin::Compositor::composite(KWin::RenderLoop*) (this=0x5573dff3d1b0, renderLoop=0x5573e003f3b0) at /home/david/projects/kde6/src/kde/workspace/kwin/src/compositor.cpp:630

----

Xwayland's state:
warning: process 1112 is a zombie - the process has already terminated

1) We shouldn't have blocking X11 calls
2) This particular problem might go away when we move xwayland out
I don't know if this makes a difference, but note that Mesa 23.2.1 lacks an AMD GPU reset fix: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25023
I don't think that's relevant (though it may well be); it works with native Wayland apps, just not with XWayland (Xwayland blocks KWin). Then again, resetting the GPU does sometimes make KWin crash/hang even without XWayland; that seems to be a fix for something different.
I tested Mesa 23.3, as well as 24. The reset mechanism is now extremely buggy; the second reset always resets kwin, killing all apps. Is this a Mesa issue or a Plasma 5 issue (that's fixed in 6)?
It starts with

kwin_wayland[60753]: kwin_scene_opengl: Waiting for glGetGraphicsResetStatus to return GL_NO_ERROR timed out!

and ends with a long stack trace:

#0 0x00007f870c43f73d syscall (libc.so.6 + 0x10e73d)
#1 0x00007f87045126ce n/a (radeonsi_dri.so + 0x1126ce)
#2 0x00007f8704c9b201 n/a (radeonsi_dri.so + 0x89b201)
#3 0x00007f8704c59d08 n/a (radeonsi_dri.so + 0x859d08)
#4 0x00007f8704c5b6f3 n/a (radeonsi_dri.so + 0x85b6f3)
#5 0x00007f870456dde1 n/a (radeonsi_dri.so + 0x16dde1)
#6 0x00007f87059043ae n/a (radeonsi_dri.so + 0x15043ae)
#7 0x00007f8704561c3f n/a (radeonsi_dri.so + 0x161c3f)
#8 0x00007f8704561d08 n/a (radeonsi_dri.so + 0x161d08)
#9 0x00007f870eb99ce4 _ZN4KWin9GLTextureC2Ejiiib (libkwinglutils.so.14 + 0x11ce4)
#10 0x0000557a93745f7f n/a (kwin_wayland + 0xbcf7f)
#11 0x0000557a93746738 n/a (kwin_wayland + 0xbd738)
#12 0x0000557a9374b36d n/a (kwin_wayland + 0xc236d)
#13 0x0000557a9374c5eb n/a (kwin_wayland + 0xc35eb)
#14 0x00007f870edf9633 n/a (libkwin.so.5 + 0x1f9633)
#15 0x00007f870edf9c22 n/a (libkwin.so.5 + 0x1f9c22)
#16 0x00007f870edf1710 _ZN4KWin12EffectLoader15queryAndLoadAllEv (libkwin.so.5 + 0x1f1710)
#17 0x00007f870ee04146 _ZN4KWin18EffectsHandlerImplC1EPNS_10CompositorEPNS_14WorkspaceSceneE (libkwin.so.5 + 0x204146)
#18 0x00007f870edc1302 _ZN4KWin10Compositor20startupWithWorkspaceEv (libkwin.so.5 + 0x1c1302)
#19 0x00007f870edb9114 _ZN4KWin10Compositor12reinitializeEv (libkwin.so.5 + 0x1b9114)
#20 0x00007f870d4d1097 n/a (libQt5Core.so.5 + 0x2d1097)
#21 0x00007f870ed716d7 _ZN4KWin10RenderLoop14frameRequestedEPS0_ (libkwin.so.5 + 0x1716d7)
#22 0x00007f870edc8598 _ZN4KWin17RenderLoopPrivate8dispatchEv (libkwin.so.5 + 0x1c8598)
#23 0x00007f870d4d1097 n/a (libQt5Core.so.5 + 0x2d1097)
#24 0x00007f870d4d2bcf _ZN6QTimer7timeoutENS_14QPrivateSignalE (libQt5Core.so.5 + 0x2d2bcf)
#25 0x00007f870d4c3b4e _ZN7QObject5eventEP6QEvent (libQt5Core.so.5 + 0x2c3b4e)
#26 0x00007f870cb788ff _ZN19QApplicationPrivate13notify_helperEP7QObjectP6QEvent (libQt5Widgets.so.5 + 0x1788ff)
#27 0x00007f870d49c168 _ZN16QCoreApplication15notifyInternal2EP7QObjectP6QEvent (libQt5Core.so.5 + 0x29c168)
#28 0x00007f870d4ea7cb _ZN14QTimerInfoList14activateTimersEv (libQt5Core.so.5 + 0x2ea7cb)
#29 0x00007f870d4eacb1 _ZN20QEventDispatcherUNIX13processEventsE6QFlagsIN10QEventLoop17ProcessEventsFlagEE (libQt5Core.so.5 + 0x2ea>
#30 0x0000557a937c0ce2 n/a (kwin_wayland + 0x137ce2)
#31 0x00007f870d49ae74 _ZN10QEventLoop4execE6QFlagsINS_17ProcessEventsFlagEE (libQt5Core.so.5 + 0x29ae74)
#32 0x00007f870d49c313 _ZN16QCoreApplication4execEv (libQt5Core.so.5 + 0x29c313)
#33 0x0000557a936dc40b n/a (kwin_wayland + 0x5340b)
#34 0x00007f870c358cd0 n/a (libc.so.6 + 0x27cd0)
#35 0x00007f870c358d8a __libc_start_main (libc.so.6 + 0x27d8a)
#36 0x0000557a936de015 n/a (kwin_wayland + 0x55015)

Looks like a Mesa problem, but unsure.
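For readers unfamiliar with the log message above: it suggests a "poll the reset status until it clears, or give up after a deadline" loop. The following is only a minimal sketch of that pattern, not KWin's actual code. The helper name `waitUntil`, the polling interval, and the callable `ready` are all hypothetical; `ready` stands in for a check such as "glGetGraphicsResetStatus() returned GL_NO_ERROR", and no GL call is made here.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: polls `ready` until it returns true or `timeout` elapses.
// Returns true if the condition was met, false if the deadline passed first.
// `ready` is a stand-in for a real status check (for example a GL reset-status
// query); this sketch deliberately does not touch any graphics API.
inline bool waitUntil(const std::function<bool()> &ready,
                      std::chrono::milliseconds timeout,
                      std::chrono::milliseconds interval = std::chrono::milliseconds(1))
{
    assert(timeout.count() >= 0 && interval.count() >= 0);
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (true) {
        if (ready()) {
            return true; // condition cleared before the deadline
        }
        if (std::chrono::steady_clock::now() >= deadline) {
            return false; // caller decides what to do on timeout
        }
        std::this_thread::sleep_for(interval);
    }
}
```

The point of the sketch is that the timeout path returns control to the caller, so what happens next (tearing down and recreating the context, as the trace above seems to show) is a separate decision.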
Also, the first GPU reset is *extremely* slow with newer Mesa.
I've just seen KWin hang on a GPU reset as well, and also in the blur effect. registerSupportProperty in effecthandler.cpp should probably not block on xcb, or at least have a timeout of some sort. There's also the question of what Xwayland is doing that makes us hang in xcb; it didn't actually crash for me (or at least there are no coredumps). I sadly didn't check its backtrace before rebooting the PC, and I haven't been able to reproduce it (with intentional GPU resets) since then.
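To make the "timeout of some sort" idea concrete, here is a rough sketch of a bounded wait on a file descriptor using plain POSIX poll(). It is not KWin code and does not use libxcb: the helper name `waitReadable` is hypothetical, and wiring it to a real X11 connection (for example via `xcb_get_file_descriptor` and `xcb_poll_for_reply`) is an assumption about how one might apply it, not something shown here.

```cpp
#include <cassert>
#include <cerrno>
#include <chrono>
#include <poll.h>
#include <unistd.h>

// Hypothetical helper: waits until `fd` is readable (or hung up) or until
// `timeoutMs` milliseconds have passed. Returns true if there is something
// to read or the peer hung up, false on timeout or on a poll() error.
inline bool waitReadable(int fd, int timeoutMs)
{
    assert(timeoutMs >= 0);
    const auto deadline = std::chrono::steady_clock::now()
        + std::chrono::milliseconds(timeoutMs);
    while (true) {
        const auto left = std::chrono::duration_cast<std::chrono::milliseconds>(
            deadline - std::chrono::steady_clock::now()).count();
        pollfd pfd{};
        pfd.fd = fd;
        pfd.events = POLLIN;
        const int rc = poll(&pfd, 1, left > 0 ? static_cast<int>(left) : 0);
        if (rc > 0) {
            return (pfd.revents & (POLLIN | POLLHUP)) != 0;
        }
        if (rc == 0) {
            return false; // deadline reached, nothing arrived
        }
        if (errno != EINTR) {
            return false; // real error: treat like "no reply"
        }
        // Interrupted by a signal: retry with the remaining time.
    }
}
```

Unlike the `xcb_wait_for_reply` frame in the backtrace, a wait of this shape always returns, so the caller can fall back to "no property" instead of freezing the compositor.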
As funny as that sounds, you can try using HIP with Blender to trigger a real reset (https://projects.blender.org/blender/blender/issues/100353, this is still not fixed...):
1. Open Blender
2. Select the Cycles HIP backend and your AMD GPU
3. Hit F12
4. Use the viewport renderer at the same time
5. This should almost instantly trigger a reset
I've just had an idea: would it be possible to kill/restart Xwayland (togglable somehow with an env var or something similar) on a reset? That would be much better than a frozen desktop session.
This is no longer an issue as of 6.0.2.
Oops, unfortunately not, I ran into this issue again after I wrote a fragment shader that times out the GPU:

    float t = mix(0.0, 1.0, sdf);
    for (float i = 0.0; i < 700000000.0; i += 0.01) {
        t += distance(gl_FragCoord.xy, aspect);
    }

I had VSCode (Xwayland) open and while the reset succeeded, KWin was completely stuck and I had to kill it.
Having this problem on Plasma 6.
To anyone who does shader dev and is worried about AMD GPU resets, I recommend applying the relevant patches for your GPUs from here: https://patchwork.freedesktop.org/series/136246/ I did just that, now I cannot get the card to reset at all, it just kicks the faulty job out of the scheduler and everything else continues to work fine.
Which kernel series does that patch apply cleanly to? I can't get it to apply to 6.10. Does it require drm-next? linux-next?
(In reply to fililip from comment #13)
> To anyone who does shader dev and is worried about AMD GPU resets, I
> recommend applying the relevant patches for your GPUs from here:
> https://patchwork.freedesktop.org/series/136246/
>
> I did just that, now I cannot get the card to reset at all, it just kicks
> the faulty job out of the scheduler and everything else continues to work
> fine.

Never mind. This patch set is broken, and causes a double add on GPU resets. I needed to locate the Git repository the up to date version exists in and pull that.
Oh, it looks like this issue is already tracked on Xorg's GitLab:
https://gitlab.freedesktop.org/xorg/xserver/-/issues/1612

For this reason maybe it's better to mark this issue as upstream?
(In reply to fililip from comment #16)
> Oh, it looks like this issue is already tracked on Xorg's GitLab:
> https://gitlab.freedesktop.org/xorg/xserver/-/issues/1612
>
> For this reason maybe it's better to mark this issue as upstream?

No, the upstream thing is about Xwayland recovering, but KWin still shouldn't hang just because Xwayland crashes
Yeah, you're right, that makes sense.

I came back to this issue, played with resets a bit more and noticed something odd with this hang:
1) I started a new, clean session,
2) I launched vkcube,
3) I triggered a reset with debugfs, the desktop recovered properly, the vkcube window became stuck, but no hang yet (that's why you might have gotten no coredump before),
4) I then attempted to move the vkcube window (with the Super + left mouse button combo) and immediately (a few frames later) got a hang. I presume this is when Xwayland crashed and became a zombie process.

This feels like KWin is trying to process an event for Xwayland but fails and waits indefinitely (maybe for the event sockets that don't get a chance to unregister in time?).

I'm also able to trigger this hang without a GPU reset by simply doing killall -9 Xwayland a few times rapidly after starting a new session, which suggests this is not just a graphics reset issue. Though this method only works sometimes, it doesn't hang all the time. (But the hang itself also seems non-deterministic, since I've also managed to crash Xwayland with a graphics reset without it blocking anything.)

Some time ago you mentioned timing the blur effect out when it takes too long to execute. Would it be possible to do something similar for Xwayland, so that when it crashes and enters the zombie state, KWin can continue to function, or would that break functionality in some X11 apps (like games that freeze for a bit when loading/processing shaders that might be unnecessarily killed by such mechanism)?
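One way to tell "the peer is slow" apart from "the peer is gone", which is the worry about games freezing during shader compilation, is to check whether the other end of the socket has closed instead of relying on elapsed time alone. The sketch below shows that check on a generic Unix stream socket. It is an assumption that this would map onto the KWin/Xwayland connection; `peerGone` is a hypothetical helper and nothing here is taken from KWin.

```cpp
#include <cassert>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: returns true if the peer of stream socket `fd` has
// closed its end and there is no unread data left. Never blocks: poll() is
// called with a zero timeout and recv() only peeks in non-blocking mode.
inline bool peerGone(int fd)
{
    assert(fd >= 0);
    pollfd pfd{};
    pfd.fd = fd;
    pfd.events = POLLIN;
    if (poll(&pfd, 1, 0) <= 0) {
        return false; // nothing pending: a slow peer looks exactly like this
    }
    if (pfd.revents & POLLERR) {
        return true; // socket error: treat the connection as dead
    }
    if (pfd.revents & POLLIN) {
        char byte;
        // A peek that returns 0 bytes means end-of-file, i.e. the peer closed.
        // If real data is still queued, report "not gone" so it can be read.
        const ssize_t n = recv(fd, &byte, 1, MSG_PEEK | MSG_DONTWAIT);
        return n == 0;
    }
    return (pfd.revents & POLLHUP) != 0;
}
```

A process that is merely busy keeps its socket open, so a check like this would not kill a game that stalls while loading; only a closed or errored connection is reported as gone.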
*** Bug 492428 has been marked as a duplicate of this bug. ***
*** This bug has been marked as a duplicate of bug 442846 ***