Bug 479845

Summary:	The current Wayland GPU recovery experience (AMD) is not ideal with AMS disabled
Product:	[Plasma] kwin
Component:	wayland-generic
Status:	RESOLVED FIXED
Severity:	normal
Priority:	NOR
Version First Reported In:	5.92.0
Target Milestone:	---
Platform:	Arch Linux
OS:	Linux
Reporter:	fililip <team>
Assignee:	KWin default assignee <kwin-bugs-null>
CC:	agurenko, kde, nate, xaver.hugl
Keywords:	qt6
Latest Commit:
Version Fixed In:
Sentry Crash Report:

Description fililip 2024-01-15 13:50:32 UTC

SUMMARY
GPU recovery on Wayland (amdgpu) now either works too slowly, doesn't actually recover (forcing a compositor restart), or hangs the input system, in which case the only way out is to SSH into the machine and SIGKILL kwin_wayland to restart the compositor.

That is with one display attached (1080p 165Hz VRR). With two (1080p 165Hz VRR & 1080p 60Hz non-VRR), the compositor almost never recovers: it either hangs the input system (forcing me to SSH in) or restarts itself, making apps lose progress (those that do not support compositor handoff, I presume, since Konsole stays up just fine). When that happens, dmesg shows no gfxhub page faults, only two gfx timeouts and DRM commit failures.

What's more, it used to work just fine on KWin 5.27.5 and Mesa 23.2: recovery was fast enough, and it worked even with two displays. (Unfortunately, back then it was possible for a faulty app to reset the card in a way that did not aid recovery, and there was some kind of VRAM leak.)

dmesg log after the reset completes (one display):
[  377.569608] amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32770, for process kwin_wayland pid 4036 thread kwin_wayla:cs0 pid 4068)
[  377.569613] amdgpu 0000:0b:00.0: amdgpu:   in page starting at address 0x0000800010000000 from client 0x1b (UTCL2)
[  377.569615] amdgpu 0000:0b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00201431
[  377.569616] amdgpu 0000:0b:00.0: amdgpu:      Faulty UTCL2 client ID: SQC (data) (0xa)
[  377.569617] amdgpu 0000:0b:00.0: amdgpu:      MORE_FAULTS: 0x1
[  377.569617] amdgpu 0000:0b:00.0: amdgpu:      WALKER_ERROR: 0x0
[  377.569618] amdgpu 0000:0b:00.0: amdgpu:      PERMISSION_FAULTS: 0x3
[  377.569618] amdgpu 0000:0b:00.0: amdgpu:      MAPPING_ERROR: 0x0
[  377.569619] amdgpu 0000:0b:00.0: amdgpu:      RW: 0x0
[  377.569622] amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32770, for process kwin_wayland pid 4036 thread kwin_wayla:cs0 pid 4068)
[  377.569624] amdgpu 0000:0b:00.0: amdgpu:   in page starting at address 0x0000800010000000 from client 0x1b (UTCL2)
[  377.569625] amdgpu 0000:0b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  377.569625] amdgpu 0000:0b:00.0: amdgpu:      Faulty UTCL2 client ID: CB/DB (0x0)
[  377.569626] amdgpu 0000:0b:00.0: amdgpu:      MORE_FAULTS: 0x0
[  377.569626] amdgpu 0000:0b:00.0: amdgpu:      WALKER_ERROR: 0x0
[  377.569627] amdgpu 0000:0b:00.0: amdgpu:      PERMISSION_FAULTS: 0x0
[  377.569627] amdgpu 0000:0b:00.0: amdgpu:      MAPPING_ERROR: 0x0
[  377.569628] amdgpu 0000:0b:00.0: amdgpu:      RW: 0x0
[  377.569631] amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32770, for process kwin_wayland pid 4036 thread kwin_wayla:cs0 pid 4068)
[  377.569632] amdgpu 0000:0b:00.0: amdgpu:   in page starting at address 0x0000800010000000 from client 0x1b (UTCL2)
[  377.569633] amdgpu 0000:0b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  377.569634] amdgpu 0000:0b:00.0: amdgpu:      Faulty UTCL2 client ID: CB/DB (0x0)
[  377.569634] amdgpu 0000:0b:00.0: amdgpu:      MORE_FAULTS: 0x0
[  377.569635] amdgpu 0000:0b:00.0: amdgpu:      WALKER_ERROR: 0x0
[  377.569635] amdgpu 0000:0b:00.0: amdgpu:      PERMISSION_FAULTS: 0x0
[  377.569636] amdgpu 0000:0b:00.0: amdgpu:      MAPPING_ERROR: 0x0
[  377.569636] amdgpu 0000:0b:00.0: amdgpu:      RW: 0x0
[  388.011857] amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32770, for process kwin_wayland pid 4036 thread kwin_wayla:cs0 pid 4068)
[  388.011882] amdgpu 0000:0b:00.0: amdgpu:   in page starting at address 0x0000800010000000 from client 0x1b (UTCL2)
[  388.011894] amdgpu 0000:0b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00201431
[  388.011900] amdgpu 0000:0b:00.0: amdgpu:      Faulty UTCL2 client ID: SQC (data) (0xa)
[  388.011905] amdgpu 0000:0b:00.0: amdgpu:      MORE_FAULTS: 0x1
[  388.011909] amdgpu 0000:0b:00.0: amdgpu:      WALKER_ERROR: 0x0
[  388.011913] amdgpu 0000:0b:00.0: amdgpu:      PERMISSION_FAULTS: 0x3
[  388.011916] amdgpu 0000:0b:00.0: amdgpu:      MAPPING_ERROR: 0x0
[  388.011919] amdgpu 0000:0b:00.0: amdgpu:      RW: 0x0
[  388.011932] amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32770, for process kwin_wayland pid 4036 thread kwin_wayla:cs0 pid 4068)
[  388.011942] amdgpu 0000:0b:00.0: amdgpu:   in page starting at address 0x0000800010000000 from client 0x1b (UTCL2)
[  388.011949] amdgpu 0000:0b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  388.011953] amdgpu 0000:0b:00.0: amdgpu:      Faulty UTCL2 client ID: CB/DB (0x0)
[  388.011958] amdgpu 0000:0b:00.0: amdgpu:      MORE_FAULTS: 0x0
[  388.011961] amdgpu 0000:0b:00.0: amdgpu:      WALKER_ERROR: 0x0
[  388.011965] amdgpu 0000:0b:00.0: amdgpu:      PERMISSION_FAULTS: 0x0
[  388.011968] amdgpu 0000:0b:00.0: amdgpu:      MAPPING_ERROR: 0x0
[  388.011971] amdgpu 0000:0b:00.0: amdgpu:      RW: 0x0
[  388.011980] amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32770, for process kwin_wayland pid 4036 thread kwin_wayla:cs0 pid 4068)
[  388.011988] amdgpu 0000:0b:00.0: amdgpu:   in page starting at address 0x0000800010000000 from client 0x1b (UTCL2)
[  388.011993] amdgpu 0000:0b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  388.011997] amdgpu 0000:0b:00.0: amdgpu:      Faulty UTCL2 client ID: CB/DB (0x0)
[  388.012001] amdgpu 0000:0b:00.0: amdgpu:      MORE_FAULTS: 0x0
[  388.012004] amdgpu 0000:0b:00.0: amdgpu:      WALKER_ERROR: 0x0
[  388.012007] amdgpu 0000:0b:00.0: amdgpu:      PERMISSION_FAULTS: 0x0
[  388.012010] amdgpu 0000:0b:00.0: amdgpu:      MAPPING_ERROR: 0x0
[  388.012013] amdgpu 0000:0b:00.0: amdgpu:      RW: 0x0
[  388.012022] amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32770, for process kwin_wayland pid 4036 thread kwin_wayla:cs0 pid 4068)
[  388.012029] amdgpu 0000:0b:00.0: amdgpu:   in page starting at address 0x0000800010000000 from client 0x1b (UTCL2)
[  388.012034] amdgpu 0000:0b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  388.012037] amdgpu 0000:0b:00.0: amdgpu:      Faulty UTCL2 client ID: CB/DB (0x0)
[  388.012040] amdgpu 0000:0b:00.0: amdgpu:      MORE_FAULTS: 0x0
[  388.012044] amdgpu 0000:0b:00.0: amdgpu:      WALKER_ERROR: 0x0
[  388.012047] amdgpu 0000:0b:00.0: amdgpu:      PERMISSION_FAULTS: 0x0
[  388.012050] amdgpu 0000:0b:00.0: amdgpu:      MAPPING_ERROR: 0x0
[  388.012053] amdgpu 0000:0b:00.0: amdgpu:      RW: 0x0
[  388.012062] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered

With two displays, the GFX ring has a very low chance of soft recovery. When soft recovery fails (alongside DRM/CRTC commit failure messages), an additional reset follows which destroys the session entirely, and getting the machine back then *requires* SSHing into it.
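
To compare logs between the one-display and two-display cases, a dmesg excerpt can be condensed into a few counts. The following is only a rough sketch: the regular expressions are written against the exact message wording quoted above (other kernel versions may phrase these lines differently), and the function name `summarize_dmesg` is made up for illustration.

```python
import re

# Patterns follow the amdgpu messages quoted in this report. Assumption:
# other kernel versions may word these lines differently.
FAULT_RE = re.compile(r"\[gfxhub\] page fault .*for process (\S+) pid (\d+)")
STATUS_RE = re.compile(r"GCVM_L2_PROTECTION_FAULT_STATUS:(0x[0-9a-fA-F]+)")
TIMEOUT_RE = re.compile(r"ring (\S+) timeout(, but soft recovered)?")


def summarize_dmesg(text):
    """Count page faults and ring timeouts in a dmesg excerpt."""
    summary = {
        "page_faults": 0,       # "[gfxhub] page fault" lines
        "nonzero_status": 0,    # fault status registers that are not 0x0
        "processes": set(),     # (name, pid) pairs blamed for the faults
        "timeouts": 0,          # "ring ... timeout" lines
        "soft_recovered": 0,    # timeouts that ended in soft recovery
    }
    for line in text.splitlines():
        if m := FAULT_RE.search(line):
            summary["page_faults"] += 1
            summary["processes"].add((m.group(1), int(m.group(2))))
        elif m := STATUS_RE.search(line):
            if int(m.group(1), 16) != 0:
                summary["nonzero_status"] += 1
        elif m := TIMEOUT_RE.search(line):
            summary["timeouts"] += 1
            if m.group(2):
                summary["soft_recovered"] += 1
    return summary
```

Feeding it the output of `dmesg` after a reset gives the number of timeouts and how many of them soft recovered, which is the figure that differs between the two setups described here.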

STEPS TO REPRODUCE
1. Start the Plasma desktop
2. Invoke `sudo cat /sys/kernel/debug/dri/X/amdgpu_gpu_recover` (where X is the DRM device number) or get an app to cause a gfx timeout
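
To find out which value of X applies on a given machine, the debugfs directory can be searched for the recovery file. This is a minimal sketch that assumes debugfs is mounted at /sys/kernel/debug and that the script runs with enough privileges to look inside it (normally root); the function name is hypothetical. It only lists candidate paths and never reads them, because reading the file is what triggers the reset.

```python
from pathlib import Path


def find_gpu_recover_files(debugfs_root="/sys/kernel/debug/dri"):
    """Return amdgpu_gpu_recover paths found under the DRM debugfs directory.

    Only lists the files. Reading one of them triggers a GPU reset, so that
    step is deliberately left to the caller.
    """
    try:
        return sorted(Path(debugfs_root).glob("*/amdgpu_gpu_recover"))
    except OSError:
        # debugfs not mounted, or not readable without root.
        return []


if __name__ == "__main__":
    for path in find_gpu_recover_files():
        print(path)
```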

OBSERVED RESULT
Recovery either occurs too slowly or doesn't occur at all, as described in "Summary".

EXPECTED RESULT
The desktop should recover correctly and all robust apps should continue to function.

SOFTWARE/OS VERSIONS
Operating System: Arch Linux 
KDE Plasma Version: 5.92.0
KDE Frameworks Version: 5.248.0
Qt Version: 6.7.0
Kernel Version: 6.7.0-zen3-1-zen (64-bit)
Graphics Processor: AMD Radeon RX 6600 XT

ADDITIONAL INFORMATION
This Mesa issue that I reported around 2 months ago might be relevant: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10124, but I'm unsure whether it's actually a Mesa problem, or a KWin problem.

Additionally, the Xwayland hang caused by GPU resets (https://bugs.kde.org/show_bug.cgi?id=475322) is still present (running vkcube after a reset hangs the session), but that is its own issue.

Comment 1 fililip 2024-01-16 21:55:49 UTC

Don't know if it's related to the issue (sorry if it isn't), but I tried applying this MR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27097 on top of Mesa 24.0.0-rc1 (and the linked patches against Linux 6.7) and resetting the GPU.

Now it looks completely broken; the gfx ring keeps soft recovering (with one display; with two the whole machine is frozen and even SSH doesn't work) but KWin keeps reset-looping the GPU. The entire session is unusable and requires SIGKILLing KWin a bunch of times to return to TTY.

Comment 2 David Edmundson 2024-01-17 15:36:55 UTC

GPU reset handling is something that's very WIP throughout the stack, as you saw from the fact that we have so many pending requests. The fact that we get a second reset implies the problem is lower in the stack. This feels like an upstream problem so far.

Would it be better if kwin exited after N resets?

Comment 3 fililip 2024-01-17 20:00:58 UTC

> GPU reset handling is something that's very WIP throughout the stack
By the stack do you mean kwin, upstream, or both? I thought kwin already had GPU recovery support for Wayland, though I might be wrong (unless what's currently present is experimental and that's why it's so hit-or-miss). What is the state of Wayland GPU recovery on other vendors' GPUs though? Does Intel work better? (asking out of curiosity)

> Would it be better if kwin exited after N resets?
Perhaps, if they happened too close to one another in time.
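
The "exit after N resets, but only if they are close together" idea amounts to a sliding-window counter. The sketch below is purely illustrative and is not how KWin implements anything: the class name, the default of 3 resets, and the 30-second window are all made-up values.

```python
import time
from collections import deque


class ResetLimiter:
    """Hypothetical helper: report when resets cluster inside a time window."""

    def __init__(self, max_resets=3, window_seconds=30.0, clock=time.monotonic):
        # max_resets and window_seconds are arbitrary illustrative defaults.
        self.max_resets = max_resets
        self.window_seconds = window_seconds
        self.clock = clock
        self._stamps = deque()

    def record_reset(self):
        """Record one GPU reset; return True if the compositor should give up."""
        now = self.clock()
        self._stamps.append(now)
        # Drop resets that are older than the window. The newest entry is
        # always kept, so the deque never becomes empty here.
        while now - self._stamps[0] > self.window_seconds:
            self._stamps.popleft()
        return len(self._stamps) >= self.max_resets
```

With these defaults, three resets within 30 seconds return True on the third call, while resets spaced minutes apart never do, so an occasional isolated reset would still be recovered from normally.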

Comment 4 Zamundaaa 2024-01-19 00:47:52 UTC

KWin has very good GPU reset handling, which I've tested a lot, both voluntarily and involuntarily (amdgpu's been way too reset-happy the last two weeks or so). The problem is further up in the stack; specifically, amdgpu isn't too great at GPU resets, and Mesa interpreted the spec incorrectly for this until recently as well.

> What is the state of Wayland GPU recovery on other vendors' GPUs though? Does Intel work better?
It is better with both Intel and NVidia, as they only reset the affected app most of the time. amdgpu will gain the ability to do the same soon though.

Comment 5 fililip 2024-01-19 13:29:38 UTC

Oh, I noticed one important thing: I was using the legacy DRM API (KWIN_DRM_NO_AMS=1) for tearing support. After disabling it, everything works fine on the latest Mesa 24.1 (dev), even on Plasma 5. Sorry for the trouble.
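
A quick way to catch a leftover KWIN_DRM_NO_AMS before filing a similar report is to check the environment for it. This sketch only reports whether the variable is present in the mapping it is given (the current process environment by default). How KWin itself parses the value is not covered in this thread, so no assumption is made about which values count as "enabled", and the running compositor may have been started with a different environment than the shell running the check.

```python
import os


def no_ams_value(environ=os.environ):
    """Return the value of KWIN_DRM_NO_AMS in `environ`, or None if unset."""
    return environ.get("KWIN_DRM_NO_AMS")


if __name__ == "__main__":
    value = no_ams_value()
    if value is None:
        print("KWIN_DRM_NO_AMS is not set in this environment")
    else:
        # Assumption: any set value is worth flagging for a closer look.
        print(f"KWIN_DRM_NO_AMS={value!r} is set; legacy modesetting may be in use")
```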

> amdgpu will gain the ability to do the same soon though

That's amazing news! Thank you for your continued effort.

Comment 6 Nate Graham 2024-01-19 17:12:47 UTC

Thanks!

Should we keep this open to improve the non-AMS experience, or say that if you want a good experience, you just need to have AMS enabled?

Comment 7 fililip 2024-01-20 10:23:11 UTC

Personally, I don't think there's a point to maintaining legacy modesetting (even if there's still no tearing support for atomic), unless it's actually used on some devices (it's great for my non-VRR laptop; with a 60Hz display I can handle the broken recovery mechanism since tearing is much better than stuttering in my opinion).

The reason I had it on in the first place was to test a bunch of games with both VRR and tearing on to see what frame rate limit I should set to avoid sporadic frametime jumps below 6.06ms which induced tearing/stuttering. After I was happy with 160 FPS, I forgot to unset the environment variable and that's how I got the issue.
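
For reference, the 6.06ms figure matches the refresh period of a 165Hz display, and a 160 FPS cap keeps each frame slightly longer than that. The arithmetic, as a small sketch (reading 6.06ms as the 165Hz refresh period is an inference from the numbers, not something stated above):

```python
def frame_time_ms(rate_hz):
    """Duration of one frame in milliseconds at the given rate."""
    return 1000.0 / rate_hz


refresh_period = frame_time_ms(165)     # about 6.06 ms per refresh
capped_frame_time = frame_time_ms(160)  # 6.25 ms per frame at the cap
headroom = capped_frame_time - refresh_period  # about 0.19 ms of margin

if __name__ == "__main__":
    print(f"165 Hz refresh period: {refresh_period:.2f} ms")
    print(f"160 FPS frame time:    {capped_frame_time:.2f} ms")
    print(f"headroom:              {headroom:.2f} ms")
```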

Comment 8 Nate Graham 2024-01-23 19:18:34 UTC

All right, thanks!