Bug 419242 - KWin doesn't recover after GPU reset
Status: CLOSED NOT A BUG
Alias: None
Product: kwin
Classification: Plasma
Component: general
Version: unspecified
Platform: unspecified Linux
Importance: NOR normal
Target Milestone: ---
Assignee: KWin default assignee
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-03-25 19:28 UTC by Shmerl
Modified: 2020-06-04 14:39 UTC

See Also:
Latest Commit:
Version Fixed In:


Description Shmerl 2020-03-25 19:28:18 UTC
A GPU is a complex processing unit, and some graphics commands can hang it because of various "hazards". Graphics drivers normally work around these, but regressions or bugs can still expose them.

When such a hang happens, the kernel driver can detect it and trigger a GPU reset to bring the GPU back into a usable state.

Currently, when that happens, KWin doesn't recover from it. The result is a desktop hang that requires at best restarting sddm, or at worst a full reboot. Either way, the desktop session is lost.

Is there some way to make KWin and the whole session robust enough to handle GPU resets properly, recovering the session on the fly?
Comment 1 David Edmundson 2020-03-25 21:33:48 UTC
We read glGetGraphicsResetStatus and reset appropriately.

If you have any more specific information please provide it and reopen the report.
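
For readers unfamiliar with the mechanism mentioned in this comment, here is a minimal sketch of the polling pattern. It is a Python model, not KWin's actual code: `query_status`, `draw` and `recreate_context` are hypothetical callables, with `query_status` standing in for a real call to glGetGraphicsResetStatus. The numeric values are the ones defined for the robustness reset statuses in the OpenGL headers.

```python
# Reset status values as defined in the OpenGL 4.5 / GL_KHR_robustness headers.
GL_NO_ERROR = 0
GL_GUILTY_CONTEXT_RESET = 0x8253
GL_INNOCENT_CONTEXT_RESET = 0x8254
GL_UNKNOWN_CONTEXT_RESET = 0x8255

RESET_NAMES = {
    GL_GUILTY_CONTEXT_RESET: "GL_GUILTY_CONTEXT_RESET",
    GL_INNOCENT_CONTEXT_RESET: "GL_INNOCENT_CONTEXT_RESET",
    GL_UNKNOWN_CONTEXT_RESET: "GL_UNKNOWN_CONTEXT_RESET",
}


def check_reset(query_status):
    """Return None if the context is healthy, else the name of the reset reason.

    query_status is a hypothetical zero-argument callable standing in for
    glGetGraphicsResetStatus().
    """
    status = query_status()
    if status == GL_NO_ERROR:
        return None
    return RESET_NAMES.get(status, "unrecognised status 0x%04x" % status)


def render_frame(query_status, draw, recreate_context):
    """Draw one frame, or start recovery if the context was lost.

    draw and recreate_context are hypothetical callbacks supplied by the
    compositor. Returns True if a frame was drawn, False if recovery ran.
    """
    reason = check_reset(query_status)
    if reason is not None:
        # After a reset the old context and all of its objects are unusable,
        # so everything has to be rebuilt on a fresh context.
        recreate_context(reason)
        return False
    draw()
    return True
```

In a real compositor the status is only reported if the context was created with a reset notification strategy that loses the context on reset. Otherwise the query keeps returning GL_NO_ERROR.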
Comment 2 Shmerl 2020-03-25 22:55:06 UTC
(In reply to David Edmundson from comment #1)
> We read glGetGraphicsResetStatus  and reset appropriately.
> 
> If you have any more specific information please provide it and reopen the
> report.

What exactly is supposed to happen during such a reset? In my case, using Firefox caused a GPU hang. The driver reset it, but the whole desktop was still frozen and didn't recover (things were garbled all over), so I assumed KWin doesn't support it yet. If it does, then something must be going wrong during recovery.

Configuration:

Debian testing
Plasma / kwin: 5.17.5 (compositing with OpenGL 3.1)
GPU: Sapphire Pulse RX 5700 XT
Kernel: 5.6-rc7
Mesa: 20.0.2
llvm: 9.0.1
Comment 3 David Edmundson 2020-03-25 22:59:14 UTC
>The driver reset it,

How do you know?

Did a restart of kwin_x11 fix it?
Comment 4 Shmerl 2020-03-25 23:02:06 UTC
(In reply to David Edmundson from comment #3)
> >The driver reset it,
> 
> How do you know?
> 
> Did a restart of kwin_x11 fix it?

Because it didn't cause a hard hang, like it used to before amdgpu implemented reset for Navi recently. I'll try restarting kwin_x11 next time this happens. What is the right way to do it?
Comment 5 Shmerl 2020-03-25 23:05:53 UTC
Also, when a reset happens, there is a brief period when everything is completely frozen, and then it starts reacting to input again, and this is what happened here.

It even reacted on me opening krunner with Alt+F2, but all visuals were garbled.
Comment 6 David Edmundson 2020-03-25 23:29:02 UTC
>but the whole desktop was still frozen

>It even reacted on me opening krunner with Alt+F2, but all visuals were garbled.


It can't be both.

>What is the right way to do it?

kwin_x11 --replace
Comment 7 Shmerl 2020-03-25 23:33:59 UTC
It reacted in that something was happening on the screen in response to Alt+F2, but it was garbled.

> kwin_x11 --replace

Thanks, I'll give it a try.
Comment 8 Shmerl 2020-03-26 17:56:44 UTC
OK, it happened again, and I can tell it was a reset, since I see it in dmesg (see below).

The session got corrupted again, but was reacting to input in a garbled fashion, as before.

I switched to a tty and ran:

kwin_x11 --replace

The result was:

kwin_x11: FATAL ERROR while trying to open display.

I probably should have set the DISPLAY variable?

After that, the session was interrupted and fell out into the sddm login.

-----------------------
    [16346.087369] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
    [16346.353602] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
    [16346.614280] [drm:gfx_v10_0_cp_gfx_enable [amdgpu]] *ERROR* failed to halt cp gfx
    [16346.679190] pcieport 0000:00:03.2: PME: Spurious native interrupt!
    [16346.695659] snd_hda_intel 0000:0f:00.1: refused to change power state from D3hot to D0
    [16346.799944] snd_hda_intel 0000:0f:00.1: CORB reset timeout#2, CORBRP = 65535
    [16347.079683] snd_hda_codec_hdmi hdaudioC0D0: Unable to sync register 0x2f0d00. -5
    [16349.803271] amdgpu 0000:0f:00.0: GPU reset succeeded, trying to resume
    [16349.803422] [drm] PCIE GART of 512M enabled (table at 0x0000008000E10000).
    [16349.806417] [drm] PSP is resuming...
    [16349.983666] [drm] reserve 0xa00000 from 0x81fe400000 for PSP TMR
    [16350.175659] amdgpu 0000:0f:00.0: RAS: ras ta ucode is not available
    [16350.199660] amdgpu: [powerplay] SMU is resuming...
    [16350.202662] amdgpu: [powerplay] SMU is resumed successfully!
    [16350.339655] [drm] kiq ring mec 2 pipe 1 q 0
    [16350.352249] [drm] VCN decode and encode initialized successfully(under DPG Mode).
    [16350.352320] [drm] JPEG decode initialized successfully.
    [16350.352323] amdgpu 0000:0f:00.0: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
    [16350.352325] amdgpu 0000:0f:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
    [16350.352326] amdgpu 0000:0f:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
    [16350.352327] amdgpu 0000:0f:00.0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
    [16350.352327] amdgpu 0000:0f:00.0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
    [16350.352328] amdgpu 0000:0f:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
    [16350.352329] amdgpu 0000:0f:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
    [16350.352330] amdgpu 0000:0f:00.0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
    [16350.352331] amdgpu 0000:0f:00.0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
    [16350.352332] amdgpu 0000:0f:00.0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
    [16350.352333] amdgpu 0000:0f:00.0: ring sdma0 uses VM inv eng 12 on hub 0
    [16350.352334] amdgpu 0000:0f:00.0: ring sdma1 uses VM inv eng 13 on hub 0
    [16350.352335] amdgpu 0000:0f:00.0: ring vcn_dec uses VM inv eng 0 on hub 1
    [16350.352336] amdgpu 0000:0f:00.0: ring vcn_enc0 uses VM inv eng 1 on hub 1
    [16350.352337] amdgpu 0000:0f:00.0: ring vcn_enc1 uses VM inv eng 4 on hub 1
    [16350.352338] amdgpu 0000:0f:00.0: ring jpeg_dec uses VM inv eng 5 on hub 1
    [16350.354293] [drm] recover vram bo from shadow start
    [16350.359554] [drm] recover vram bo from shadow done
    [16350.359556] [drm] Skip scheduling IBs!
    [16350.359557] [drm] Skip scheduling IBs!
    [16350.359583] amdgpu 0000:0f:00.0: GPU reset(1) succeeded!
    [16350.359593] [drm] Skip scheduling IBs!
    ...
    [16350.359861] [drm] Skip scheduling IBs!
    [16350.359862] [drm] Skip scheduling IBs!
    [16350.359869] [drm] Skip scheduling IBs!
    [16350.560148] Renderer[2102]: segfault at 0 ip 00007f15202eafcb sp 00007f151a883eb0 error 6 in libxul.so[7f151b48f000+4e7c000]
    [16350.560155] Code: 48 8d 3d a0 a9 4c 02 e8 3b e2 1a fb 85 c0 75 02 58 c3 e8 40 e3 1a fb 0f 1f 84 00 00 00 00 00 50 48 8b 05 d8 a9 3c 02 48 89 10 <89> 34 25 00 00 00 00 e8 21 e3 1a fb 66 0f 1f 84 00 00 00 00 00 85
Comment 9 Shmerl 2020-03-26 17:58:40 UTC
Oh, I missed one preceding dmesg part:

    [16340.923960] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
    [16345.787744] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=945004, emitted seq=945007
    [16345.787817] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process kwin_x11 pid 1378 thread kwin_x11:cs0 pid 1451
    [16345.787824] amdgpu 0000:0f:00.0: GPU reset begin!
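
The kernel messages quoted in comments 8 and 9 follow a recognisable sequence: a ring timeout, the process blamed for it, "GPU reset begin!", and finally "GPU reset(N) succeeded!". A rough sketch for pulling that sequence out of dmesg output is shown here. The function name `summarize_resets` and the regular expressions are assumptions based only on the lines quoted in this report, and message wording may differ between kernel versions.

```python
import re

# Patterns derived from the amdgpu messages quoted in this report.
TIMEOUT = re.compile(r"\*ERROR\* ring (\S+) timeout")
PROCESS = re.compile(r"Process information: process (\S+) pid (\d+)")
RESET_BEGIN = re.compile(r"amdgpu .*GPU reset begin!")
RESET_DONE = re.compile(r"amdgpu .*GPU reset\(\d+\) succeeded!")


def summarize_resets(lines):
    """Collect ring timeouts, blamed processes and reset counts from log lines."""
    summary = {
        "timeouts": [],         # names of rings that timed out
        "processes": [],        # (process name, pid) pairs reported by the driver
        "resets_begun": 0,
        "resets_succeeded": 0,
    }
    for line in lines:
        match = TIMEOUT.search(line)
        if match:
            summary["timeouts"].append(match.group(1))
        match = PROCESS.search(line)
        if match:
            summary["processes"].append((match.group(1), int(match.group(2))))
        if RESET_BEGIN.search(line):
            summary["resets_begun"] += 1
        if RESET_DONE.search(line):
            summary["resets_succeeded"] += 1
    return summary
```

The function takes any iterable of lines, so it can be fed a saved log file or the captured output of dmesg (which may need root privileges on some systems). Note that the process named by the driver is the one whose job timed out, which is not necessarily the one that caused the hang.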
Comment 10 Shmerl 2020-03-27 16:25:13 UTC
OK, just got another hang and reset. Desktop became garbled, with some reactions to input.

I switched to a tty and ran:

DISPLAY=:0 kwin_x11 --replace

That started, and showed various OpenGL info about the system in the tty.

Switching back to the desktop session however showed that it was still garbled, though it looked quite different from before replacing kwin. Shortly after that it hung completely and required REISUB.

So I suppose either the recovery path in KWin doesn't really work, or something in between can't handle GPU resets, or something at the driver level doesn't behave correctly when this happens.
Comment 11 David Edmundson 2020-03-27 16:30:17 UTC
>Switching back to the desktop session however showed that it was still garbled,

Then your issue is kernel deep
Comment 12 Shmerl 2020-03-27 16:38:47 UTC
(In reply to David Edmundson from comment #11)
> >Switching back to the desktop session however showed that it was still garbled,
> 
> Then your issue is kernel deep

Yeah, it's quite possibly this: https://bugs.freedesktop.org/show_bug.cgi?id=111481
Comment 13 Shmerl 2020-03-27 16:39:25 UTC
Or more exactly: https://gitlab.freedesktop.org/drm/amd/issues/892
Comment 14 Shmerl 2020-03-27 18:26:39 UTC
According to AMD developers:

> The kernel driver and mesa support the necessary OpenGL robustness extensions to enable this functionality.  I'm not familiar with kwin's implementation however.

So there is some disconnect here, or possibly bugs either in KWin's usage of the reset logic or in kernel/Mesa.
Comment 15 Shmerl 2020-03-27 18:34:29 UTC
I can try triggering a GPU reset explicitly as a test to see how kwin reacts.
Comment 16 Christoph Feck 2020-04-14 13:55:36 UTC
The current status of this ticket is ambiguous.

Comment 1 says that resets should be handled gracefully in kwin. If they are not, this ticket should get reopened.
Comment 17 David Edmundson 2020-04-14 14:22:12 UTC
There is no indication of this being a kwin bug at this time.
If a complete restart can't fix it, then nothing kwin can do internally could fix it.
Comment 18 Shmerl 2020-04-14 16:46:23 UTC
> There is no indication of this being a kwin bug at this time.

According to AMD developers, KWin is likely not using OpenGL robustness extensions correctly, since GPU reset should be supported properly in the kernel and Mesa stack.

So it is either a bug or incorrect recovery logic in KWin, or a bug in Mesa or in the recovery logic in amdgpu (kernel). Maybe it would be good for KDE and AMD developers to work on this more directly to avoid this disconnect?
Comment 19 Shmerl 2020-04-14 16:46:44 UTC
The bottom line is, for the end user this is not working as it should.
Comment 20 Shmerl 2020-06-03 20:00:45 UTC
So, can anything be done about it? Can you please collaborate with AMD developers on this somehow?

This IS a bug, because in practice, the session doesn't recover. If the reason lies in radeonsi bugs, then they have to be fixed, but if the reason lies in KWin not using the robustness extensions correctly, then KWin has to be fixed.

I'm not an expert on either, so this requires participation of both KDE and Mesa developers to figure out.
Comment 21 David Edmundson 2020-06-04 08:47:59 UTC
You've tested completely restarting kwin and said that did not fix the issue.

If that did not fix it, then absolutely nothing we do with GL robustness or GPU reset handling would fix it. Kwin is not involved in the problem.
Comment 22 Shmerl 2020-06-04 14:39:11 UTC
Or kwin doesn't work correctly. What can be the problem then if it's not kwin?