Bug 392376

Summary:	Wayland socket buffer gets filled up and application terminates when GUI thread was blocked
Product:	[Plasma] kwin	Reporter:	Martin Kostolný <clearmartin>
Component:	wayland-generic	Assignee:	KWin default assignee <kwin-bugs-null>
Status:	RESOLVED FIXED
Severity:	normal	CC:	bugseforuns, hunterofgypsy, johan.helsing, magiblot, nate, patrick.auernig, postix, space9301, voidpointertonull+bugskdeorg
Priority:	NOR
Version First Reported In:	git master
Target Milestone:	---
Platform:	Arch Linux
OS:	Linux
URL:	https://gitlab.freedesktop.org/wayland/wayland/-/issues/159
Latest Commit:		Version Fixed/Implemented In:
Sentry Crash Report:

Description Martin Kostolný 2018-03-26 20:01:01 UTC

Original bug report here: https://bugreports.qt.io/browse/QTBUG-66997

Please read the original report, there is also a minimal application with steps to reproduce the issue. The cmake version of the minimal app is "qt-application.tar.gz".

This ticket is more of a discussion starter. Unfortunately I don't understand the specifics, so the open question is whether the issue should be fixed in the compositor (KWin), in Qt, or in both.

Comment 1 Martin Flöser 2018-03-27 04:22:01 UTC

For KWin there's nothing to do here. It's the task of the client to ensure it handles the events. KWin does not even know that the client stopped processing.

Comment 2 Johan Klokkhammer Helsing 2018-03-27 08:23:16 UTC

Maybe you can send more ping events and stop sending pointer events when a client doesn't answer? I think this is what Weston does. It would probably solve the problem in almost all cases. (I was not able to reproduce a crash on Weston)

It's not currently (without significant hacks) possible to read wayland events without also dispatching them, so there's not much we can really do except tell application code to stop blocking the GUI thread.

From what I can tell, you're also going to have the same problem with blocking GTK clients as they handle events the same way we do.

Comment 3 Martin Flöser 2018-03-27 15:51:15 UTC

Sending all input events is IMHO a feature. The application has a chance to catch up on the events after being unblocked, and nothing is lost.

We can add more pings, sure, but what does it help? Instead of the app crashing, it gets kill -9 by the user. We ping and provide a guitar to kill the application.

IMHO this is neither a problem with the toolkit nor with the compositor. Doing freezing tasks in the main GUI thread was a bad idea 15 years ago and still is. Qt in particular makes it extremely easy to move heavy computations out into a thread: QtConcurrent::run in combination with a QFutureWatcher eliminates all GUI freezing.
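
As a rough illustration of the pattern, here is a minimal sketch that uses only the C++ standard library. It is not Qt code: std::async stands in for QtConcurrent::run, and polling the future from the loop stands in for QFutureWatcher's finished notification. The names heavyComputation, LoopResult and runLoopWithWorker are hypothetical, and the "event loop" is only simulated.

```cpp
#include <chrono>
#include <future>
#include <thread>

// Hypothetical stand-in for a slow task, e.g. parsing a large file.
long heavyComputation(int n) {
    long sum = 0;
    for (int i = 1; i <= n; ++i) {
        sum += i;
    }
    // Simulate the task taking a noticeable amount of time.
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    return sum;
}

struct LoopResult {
    long value;         // result delivered by the worker thread
    int eventsHandled;  // loop iterations that ran while the worker was busy
};

// Simulated event loop: the "GUI thread" keeps iterating while the
// heavy work runs on another thread, instead of blocking on it.
LoopResult runLoopWithWorker(int n) {
    std::future<long> pending =
        std::async(std::launch::async, heavyComputation, n);

    int eventsHandled = 0;
    while (pending.wait_for(std::chrono::milliseconds(0)) !=
           std::future_status::ready) {
        // A real loop would read and dispatch Wayland events here.
        ++eventsHandled;
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    return {pending.get(), eventsHandled};
}
```

Calling runLoopWithWorker(100) returns the sum 5050 together with a non-zero count of loop iterations, i.e. the loop stayed responsive while the worker was running. In Qt the polling would be replaced by a signal from the watcher.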

Comment 4 Martin Flöser 2018-03-27 15:52:06 UTC

Interesting auto completion: gui becomes guitar

Comment 5 Martin Kostolný 2018-05-13 20:54:03 UTC

Thanks for investigating! And sorry for my late response. This is more an informative update.

I've tried a few things recommended by Johan Helsing (https://bugreports.qt.io/browse/QTBUG-66997).

1) Increasing max_dgram_qlen did not seem to help
2) Proposed temporary fix in qtwayland improved the situation but not entirely

Just for info: I'm sure the issue is happening on Weston as well.

I also agree it is the application's responsibility to stay responsive. But I fear that even when all heavy lifting is done outside the GUItar:) thread, there may still be situations where this happens. For example, I get crashes when I open a bigger text file in Kate.

It gets worse when the CPU is already under load from other hungry processes. In that situation, GUI-demanding actions such as moving the mouse up and down over the Kate minimap, which constantly regenerates a tooltip with a text preview, crash Kate as well. But maybe that can also be fixed by the proposed QtConcurrent::run and QFutureWatcher usage. I'm not sure.

Anyway, it seems I'm the only Wayland user hit by this issue, so we can probably wait to see whether somebody else complains. I was merely trying to report this issue before Plasma Wayland reaches a wider audience :).

Comment 6 magiblot 2020-04-21 04:53:37 UTC

This issue can be easily hit by anyone with an HDD. Dolphin, Systemsettings, Kate, Falkon... almost every KDE application is vulnerable to this. This is my most frequent crash in Wayland sessions.

I don't know if this adds anything new to the discussion, but I tracked down the origin of the crash in libwayland.

Qt applications crash from QWaylandDisplay::checkError() after wl_display_dispatch_pending returns negative. So I looked into the library to see what was going on.

The error within libwayland takes place when recvmsg returns -1 with errno = 104 ("Connection reset by peer") in wl_os_recvmsg_cloexec (wayland-os.c). This result goes through wl_connection_read (connection.c) until it is handled by read_events (wayland-client.c).

I don't know how Wayland or KWin work, so my questions might not make a lot of sense: does the above mean that the connection is reset voluntarily by KWin, or is it a consequence of the buffer filling up? Can KWin do anything to prevent the connection from breaking?
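
To make the "buffer filling up" part concrete, here is a small sketch, assuming Linux and plain POSIX sockets. It is not libwayland code: it only shows that a Unix stream socket, the kind of socket a Wayland connection runs over, accepts a finite amount of data when the peer never reads, after which a non-blocking send fails with EAGAIN. What the compositor does once it hits that limit is a separate policy question. The names FillResult and fillSocketUntilFull, and the 64-byte "event" size, are made up for the example.

```cpp
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

struct FillResult {
    std::size_t bytesQueued;  // bytes the kernel accepted before refusing more
    int lastErrno;            // errno reported by the final, failed send()
};

// Writes fixed-size fake "events" into one end of a socket pair while the
// other end never reads, until the kernel refuses to queue any more data.
FillResult fillSocketUntilFull() {
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
        return {0, errno};
    }

    // Make the sending end non-blocking so a full buffer is reported
    // as an error instead of blocking the sender forever.
    int flags = fcntl(fds[0], F_GETFL, 0);
    fcntl(fds[0], F_SETFL, flags | O_NONBLOCK);

    char event[64] = {};  // arbitrary payload size, chosen for illustration
    FillResult result{0, 0};
    for (;;) {
        ssize_t n = send(fds[0], event, sizeof(event), MSG_NOSIGNAL);
        if (n < 0) {
            result.lastErrno = errno;  // expected: EAGAIN once the buffer is full
            break;
        }
        result.bytesQueued += static_cast<std::size_t>(n);
    }

    close(fds[0]);
    close(fds[1]);
    return result;
}
```

The number of bytes accepted depends on the kernel's socket buffer settings, so the sketch reports it rather than assuming a value.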

Comment 7 magiblot 2020-06-09 13:26:14 UTC

I guess the following comment by Pekka Paalanen from https://gitlab.freedesktop.org/wayland/wayland/-/issues/159 can be considered the opinion of Wayland developers on the issue:

> I still think the first step is to ensure the ping/pong protocol works,
> detects stalls fast enough (e.g. ping should be triggered by first input
> event since the last pong + small timeout), and actually leads to stopping
> input events in the compositor. That is relatively easy to do and should
> go a long way.

This would also make it possible to show "Unresponsive application" dialogs for Wayland clients (assuming it is not implemented yet).
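
The scheme in the quote can be sketched as a small state machine. This is only an illustration of the idea, not KWin or Weston code: the class StallDetector, its methods, and the use of plain millisecond timestamps are assumptions made for the example. A ping is triggered by the first input event since the last pong, and input is withheld once the pong is overdue.

```cpp
#include <chrono>

// Hypothetical per-client stall detector on the compositor side.
class StallDetector {
public:
    explicit StallDetector(std::chrono::milliseconds timeout)
        : timeout_(timeout) {}

    // Call before forwarding an input event to the client.
    // Returns true if the event should still be sent.
    bool shouldSendInput(std::chrono::milliseconds now) {
        if (!pingOutstanding_) {
            // First input event since the last pong: send a ping with it.
            pingOutstanding_ = true;
            pingSentAt_ = now;
            ++pingsSent_;
            return true;
        }
        // A ping is pending: keep sending only while the pong is not overdue.
        return now - pingSentAt_ < timeout_;
    }

    // Call when the client answers the ping.
    void onPong() { pingOutstanding_ = false; }

    int pingsSent() const { return pingsSent_; }

private:
    std::chrono::milliseconds timeout_;
    std::chrono::milliseconds pingSentAt_{0};
    bool pingOutstanding_ = false;
    int pingsSent_ = 0;
};
```

With a 100 ms timeout, an event at 0 ms sends a ping and is forwarded, an event at 50 ms is still forwarded, and an event at 150 ms is withheld until onPong() is called. The point at which shouldSendInput first returns false is also where an "unresponsive" dialog could be raised.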

Comment 8 Pedro V 2023-02-20 16:50:06 UTC

It's rather odd that this issue has been known for so long while there's apparently still no fix. Somewhat ironically, Krusader introduced me to this problem: doing heavy operations there, like synchronizing directories, and moving the cursor over the window was a guaranteed way to lose progress.

Also, I find the "RESOLVED UPSTREAM" status quite amusing. It took registering here to see that it apparently means the bug is resolved as being an upstream problem, which is quite unfortunate. The status display could use some improvement.

Is it not feasible, though, to at least throw in a workaround? Upstream doesn't seem too interested in fixing this yet, even though the buffer seems to be tiny. Expecting the other side to "simply" keep up on a non-realtime system appears to be bad design: regardless of buffer size, a program may always fail to get scheduled in time, so not tolerating a full buffer practically guarantees a race condition.

A significantly larger buffer could already help a lot by not punishing at least a single pass of moving the cursor over busy windows so at least people aware of the issue could have an easier time avoiding the problem.
Also, it seems like GNOME stopped waiting for upstream a while ago already: https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/2122

I believe it's a common enough issue to deserve more attention; at least I've run into it often enough. It was especially bad on a laptop under heavy I/O pressure, which led to this problem quite often. Life is better with a more powerful desktop so far, and apparently Krusader stays responsive during synchronization nowadays, but currently Firefox taps out every few days, aside from mystery issues which may or may not be related. And I know that, with the issue being a design flaw causing a race condition, it will never be gone without a fix.

Comment 9 Nate Graham 2023-02-21 19:50:15 UTC

It may indeed be reasonable, yeah.

Comment 10 Patrick Silva 2024-03-26 14:30:36 UTC

This problem persists.

Operating System: Arch Linux 
KDE Plasma Version: 6.0.2
KDE Frameworks Version: 6.0.0
Qt Version: 6.6.2
Graphics Platform: Wayland

Comment 11 Pedro V 2024-04-29 14:39:13 UTC

The problem is definitely not gone completely, but KDE programs got quite resilient over time, and various workarounds tamed most common programs even including Firefox which still tends to lock up globally with mostly malicious websites abusing whatever they can with JS.

This merged change is quite relevant though:
https://gitlab.freedesktop.org/wayland/wayland/-/merge_requests/188

Comment 12 Pedro V 2024-04-29 14:49:13 UTC

*** Bug 484495 has been marked as a duplicate of this bug. ***

Comment 13 postix 2024-04-29 14:53:06 UTC

(In reply to Pedro V from comment #11)
> This merged change is quite relevant though:
> https://gitlab.freedesktop.org/wayland/wayland/-/merge_requests/188

There's also an interesting / relevant discussion about further actions needed (to really fix Lutris and other apps):
https://gitlab.freedesktop.org/wayland/wayland/-/issues/443

> A word of caution though, !188 (merged) allows for unbounded buffers on the client side,
> but server side buffers remain bounded and still use the same size as default.

> In this particular case, assuming the server side buffers get filled by HiDPI mouse events,
> the problem occurs on the compositor side, which eventually kills the client.

> All that to say that !188 (merged) alone might not be sufficient,
> you may need to increase the buffer size on the compositor side as well using
> wl_display_set_default_max_buffer_size() (that !188 (merged) adds).
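
To illustrate why a bounded server-side buffer still matters, here is a toy model in standard C++. It is not libwayland code and makes no claim about how the compositor actually stores events: BoundedEventBuffer and eventsUntilOverflow are hypothetical names, and the sizes used below are illustrative only. It simply counts how many events fit before a stalled client would overflow the cap.

```cpp
#include <cstddef>

// Hypothetical model of a per-client outgoing event buffer with a hard cap.
class BoundedEventBuffer {
public:
    explicit BoundedEventBuffer(std::size_t maxBytes) : maxBytes_(maxBytes) {}

    // Returns false when the event does not fit. In this model that is the
    // point where a compositor would give up on the client.
    bool queue(std::size_t eventBytes) {
        if (used_ + eventBytes > maxBytes_) {
            return false;
        }
        used_ += eventBytes;
        return true;
    }

    // Models the client catching up and reading everything queued so far.
    void drain() { used_ = 0; }

private:
    std::size_t maxBytes_;
    std::size_t used_ = 0;
};

// Counts how many events of a given size can be queued while the client
// reads nothing at all.
std::size_t eventsUntilOverflow(std::size_t maxBytes, std::size_t eventBytes) {
    if (eventBytes == 0) {
        return 0;  // avoid looping forever on zero-sized events
    }
    BoundedEventBuffer buffer(maxBytes);
    std::size_t count = 0;
    while (buffer.queue(eventBytes)) {
        ++count;
    }
    return count;
}
```

In this model, doubling the cap doubles the number of events a blocked client can miss before being dropped: eventsUntilOverflow(4096, 16) gives 256 and eventsUntilOverflow(8192, 16) gives 512. A larger cap buys time for a stalled client but does not remove the limit.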

Comment 14 postix 2024-05-27 12:32:23 UTC

*** Bug 486091 has been marked as a duplicate of this bug. ***

Comment 15 postix 2024-05-27 12:45:10 UTC

*** Bug 460513 has been marked as a duplicate of this bug. ***

Comment 16 Vlad Zahorodnii 2024-09-19 07:22:30 UTC

(In reply to Pedro V from comment #11)
> The problem is definitely not gone completely, but KDE programs got quite
> resilient over time, and various workarounds tamed most common programs even
> including Firefox which still tends to lock up globally with mostly
> malicious websites abusing whatever they can with JS.
> 
> This merged change is quite relevant though:
> https://gitlab.freedesktop.org/wayland/wayland/-/merge_requests/188

kwin already does it.