Bug 392376

Summary: Wayland socket buffer gets filled up and application terminates when GUI thread was blocked
Product: [Plasma] kwin
Component: wayland-generic
Status: REOPENED ---
Severity: normal
Priority: NOR
Version: git master
Target Milestone: ---
Platform: Arch Linux
OS: Linux
Reporter: Martin Kostolný <clearmartin>
Assignee: KWin default assignee <kwin-bugs-null>
CC: johan.helsing, magiblot, nate, postix, voidpointertonull+bugskdeorg
URL: https://gitlab.freedesktop.org/wayland/wayland/-/issues/159
See Also: https://bugs.kde.org/show_bug.cgi?id=433218
Latest Commit:
Version Fixed In:

Description Martin Kostolný 2018-03-26 20:01:01 UTC
Original bug report here: https://bugreports.qt.io/browse/QTBUG-66997

Please read the original report; it also includes a minimal application with steps to reproduce the issue. The CMake version of the minimal app is "qt-application.tar.gz".

This ticket is more of a discussion starter. Unfortunately, I don't understand the specifics, so the open question is whether the issue should be fixed in the compositor (KWin), in Qt, or in both.
Comment 1 Martin Flöser 2018-03-27 04:22:01 UTC
For KWin there's nothing to do here. It's the task of the client to ensure it handles the events. KWin does not even know that the client stopped processing.
Comment 2 Johan Klokkhammer Helsing 2018-03-27 08:23:16 UTC
Maybe you can send more ping events and stop sending pointer events when a client doesn't answer? I think this is what Weston does. It would probably solve the problem in almost all cases. (I was not able to reproduce a crash on Weston)

It's not currently (without significant hacks) possible to read wayland events without also dispatching them, so there's not much we can really do except tell application code to stop blocking the GUI thread.

From what I can tell, you're also going to have the same problem with blocking GTK clients as they handle events the same way we do.
Comment 3 Martin Flöser 2018-03-27 15:51:15 UTC
Sending all input events is IMHO a feature. The application has a chance to catch up on the events after being unblocked, and nothing is lost.

We can add more pings, sure, but how does that help? Instead of the app crashing, it gets a kill -9 from the user. We ping and provide a guitar to kill the application.

IMHO this is neither a problem with the toolkit nor with the compositor. Running blocking tasks in the main GUI thread was a bad idea 15 years ago and still is. Qt in particular makes it extremely easy to move heavy computations out into a thread: QtConcurrent::run in combination with a QFutureWatcher eliminates all GUI freezing.
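
For illustration only (not part of the original comment): the pattern recommended here, moving heavy work off the thread that services events, can be sketched in a toolkit-neutral way. The sketch below uses Python's standard `concurrent.futures` as a stand-in for `QtConcurrent::run` and `QFutureWatcher`. The names `heavy_computation` and `run_responsive_loop` are hypothetical, and the polling loop only approximates a real GUI event loop.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def heavy_computation(n):
    """Hypothetical stand-in for work that would otherwise block the GUI thread."""
    time.sleep(0.2)  # simulate slow I/O or computation
    return sum(range(n))


def run_responsive_loop(n):
    """Run the heavy work in a worker thread while the caller keeps handling events."""
    events_handled = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(heavy_computation, n)
        while not future.done():
            # Stand-in for dispatching one batch of pending events
            # (in a Wayland client, this is where the socket would be drained).
            events_handled += 1
            time.sleep(0.01)
        return future.result(), events_handled


result, events_handled = run_responsive_loop(1000)
print(f"result={result}, event batches handled meanwhile={events_handled}")
```

The point of the sketch is only that the event-handling thread keeps running while the work is in progress; in Qt the completion would normally be delivered through a signal instead of polling.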
Comment 4 Martin Flöser 2018-03-27 15:52:06 UTC
Interesting auto completion: gui becomes guitar
Comment 5 Martin Kostolný 2018-05-13 20:54:03 UTC
Thanks for investigating! And sorry for my late response. This is more of an informative update.

I've tried a few things recommended by Johan Helsing (https://bugreports.qt.io/browse/QTBUG-66997).

1) Increasing max_dgram_qlen did not seem to help
2) The proposed temporary fix in qtwayland improved the situation, but not entirely

Just for info: I'm sure the issue is happening on Weston as well.

I also agree it is the application's responsibility to stay responsive. But I fear that even if all heavy lifting is done outside the GUItar:) thread, there may still be situations where this happens. For example, I get crashes when I open a bigger text file in Kate.

It gets worse when the CPU is already under load from other hungry processes. In that situation, performing a GUI-demanding task, such as moving the mouse up and down over the Kate minimap (which constantly generates a tooltip with a text preview), crashes Kate as well. But maybe that can also be fixed by the proposed QtConcurrent::run and QFutureWatcher usage. I'm not sure.

Anyway, it seems I'm the only Wayland user hit by this issue, so we can probably wait to see whether somebody else complains. I was merely trying to report this issue before Plasma Wayland reaches a wider audience :).
Comment 6 magiblot 2020-04-21 04:53:37 UTC
This issue can easily be hit by anyone with an HDD. Dolphin, Systemsettings, Kate, Falkon... almost every KDE application is vulnerable to this. This is my most frequent crash in Wayland sessions.

I don't know if this adds anything new to the discussion, but I tracked down the origin of the crash in libwayland.

Qt applications crash from QWaylandDisplay::checkError() after wl_display_dispatch_pending returns a negative value. So I looked into the library to see what was going on.

The error within libwayland takes place when recvmsg returns -1 with errno = 104 ("Connection reset by peer") in wl_os_recvmsg_cloexec (wayland-os.c). This result goes through wl_connection_read (connection.c) until it is handled by read_events (wayland-client.c).

I don't know how Wayland or KWin work, so my questions might not make a lot of sense: does the above mean that the connection is reset voluntarily by KWin, or is it a consequence of the buffer filling up? Can KWin do anything to prevent the connection from breaking?
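
The buffer-filling half of this question can be observed without Wayland at all. The following minimal sketch, assuming a Unix-like system, writes into one end of a Unix socket pair whose other end never reads, and reports how many bytes the kernel accepts before a non-blocking send is refused. `bytes_until_full` is a hypothetical helper, and the sketch says nothing about what KWin or libwayland do once that point is reached.

```python
import socket


def bytes_until_full(chunk_size=4096, limit=64 * 1024 * 1024):
    """Send to a peer that never reads; return (bytes accepted, whether the buffer filled)."""
    writer, stalled_reader = socket.socketpair()  # connected Unix stream sockets
    writer.setblocking(False)
    chunk = b"\0" * chunk_size
    sent = 0
    try:
        while sent < limit:
            try:
                sent += writer.send(chunk)
            except BlockingIOError:
                # The kernel buffer is full: a blocked reader means the
                # writer cannot queue any more data (EAGAIN/EWOULDBLOCK).
                return sent, True
        return sent, False  # safety limit reached without filling the buffer
    finally:
        writer.close()
        stalled_reader.close()


sent, buffer_filled = bytes_until_full()
print(f"accepted {sent} bytes before the socket refused more: {buffer_filled}")
```

The number of bytes accepted depends on the system's socket buffer settings, so no specific figure should be expected.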
Comment 7 magiblot 2020-06-09 13:26:14 UTC
I guess the following comment by Pekka Paalanen from https://gitlab.freedesktop.org/wayland/wayland/-/issues/159 can be considered the opinion of Wayland developers on the issue:

> I still think the first step is to ensure the ping/pong protocol works,
> detects stalls fast enough (e.g. ping should be triggered by first input
> event since the last pong + small timeout), and actually leads to stopping
> input events in the compositor. That is relatively easy to do and should
> go a long way.
This would also make it possible to show "Unresponsive application" dialogs for Wayland clients (assuming it is not implemented yet).
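
One possible reading of the quoted proposal, as a sketch and not a description of any existing compositor: the first input event since the last pong triggers a ping, and once that ping has gone unanswered for longer than a small timeout, further input events are withheld until a pong arrives. The class `InputGate` and its method names are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
class InputGate:
    """Decides whether a compositor should forward an input event to a client."""

    def __init__(self, timeout):
        self.timeout = timeout      # seconds to wait for a pong before stalling input
        self.ping_sent_at = None    # None means no ping is outstanding

    def on_input_event(self, now):
        """Return True if the event should be forwarded, False if it should be withheld."""
        if self.ping_sent_at is None:
            # First input event since the last pong: send a ping alongside it.
            self.ping_sent_at = now
            return True
        # A ping is outstanding: forward only while the client is within the timeout.
        return (now - self.ping_sent_at) <= self.timeout

    def on_pong(self):
        """The client answered, so it is processing events again."""
        self.ping_sent_at = None


gate = InputGate(timeout=0.5)
forwarded = [gate.on_input_event(t) for t in (0.0, 0.4, 1.0)]
gate.on_pong()
forwarded_after_pong = gate.on_input_event(1.1)
print(forwarded, forwarded_after_pong)
```

With a gate like this, a stalled client stops accumulating input events in its socket buffer after the timeout, and the same unanswered-ping state could drive an "Unresponsive application" dialog.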
Comment 8 Pedro V 2023-02-20 16:50:06 UTC
It's rather odd to see this issue known for so long while there's apparently still no fix. Somewhat ironically, Krusader introduced me to this problem: running a heavy operation there, such as synchronizing directories, while moving the cursor over the window was a guaranteed way to lose progress.

Also, I find the "RESOLVED UPSTREAM" status quite amusing. It took registering here to see that it apparently means the bug was resolved as being an upstream problem, which is quite unfortunate. The status display could use some improvement.

Is it not feasible, though, to at least throw in a workaround? Upstream doesn't seem too interested in fixing it yet, even though the buffer seems to be tiny. Expecting the other side to "simply" keep up on a non-realtime system appears to be bad design: regardless of buffer size, a program may always fail to get scheduled in time, so not tolerating a buffer-full state practically guarantees a race condition.

A significantly larger buffer could already help a lot: a single pass of the cursor over a busy window would no longer be punished, so people aware of the issue would at least have an easier time avoiding it.
Also, it seems like GNOME stopped waiting for upstream a while ago already: https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/2122

I believe it's a common enough issue to deserve more attention; at least I've run into it often enough. It was especially bad on a laptop under heavy I/O pressure, which led to this problem quite often. Life is better with a more powerful desktop so far, and apparently Krusader stays responsive during synchronization nowadays, but currently Firefox taps out every few days, aside from mystery issues which may or may not be related. And since the issue is a design flaw causing a race condition, I know it will never be gone without a fix.
Comment 9 Nate Graham 2023-02-21 19:50:15 UTC
It may indeed be reasonable, yeah.