Bug 377901

Summary: Window movement breaks after 49 days
Product: [Plasma] kwin Reporter: Sven-Haegar Koch <mail-kde>
Component: coreAssignee: KWin default assignee <kwin-bugs-null>
Status: RESOLVED FIXED    
Severity: critical CC: 2112871, bellersenm+bugs.kde, chrgerloff, kennethlakin
Priority: HI Flags: mgraesslin: ReviewRequest+
Version: 5.8.5   
Target Milestone: ---   
Platform: Debian unstable   
OS: Linux   
URL: https://phabricator.kde.org/D5704
Latest Commit: Version Fixed In: 5.8.7
Sentry Crash Report:

Description Sven-Haegar Koch 2017-03-21 21:51:03 UTC
Hallo,

After system startup everything works great.

But after running X11 for around 49 days moving windows with click+drag on the titlebar and windows switching using Alt-Tab with kwin_x11 stop working.

The sympton is exactly like described in https://bugs.kde.org/show_bug.cgi?id=348569 - only without using Cluster-SSH, only using xterm, Firefox and Chrome:

 Thomas Lübking 2015-06-02 07:38:39 UTC
a) Do windows still have a titlebar?
b) Can you move them using "Alt+Left Mousebutton"?
c) Can you switch focus using Alt+Tab?
d) Can you interact with the remaining windows (with the mouse) at all?

 Jan-Marten Brüggemann 2015-06-02 07:49:44 UTC
a) Yes, I have titlebars and can use the minimize, maximize and close buttons as well.
b) No
c) No
d) Yes, all functions of the remaining windows are normal, the only thing is, I can not move any windows and alt+tab does not work.

Putting windows to the front/back by clicking on the title bar of half-overlapping windows still works, as does focus-changing with click or focus-follows mouse.

Only moving is not possible anymore, until X11 gets restarted - then it will work again for 49 days, even without a system reboot, and then break again.

And this problem is reproduceable on multiple computers and for a quite long time (years).

I will not restart the currently affected laptop just yet, in case there are some more infos to find out about the current broken state.

The 49 day "delay" until the bug hits makes it quite difficult to quickly reproduce.

Greetings
Haegar
Comment 1 Ivan Stanton 2017-03-21 22:31:29 UTC
Sure this is not a duplicate of that bug?
To help diagnose the error, you may need to post the output of either:
kwin-x11 --replace
if you use X11 or
kwin-wayland --replace
if you use Wayland (which is almost perfect in 5.10 previews!)
Plus this is 5.8.5; in neon, also probably in Arch and Parabola, 5.8.6 is available
Comment 2 Sven-Haegar Koch 2017-03-21 23:06:08 UTC
As the platform of the bug says this is Debian Unstable, which at least right now still has 5.8.5.

haegar@aurora:~$ kwin_x11 --replace
Module 'org.kde.kwin.decoration' does not contain a module identifier directive - it cannot be protected from external registrations.
kwin_core: Compositing is not possible
QPainter::begin: Paint device returned engine == 0, type: 3
QPainter::setRenderHint: Painter must be active to set rendering hints
QPainter::setWindow: Painter not active
QPainter::setClipRect: Painter not active
QPainter::begin: Paint device returned engine == 0, type: 3
QPainter::setRenderHint: Painter must be active to set rendering hints
QPainter::setWindow: Painter not active
QPainter::setClipRect: Painter not active

Clicking on a window titlebar to move it gets the output:

QXcbConnection: XCB error: 3 (BadWindow), sequence: 6086, resource id: 44119848, major code: 2 (ChangeWindowAttributes), minor code: 0
QXcbConnection: XCB error: 3 (BadWindow), sequence: 7820, resource id: 67109147, major code: 18 (ChangeProperty), minor code: 0
QXcbConnection: XCB error: 3 (BadWindow), sequence: 9398, resource id: 67109148, major code: 18 (ChangeProperty), minor code: 0
QXcbConnection: XCB error: 3 (BadWindow), sequence: 9614, resource id: 67109149, major code: 18 (ChangeProperty), minor code: 0

Not sure that this bug does not have the same source - but as that bug is around Cluster-SSH sending invalid timestamps, which I neither use nor have installed, I decided to open a new one.
Comment 3 Kenneth Lakin 2017-04-27 00:49:58 UTC
This affects me, too, exactly as described.

I'm using 32-bit ~x86 Gentoo Linux on a 32-bit system.

KWin/Plasma 5.9.2
QT 5.7.1 and 4.8.7
xorg 7.4-r2
Wayland is not installed.

I'm using kwin_x11 rather than kwin_wayland

If I refrain from either Alt+Tabbing _or_ switching virtual desktops, kwin is perfectly happy to keep resizing and moving windows. If I do _either_ of those things, kwin emits a single "QXcbConnection: XCB error: 3 ... " message to stdout every time I try to move or resize a window. Alt+Tab doesn't produce any messages to stdout.

While kwin is failing to resize/move windows, I can:

Alter window Z-order (either by clicking on a window's titlebar or using the raise/lower window keyboard shortcuts)
Switch between virtual desktops.
Create new windows.
Close windows.
Start krunner (but unlike normal (where it automatically gets keyboard focus) I have to manually click it for it to have keyboard focus).

The only "fix" is to restart kwin_x11 and never press alt+tab, or switch virtual desktops. :(

I've tried restarting kwin_x11 by doing all of

killall kwin_x11 ; sleep $X ; kwin_x11 &
kwin_x11 --replace &
kstart5 -- kwin_x11 --replace 
kwrapper5 kwin_x11 --replace &

None of them fix the problem.

I've gotta use this computer for work, so I'm going to begrudgingly go throw all my state in a trash fire and restart KDE. I politely suggest that someone fire up a VM and leave it running for ~49.5 days to debug this. Some of us have long-running systems that we don't want to have to restart for no reasonable reason.

(Don't get me wrong. As a programmer, I get that KDE is a complicated system. I'm also fully aware that most folks working on it are volunteers. :) I'm also aware that the faulty component might not even be _in_ KDE. But, like, _Windows 95_ was affected by this sort of error. Everyone who _ever_ works with anything having to do with time should have learned from that. ;))

I'm more than happy to provide additional information upon request. :)
Comment 4 Christoph Feck 2017-04-27 01:11:02 UTC
49 days are 2^32 ms, so it might be because of a timer only using 32 bit.
Comment 5 Martin Flöser 2017-04-27 05:13:17 UTC
Waiting for 49 from a dev running the system in a VM won't really help you. All we could know then is that we can reproduce it, that info is not worth to wait two days.

The good information is the 32 bit msec timestamp. Those are used a lot on X11 so it's quite likely that they overflow. I try to investigate where they are used and what a smaller timestamp might mean. We do have sanity checks there, maybe they need to be weakened.
Comment 6 Kenneth Lakin 2017-04-27 05:37:02 UTC
> We do have sanity checks there, maybe they need to be weakened.

I'm 100% certain that I had uninterrupted KDE4 sessions that ran trouble-free for longer than 50 days.

Best of luck in tracking down the errors. Thanks for looking at this!
Comment 7 Martin Flöser 2017-04-27 14:50:18 UTC
We do have the following code:

    void setX11Time(xcb_timestamp_t timestamp) {
        if (timestamp > m_x11Time) {
            m_x11Time = timestamp;
        }
    }

this means if the timestamp wraps around we will never update again. And the x11Time is used for various checks. So this is clearly fishy. I need to check in our git log why we did the sanity checking here, but I'm sure there is a good reason for it.
Comment 8 Martin Flöser 2017-04-27 14:52:33 UTC
Relevant commit in KWin is:
commit d95ab94f0a664c09223cf1b6d273da41b1cb625e
Author: Martin Gräßlin <mgraesslin@kde.org>
Date:   Thu Feb 19 09:59:21 2015 +0100

    Migrate away from QX11Info::appTime
    
    The porting to Qt5 broke the timestamp handling in many areas. A deeper
    look into Qt's xcb plugin shows that the appTime handling is not
    sufficient for KWin's need. E.g. the time is only updated in response to
    a property notify event if it's for a Qt created window, which is hardly
    ever the case in KWin. Another example is that key press/release events
    never updated the appTime.
    
    As the functionality in Qt is rather trivial we can do the timestamp
    handling ourselves. We filter all events anyway and it is slightly faster
    as we don't have to go through the QPA interface any more.

----

This means I need to compare how it was done in Qt 4.
Comment 9 Martin Flöser 2017-04-28 05:16:25 UTC
Some additional investigation results: both moving windows and Alt+Tab require a keyboard/pointer grab. The xTime() is going into those requests and the documentation says:

"The request fails with status InvalidTime if the specified time is earlier than the last-pointer-grab time or later than the current server time. Otherwise, the last-pointer-grab time is set to the specified time, with CurrentTime replaced by the current server time."

If we don't update the xTime any more due to overflow we are in the situation that our specified time is "later than current server time". On the other hand if we pass the new wrapped around number we fall into the condition that "specified time is earlier than the last-pointer-grab time".

To illustrate: let's assume the last-pointer-grab time is 2^32, Now we increment by 1 and will have a server time of 0. If we send in 0 it is earlier than the last-pointer-grab time. On the other hand if we send in our last state from updating X time we are later than the server time.

To a certain degree this is broken in the X11 specification and if the X-Server behaves exactly like described in the spec we won't be able to fix it. KWin cannot use the CurrentTime check as we need to be explicit in the time we request, or we would run into other bugs. So currently what I see is that we need to relax our conditions in order to get the grab still working, but window management will probably still be broken, till some other application requests a pointer grab with 0.
Comment 10 Kenneth Lakin 2017-04-28 19:41:01 UTC
(In reply to Martin Gräßlin from comment #9)
> Some additional investigation results: both moving windows and Alt+Tab
> require a keyboard/pointer grab. The xTime() is going into those requests
> and the documentation says:
> 
> "The request fails with status InvalidTime if the specified time is earlier
> than the last-pointer-grab time or later than the current server time.
> Otherwise, the last-pointer-grab time is set to the specified time, with
> CurrentTime replaced by the current server time."

I may be out of my depth (or reading the wrong documentation), but this quote seems relevant:

"A timestamp is a time value, expressed in milliseconds. It typically is the time since the last server reset. Timestamp values wrap around (after about 49.7 days). The server, given its current time is represented by timestamp T, always interprets timestamps from clients by treating half of the timestamp space as being later in time than T. One timestamp value, named CurrentTime, is never generated by the server. This value is reserved for use in requests to represent the current server time."

via: https://www.x.org/releases/X11R7.7/doc/libX11/libX11/libX11.html#Pointer_Grabbing

So it looks like maybe the X11 folks already handle the timestamp rollover case, and kwin is being too clever?
Comment 11 Martin Flöser 2017-04-28 20:56:37 UTC
That is the documentation of Xlib, but KWin uses xcb. Mostly the two libraries match, but sometimes there are differences. Also a spec is one thing, the other is how the one and only X-Server behaves. Long story short: I don't trust any x documentation.
Comment 12 Kenneth Lakin 2017-04-28 21:41:46 UTC
(In reply to Martin Gräßlin from comment #11)
> That is the documentation of Xlib, but KWin uses xcb. Mostly the two
> libraries match, but sometimes there are differences.

Fair enough, but the XCB guide [0] says:

"For XCB, the documentation relies even more heavily on the protocol specifications. The generated API is an exact mapping to the X protocol; it translates the C call data into X protocol encoded packets as straightforwardly as possible."

I would _expect_ that XCB wouldn't do something that runs contrary to the xlib documentation's definition of basic datatypes, such as timestamps. Maybe you've gotten bitten by just such an incompatibility, though?

> Also a spec is one thing, the other is how the one and only
> X-Server behaves.

True, but I expect a few things to be true
* xlib's description of how timestamp rollover is handled in the X server is how it's intended to be handled.
* The x.org developers working on the parts of the server that handle timestamps aren't careless noobs, so xorg's handling of timestamp rollover is going to match the described behavior.

> Long story short: I don't trust any x documentation.

That's not an entirely unreasonable stance to have. But, if you can't easily test to check whether the documented behavior matches the implemented behavior, isn't it more pragmatic to code as if the documented behavior is true, and rely on user bug reports (supported by "The implemented behavior didn't match the docs!" prints in the code, where practical) to let you know when the implementation doesn't match the docs? :)

[0] https://www.x.org/wiki/guide/xlib-and-xcb/#index5h2
Comment 13 Martin Flöser 2017-04-29 09:04:32 UTC
> I would _expect_ that XCB wouldn't do something that runs contrary to the
> xlib documentation's definition of basic datatypes, such as timestamps.
> Maybe you've gotten bitten by just such an incompatibility, though?

Unfortunately yes. Recent case: https://phabricator.kde.org/D5549

Also there are differences in the library. E.g. Xlib is older than 64 bit, so some parts which are in the protocol defined as int32 are set to long in Xlib, which results in nice incompatible code on 64 bit. So yeah it doesn't match 1:1 and in these aspects doesn't match the X protocol. That's why I don't trust the XLib library any more for interaction with X.

Anyway I checked the x protocol and timestamp is a card32 which would support the description of XLib. Unfortunately the x protocol doesn't talk about how timestamps are intended to be.

> 
> > Also a spec is one thing, the other is how the one and only
> > X-Server behaves.
> 
> True, but I expect a few things to be true
> * xlib's description of how timestamp rollover is handled in the X server is
> how it's intended to be handled.
> * The x.org developers working on the parts of the server that handle
> timestamps aren't careless noobs, so xorg's handling of timestamp rollover
> is going to match the described behavior.
> 
> > Long story short: I don't trust any x documentation.
> 
> That's not an entirely unreasonable stance to have. But, if you can't easily
> test to check whether the documented behavior matches the implemented
> behavior, isn't it more pragmatic to code as if the documented behavior is
> true, and rely on user bug reports (supported by "The implemented behavior
> didn't match the docs!" prints in the code, where practical) to let you know
> when the implementation doesn't match the docs? :)

well I would like to fix it for good and not have to wait for the next bug fix release + 50 days to figure out whether it worked ;-)
Comment 14 Christian Gerloff 2017-05-02 14:51:10 UTC
(In reply to Martin Gräßlin from comment #13)
> well I would like to fix it for good and not have to wait for the next bug
> fix release + 50 days to figure out whether it worked ;-)

The Linux kernel also has a counter that could wrap around and produce potential errors. The devs help themselves by setting the initial value of that counter very high, so a wrap occurs soon after boot:

/*
 * Have the 32 bit jiffies value wrap 5 minutes after boot
 * so jiffies wrap bugs show up earlier.
 */
#define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ))


Could the same be done for the X-Server, if just for debugging and testing?
Comment 15 Kenneth Lakin 2017-05-04 18:40:59 UTC
(In reply to Christian Gerloff from comment #14)
> The devs help themselves by setting the initial value of
> that counter very high, so a wrap occurs soon after boot:

Why's that better than doing what the X server is documented to do, [0] namely:

"Timestamp values wrap around (after about 49.7 days). The server, given its current time is represented by timestamp T, always interprets timestamps from clients by treating half of the timestamp space as being later in time than T."

[0] And almost certainly _does_ do, given that KWin 4 and numerous other X11 window managers manage to run for > 49.7 days without issue.
Comment 16 Martin Flöser 2017-05-05 17:11:56 UTC
Git commit 0bec9ad7337536e319c17c5684d97e1156399fdb by Martin Gräßlin.
Committed on 03/05/2017 at 19:36.
Pushed by graesslin into branch 'Plasma/5.8'.

Improve the x11 timestamp handling

Summary:
So far KWin only updated the x11 timestamp if the new timestamp is larger
than the existing one. While this is a useful thing it creates problems
when the 32 bit msec based time stamp wraps around which happens after
running an X server for 49 days. After the timestamp wrapped around KWin
would not update the timestamp any more and thus some calls might fail.
Most prominent victims are keyboard and pointer grab which fails as the
timestamp is either larger than the server timestamp or smaller than the
last grab timestamp.

Another problem related to timestamp handling is KWin getting broken by
wrong timestamps sent by applications. A prominent example is clusterssh
which used to send a timestamp as unix time which is larger than the
x timestamp and thus our timestamp gets too large.

This change addresses these problems by allowing to reset the timestamp.
This is only used from updateXTime (which is normally invoked before we
do things like grabKeyboard). Thus we make QX11Info::getTimestamp the
ultimate trusted source for timestamps.
Related: bug 348569
FIXED-IN: 5.8.7

Test Plan: As I cannot wait 50 days: unit tests for the two conditions added.

Reviewers: #kwin, #plasma

Subscribers: plasma-devel, kwin

Tags: #kwin

Differential Revision: https://phabricator.kde.org/D5704

M  +11   -0    autotests/CMakeLists.txt
A  +123  -0    autotests/test_x11_timestamp_update.cpp     [License: GPL (v2)]
M  +6    -2    main.h
M  +1    -1    utils.cpp
M  +3    -3    utils.h

https://commits.kde.org/kwin/0bec9ad7337536e319c17c5684d97e1156399fdb
Comment 17 Martin Flöser 2017-06-30 14:48:22 UTC
*** Bug 381823 has been marked as a duplicate of this bug. ***
Comment 18 bellersenm+bugs.kde 2017-08-22 09:37:48 UTC
50 days have past on my machine.
$ uptime
 10:54:38 up 50 days, 4 min

Alt-Tab is still working. I can move and resize windows around.
Nice.

But copying from old windows is still broken.
When I copy (mouse select, mouse select - right click - copy, select Ctrl-C) in KDE/Qt-Windows, nothing is copied. No new entry in klipper. And something different then the copied thing when I paste somewhere (Ctrl-V, mouse middle click, mouse right click - paste).
It seems specific to those KDE/Qt-Windows, because copying from Firefox, XTerm, GVim still works. And pasting in old KDE/Qt-Windows works too.
When I restart the application, it works normally again.

Debian-Testing
$ dpkg -l | grep kwin
ii  kwin-addons                                                      4:5.8.5-2                         amd64        additional desktop and window switchers for KWin
ii  kwin-common                                                      4:5.8.7-1                         amd64        KDE window manager, common files
ii  kwin-data                                                        4:5.8.7-1                         all          KDE window manager data files
ii  kwin-style-breeze                                                4:5.8.5-2                         amd64        KWin Breeze Style
ii  kwin-x11                                                         4:5.8.7-1                         amd64        KDE window manager, X11 version
ii  libkwin4-effect-builtins1                                        4:5.8.7-1                         amd64        KDE window manager effect builtins library
ii  libkwineffects9                                                  4:5.8.7-1                         amd64        KDE window manager effects library
ii  libkwinglutils9                                                  4:5.8.7-1                         amd64        KDE window manager gl utils library
ii  libkwinxrenderutils9                                             4:5.8.7-1                         amd64        KDE window manager render utils library
ii  qml-module-org-kde-kwindowsystem:amd64                           5.28.0-1                          amd64        provides integration of QML and KDE frameworks - kwindowsystem
Comment 19 Martin Flöser 2017-08-22 15:10:23 UTC
The copy but is a different issue and sounds a lot like a bug in Qt. Please report there.