Bug 438097

Summary: [AnimationEffect] kwin_wayland sometimes crashes when xwayland tries to display a popup
Product: [Plasma] kwin Reporter: Lewis Lakerink <lewis>
Component: wayland-genericAssignee: KWin default assignee <kwin-bugs-null>
Status: RESOLVED FIXED    
Severity: crash CC: aleixpol, harry, lacaceres97, lewis, nate, telepathine, vlad.zahorodnii, volchok
Priority: NOR    
Version: 5.22.90   
Target Milestone: ---   
Platform: Compiled Sources   
OS: Linux   
Latest Commit: Version Fixed In: 5.23
Sentry Crash Report:
Attachments: Backtrace
Backtrace on 5.22.4
Backtrace from 5.22.90
Code completion popup window IDs in the debug console
Screencast demonstrating the rising amount of open files by KWin
Outputs of lsof -p $KWIN_PID taken every second, from the test start to KWin crash
Timelapse of lsof count, after increasing the open fd limit to 8192
A C program that reproduces the bug
A Python script that parses strace output for FD information
Potential solution (untested)

Description Lewis Lakerink 2021-06-04 23:43:34 UTC
SUMMARY
(I assume) Some kind of internal effects change causes animations to be added with a null EffectWindow* key in m_animations.

I've also observed crashes in AnimationEffect::paintWindow() - however I don't have a stack trace or line number for this, as I was using the ArchLinux 5.21.90 packages without debug info.

STEPS TO REPRODUCE
1. Use an xwayland jetbrains IDE for some time
2. Eventually right clicking the source code to bring up the context menu will crash kwin
3. The crash can take minutes or hours of coding to occur, but it's almost always triggered for me by right clicking in the jetbrains IDE.

OBSERVED RESULT
She's dead, Jim.

Core was generated by `kwin_wayland --wayland_fd 4 --xwayland /usr/lib/startplasma-waylandsession'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fa0303c6853 in KWin::AnimationEffect::updateLayerRepaints (this=this@entry=0x55e9754325f0) at /home/lewis/Documents/kwin-arch/kwin/src/kwin/src/libkwineffects/kwinanimationeffect.cpp:925sometimes
925                 const QRect geo = entry.key()->expandedGeometry();
[Current thread is 1 (Thread 0x7fa02953b840 (LWP 1747))]
(gdb) bt
#0  0x00007fa0303c6853 in KWin::AnimationEffect::updateLayerRepaints() (this=this@entry=0x55e9754325f0)
    at /home/lewis/Documents/kwin-arch/kwin/src/kwin/src/libkwineffects/kwinanimationeffect.cpp:925
#1  0x00007fa0303c74cf in KWin::AnimationEffect::postPaintScreen() (this=0x55e9754325f0) at /home/lewis/Documents/kwin-arch/kwin/src/kwin/src/libkwineffects/kwinanimationeffect.cpp:703
#2  0x00007fa0306a8351 in KWin::EffectsHandlerImpl::postPaintScreen() (this=0x55e974fb35d0) at /home/lewis/Documents/kwin-arch/kwin/src/kwin/src/effects.cpp:419
#3  0x00007fa03073f0a7 in KWin::Scene::paintScreen(int*, QRegion const&, QRegion const&, QRegion*, QRegion*, KWin::RenderLoop*, QMatrix4x4 const&)
    (this=this@entry=0x55e9747823c0, mask=mask@entry=0x7ffd2f6b78b4, damage=..., repaint=..., updateRegion=updateRegion@entry=0x7ffd2f6b78b8, validRegion=validRegion@entry=0x7ffd2f6b78c0, renderLoop=<optimized out>, projection=<optimized out>) at /home/lewis/Documents/kwin-arch/kwin/src/kwin/src/scene.cpp:206
#4  0x00007fa0263f31f1 in KWin::SceneOpenGL::paint(int, QRegion const&, QList<KWin::Toplevel*> const&, KWin::RenderLoop*) (this=
    0x55e9747823c0, screenId=0, damage=<optimized out>, toplevels=<optimized out>, renderLoop=<optimized out>)
    at /home/lewis/Documents/kwin-arch/kwin/src/kwin/src/plugins/scenes/opengl/scene_opengl.cpp:696
#5  0x00007fa03066d388 in KWin::Compositor::composite(KWin::RenderLoop*) (this=<optimized out>, renderLoop=<optimized out>)
    at /home/lewis/Documents/kwin-arch/kwin/src/kwin/src/composite.cpp:628
#6  0x00007fa02eba83bb in  () at /usr/lib/libQt5Core.so.5
#7  0x00007fa03061cb72 in KWin::RenderLoop::frameRequested(KWin::RenderLoop*) (this=<optimized out>, _t1=<optimized out>)
    at /home/lewis/Documents/kwin-arch/kwin/src/build/src/kwin_autogen/EWIEGA46WW/moc_renderloop.cpp:206
#8  0x00007fa03072b703 in KWin::RenderLoopPrivate::dispatch() (this=0x55e974f66ed0) at /home/lewis/Documents/kwin-arch/kwin/src/kwin/src/renderloop.cpp:162
#9  0x00007fa02eba83bb in  () at /usr/lib/libQt5Core.so.5
#10 0x00007fa02ebac38b in QTimer::timeout(QTimer::QPrivateSignal) () at /usr/lib/libQt5Core.so.5
#11 0x00007fa02eb9e07f in QObject::event(QEvent*) () at /usr/lib/libQt5Core.so.5
#12 0x00007fa02faffd62 in QApplicationPrivate::notify_helper(QObject*, QEvent*) () at /usr/lib/libQt5Widgets.so.5
#13 0x00007fa02eb7100a in QCoreApplication::notifyInternal2(QObject*, QEvent*) () at /usr/lib/libQt5Core.so.5
#14 0x00007fa02ebc8fad in QTimerInfoList::activateTimers() () at /usr/lib/libQt5Core.so.5
#15 0x00007fa02ebc7489 in QEventDispatcherUNIX::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () at /usr/lib/libQt5Core.so.5
#16 0x000055e972c2d47e in QUnixEventDispatcherQPA::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) ()
#17 0x00007fa02eb6f97c in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () at /usr/lib/libQt5Core.so.5
#18 0x00007fa02eb77ee4 in QCoreApplication::exec() () at /usr/lib/libQt5Core.so.5
#19 0x000055e972bc73c6 in main(int, char**) (argc=<optimized out>, argv=<optimized out>) at /home/lewis/Documents/kwin-arch/kwin/src/kwin/src/main_wayland.cpp:719
(gdb)


EXPECTED RESULT
kwin_wayland doesn't crash.

SOFTWARE/OS VERSIONS
Operating System: Arch Linux
KDE Plasma Version: 5.21.90
KDE Frameworks Version: 5.82.0
Qt Version: 5.15.2
Kernel Version: 5.12.9-arch1-1 (64-bit)
Graphics Platform: Wayland
Processors: 12 × AMD Ryzen 5 PRO 4650U with Radeon Graphics
Memory: 14.9 GiB of RAM
Graphics Processor: AMD RENOIR

ADDITIONAL INFORMATION
Speculation: I know Vlad has been doing a lot of work for 5.22 in the area of effects and scene graph changes, I assume a bug has crept in from this work at some point. 5.21 never had this problem for me.

If I get further traces with different error locations related to effects/animations, I'll post them with this ticket.

I've built the current master with debug in case the issue from 5.21.90 had been fixed upstream, but it hasn't (and I needed a backtrack), which is why I'm reporting this against current master.
Comment 1 Lewis Lakerink 2021-06-05 01:37:39 UTC
Another related backtrace. Similar cause, the line in question is "entry.key()->addLayerRepaint(entry->second);"

This time the trigger for the crash was a code completion popup in PyCharm.

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000300000001 in ?? ()
[Current thread is 1 (Thread 0x7f50f658b840 (LWP 12571))]
(gdb) bt
#0  0x0000000300000001 in  ()
#1  0x00007f50fd4174c3 in KWin::AnimationEffect::postPaintScreen() (this=0x55fa962bccb0) at /home/lewis/Documents/kwin-arch/kwin/src/kwin/src/libkwineffects/kwinanimationeffect.cpp:713
#2  0x00007f50fd6f8351 in KWin::EffectsHandlerImpl::postPaintScreen() (this=0x55fa95f75110) at /home/lewis/Documents/kwin-arch/kwin/src/kwin/src/effects.cpp:419
#3  0x00007f50fd78f0a7 in KWin::Scene::paintScreen(int*, QRegion const&, QRegion const&, QRegion*, QRegion*, KWin::RenderLoop*, QMatrix4x4 const&)
    (this=this@entry=0x55fa9552d240, mask=mask@entry=0x7fff6b332894, damage=..., repaint=..., updateRegion=updateRegion@entry=0x7fff6b332898, validRegion=validRegion@entry=0x7fff6b3328a0, renderLoop=<optimized out>, projection=<optimized out>) at /home/lewis/Documents/kwin-arch/kwin/src/kwin/src/scene.cpp:206
#4  0x00007f50f466b1f1 in KWin::SceneOpenGL::paint(int, QRegion const&, QList<KWin::Toplevel*> const&, KWin::RenderLoop*) (this=
    0x55fa9552d240, screenId=0, damage=<optimized out>, toplevels=<optimized out>, renderLoop=<optimized out>)
    at /home/lewis/Documents/kwin-arch/kwin/src/kwin/src/plugins/scenes/opengl/scene_opengl.cpp:696
#5  0x00007f50fd6bd388 in KWin::Compositor::composite(KWin::RenderLoop*) (this=<optimized out>, renderLoop=<optimized out>)
    at /home/lewis/Documents/kwin-arch/kwin/src/kwin/src/composite.cpp:628
#6  0x00007f50fbbf83bb in  () at /usr/lib/libQt5Core.so.5
#7  0x00007f50fd66cb72 in KWin::RenderLoop::frameRequested(KWin::RenderLoop*) (this=<optimized out>, _t1=<optimized out>)
    at /home/lewis/Documents/kwin-arch/kwin/src/build/src/kwin_autogen/EWIEGA46WW/moc_renderloop.cpp:206
#8  0x00007f50fd77b703 in KWin::RenderLoopPrivate::dispatch() (this=0x55fa95e12a90) at /home/lewis/Documents/kwin-arch/kwin/src/kwin/src/renderloop.cpp:162
#9  0x00007f50fbbf83bb in  () at /usr/lib/libQt5Core.so.5
#10 0x00007f50fbbfc38b in QTimer::timeout(QTimer::QPrivateSignal) () at /usr/lib/libQt5Core.so.5
#11 0x00007f50fbbee07f in QObject::event(QEvent*) () at /usr/lib/libQt5Core.so.5
#12 0x00007f50fcb4fd62 in QApplicationPrivate::notify_helper(QObject*, QEvent*) () at /usr/lib/libQt5Widgets.so.5
#13 0x00007f50fbbc100a in QCoreApplication::notifyInternal2(QObject*, QEvent*) () at /usr/lib/libQt5Core.so.5
#14 0x00007f50fbc18fad in QTimerInfoList::activateTimers() () at /usr/lib/libQt5Core.so.5
#15 0x00007f50fbc17489 in QEventDispatcherUNIX::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () at /usr/lib/libQt5Core.so.5
#16 0x000055fa9400747e in QUnixEventDispatcherQPA::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) ()
#17 0x00007f50fbbbf97c in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () at /usr/lib/libQt5Core.so.5
#18 0x00007f50fbbc7ee4 in QCoreApplication::exec() () at /usr/lib/libQt5Core.so.5
#19 0x000055fa93fa13c6 in main(int, char**) (argc=<optimized out>, argv=<optimized out>) at /home/lewis/Documents/kwin-arch/kwin/src/kwin/src/main_wayland.cpp:719
(gdb)
Comment 2 Lewis Lakerink 2021-06-05 01:51:49 UTC
I'm not sure if it's related - but sometimes kwin_wayland won't crash - but instead all my xwayland clients die (I guess xwayland itself crashes), and I'm not able to start an xwayland client again without logging out of my session and back in.

Perhaps if the xwayland server crashes midway through a render xwayland related animations lose their EffectWindow?

I don't know how that works internally, I'm merely speculating - and xwayland dying may be unrelated to this ticket.
Comment 3 Vlad Zahorodnii 2021-06-07 06:47:05 UTC
As is, I don't see how kwin can crash in AnimationEffect::postPaintScreen(). EffectWindow should always be valid. Unless xwayland crashes for some reason and no EffectsHandler::windowDeleted signal is emitted.
Comment 4 Lewis Lakerink 2021-06-07 06:51:51 UTC
What can I do to help provide more details/debug info that might help with this?
Comment 5 Lewis Lakerink 2021-06-16 01:00:39 UTC
Annoyingly, I'm still seeing weird behaviour after time with xwayland and sometimes kwin crashing and restarting the session - however, I no longer get coredumps for kwin, and I've no idea why that is. The only evidence I have that a segfault occurred is in dmesg. I've set a huge max size on my core dumps, so no processes should be excluded from dumping, I'm not sure what else to do.
Comment 6 Lewis Lakerink 2021-06-16 23:55:38 UTC
Ok, the issue with core dumps was that my HDD was almost full.

I collected another trace today with the same problem, this time kwin was built from the 5.22 branch - commit: https://invent.kde.org/plasma/kwin/-/commit/9169445b9bbb91f27577baf7ecb247c2e31d5b0a

Core was generated by `kwin_wayland --wayland_fd 4 --xwayland /usr/lib/startplasma-waylandsession'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f200b5e4c4d in KWin::AnimationEffect::postPaintScreen (this=0x55b1bd381e60) at /home/lewis/Documents/kwin-arch/kwin/src/kwin/src/libkwineffects/kwinanimationeffect.cpp:713

warning: Source file is more recent than executable.
b713                        entry.key()->addLayerRepaint(entry->second);
[Current thread is 1 (Thread 0x7f2003fb5240 (LWP 1757))]
(gdb) bt
#0  0x00007f200b5e4c4d in KWin::AnimationEffect::postPaintScreen() (this=0x55b1bd381e60) at /home/lewis/Documents/kwin-arch/kwin/src/kwin/src/libkwineffects/kwinanimationeffect.cpp:713
#1  0x00007f200b8c12c1 in KWin::EffectsHandlerImpl::postPaintScreen() (this=0x55b1bcf2b6d0) at /home/lewis/Documents/kwin-arch/kwin/src/kwin/src/effects.cpp:419
#2  0x00007f200b95843f in KWin::Scene::paintScreen(int*, QRegion const&, QRegion const&, QRegion*, QRegion*, KWin::RenderLoop*, QMatrix4x4 const&)
    (this=this@entry=0x55b1bc6bf580, mask=mask@entry=0x7ffc6f25b034, damage=..., repaint=..., updateRegion=updateRegion@entry=0x7ffc6f25b038, validRegion=validRegion@entry=0x7ffc6f25b040, renderLoop=<optimized out>, projection=<optimized out>) at /home/lewis/Documents/kwin-arch/kwin/src/kwin/src/scene.cpp:209
#3  0x00007f2001510981 in KWin::SceneOpenGL::paint(int, QRegion const&, QList<KWin::Toplevel*> const&, KWin::RenderLoop*) (this=
    0x55b1bc6bf580, screenId=0, damage=<optimized out>, toplevels=<optimized out>, renderLoop=<optimized out>)
    at /home/lewis/Documents/kwin-arch/kwin/src/kwin/src/plugins/scenes/opengl/scene_opengl.cpp:695
#4  0x00007f200b8862f8 in KWin::Compositor::composite(KWin::RenderLoop*) (this=<optimized out>, renderLoop=<optimized out>)
    at /home/lewis/Documents/kwin-arch/kwin/src/kwin/src/composite.cpp:628
#5  0x00007f2009da13bb in  () at /usr/lib/libQt5Core.so.5
#6  0x00007f200b8359d2 in KWin::RenderLoop::frameRequested(KWin::RenderLoop*) (this=<optimized out>, _t1=<optimized out>)
    at /home/lewis/Documents/kwin-arch/kwin/src/build/src/kwin_autogen/EWIEGA46WW/moc_renderloop.cpp:206
#7  0x00007f200b943453 in KWin::RenderLoopPrivate::dispatch() (this=0x55b1bc63f910) at /home/lewis/Documents/kwin-arch/kwin/src/kwin/src/renderloop.cpp:162
#8  0x00007f2009da13bb in  () at /usr/lib/libQt5Core.so.5
#9  0x00007f2009da538b in QTimer::timeout(QTimer::QPrivateSignal) () at /usr/lib/libQt5Core.so.5
#10 0x00007f2009d9707f in QObject::event(QEvent*) () at /usr/lib/libQt5Core.so.5
#11 0x00007f200ab46d62 in QApplicationPrivate::notify_helper(QObject*, QEvent*) () at /usr/lib/libQt5Widgets.so.5
#12 0x00007f2009d6a00a in QCoreApplication::notifyInternal2(QObject*, QEvent*) () at /usr/lib/libQt5Core.so.5
#13 0x00007f2009dc1fad in QTimerInfoList::activateTimers() () at /usr/lib/libQt5Core.so.5
#14 0x00007f2009dc0489 in QEventDispatcherUNIX::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () at /usr/lib/libQt5Core.so.5
#15 0x000055b1bbcfc86e in QUnixEventDispatcherQPA::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) ()
#16 0x00007f2009d6897c in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () at /usr/lib/libQt5Core.so.5
#17 0x00007f2009d70ee4 in QCoreApplication::exec() () at /usr/lib/libQt5Core.so.5
#18 0x000055b1bbc96406 in main(int, char**) (argc=<optimized out>, argv=<optimized out>) at /home/lewis/Documents/kwin-arch/kwin/src/kwin/src/main_wayland.cpp:727
Comment 7 Luis Cáceres 2021-06-23 23:54:35 UTC
Created attachment 139625 [details]
Backtrace

Have the same issue. As in Lewis' case it happens randomly when using CLion and any popup window is diplayed (context menu, code completion popup, etc). I've also had it happen once when opening a new window in Chromium.

I am also attaching a more detailed backtrace.

OS: Fedora 34
Frameworks: 5.83.0-1
Plasma: 5.22.1-1
Comment 8 Lewis Lakerink 2021-09-20 20:45:52 UTC
Possibly related issue: https://bugs.kde.org/show_bug.cgi?id=440094
Comment 9 Aleix Pol 2021-09-21 17:22:45 UTC
*** Bug 440094 has been marked as a duplicate of this bug. ***
Comment 10 Aleix Pol 2021-09-21 17:32:29 UTC
Been going through it, it's crashing on `entry.key()->addLayerRepaint(entry->second);` kwinanimationeffect.cpp:668 in master, matched from Plasma/5.22 branch. I am unable to reproduce or see why it's crashing though.

I tried Clion and it worked seemingly fine, I guess I'd have to use it for longer?

@Lewis, you mentioned that you were using 5.21.90, maybe you could try 5.22.90?

Could you mention which effects you have enabled?
Comment 11 Luis Cáceres 2021-09-21 19:24:30 UTC
It crashes seemingly randomly. Sometimes it takes just a few minutes, sometimes it takes several hours.

For completion, I've been having the issue on all plasma versions 5.22.x. My desktop effects on Fedora 34 with Plasma 5.22.4 are (as reported by `qdbus-qt5 org.kde.KWin /Effects loadedEffects`:

kwin4_effect_windowaperture
kwin4_effect_translucency
kwin4_effect_squash
kwin4_effect_sessionquit
zoom
kwin4_effect_morphingpopups
kwin4_effect_maximize
kwin4_effect_logout
kwin4_effect_login
kwin4_effect_fullscreen
kwin4_effect_frozenapp
kwin4_effect_fadingpopups
kwin4_effect_fade
kwin4_effect_dimscreen
kwin4_effect_dialogparent
slidingpopups
slide
screenshot
desktopgrid
colorpicker
wobblywindows
presentwindows
highlightwindow
blur
contrast
startupfeedback
kscreen
Comment 12 Luis Cáceres 2021-09-21 19:31:19 UTC
Created attachment 141776 [details]
Backtrace on 5.22.4

I'm also attaching a new backtrace I got just today. Hopefully it helps too.
Comment 13 Lewis Lakerink 2021-09-21 20:59:52 UTC
(In reply to Aleix Pol from comment #10)
> @Lewis, you mentioned that you were using 5.21.90, maybe you could try
> 5.22.90?
> 
> Could you mention which effects you have enabled?

Exactly the same symptoms for me with Pycharm and CLion. After some time (sometimes half an hour, sometimes several hours) triggering the intellisense popup will eventually trigger this error. I haven't seen this crash with any other xwayland applications however. 

I've been running kwin master (updated once or twice a week) more or less since I reported this bug, however since 5.22.90 was tagged I've changed to Plasma/5.23.

My effects:
$ qdbus-qt5 org.kde.KWin /Effects loadedEffects
kwin4_effect_windowaperture
kwin4_effect_squash
zoom
kwin4_effect_sessionquit
kwin4_effect_morphingpopups
kwin4_effect_maximize
kwin4_effect_logout
kwin4_effect_login
kwin4_effect_fullscreen
kwin4_effect_frozenapp
kwin4_effect_fadingpopups
kwin4_effect_fade
kwin4_effect_dialogparent
slidingpopups
slide
screenshot
desktopgrid
colorpicker
presentwindows
highlightwindow
blur
contrast
startupfeedback
screentransform
kscreen
Comment 14 Ash Blake 2021-09-21 22:08:18 UTC
Created attachment 141780 [details]
Backtrace from 5.22.90

This is a backtrace from Plasma 5.22.90 from the Arch Linux kde-unstable repo,
with the KWin package rebuilt from source to provide debug symbols.

The crash happened randomly while using Jetbrains IntelliJ IDEA.

My effects setup is mostly default, I only enabled the new Overview effect.
Effect list:
 - kwin4_effect_windowaperture
 - kwin4_effect_squash
 - kwin4_effect_sessionquit
 - kwin4_effect_morphingpopups
 - kwin4_effect_maximize
 - kwin4_effect_logout
 - kwin4_effect_login
 - kwin4_effect_fullscreen
 - kwin4_effect_frozenapp
 - kwin4_effect_fadingpopups
 - kwin4_effect_fade
 - kwin4_effect_dialogparent
 - zoom
 - slidingpopups
 - slide
 - screenshot
 - desktopgrid
 - colorpicker
 - presentwindows
 - overview
 - highlightwindow
 - blur
 - contrast
 - startupfeedback
 - screenedge
 - screentransform
 - kscreen


I analysed the core dump with GDB - the crash happened in kwin4_effect_fade. 
Redacted log from the gdb session (at the innermost stack frame):

  (gdb) set print pretty on
  (gdb) set print object on
  (gdb) p *this

    $1 = (KWin::ScriptedEffect) {
    ...
    members of KWin::ScriptedEffect:
    ...
    m_effectName = {
        d = 0x55f534b8f590
    },
    ...
    }

  (gdb) p this->m_effectName->d

    $3 = (QString::Data *) 0x55f534b8f590

  (gdb) set $data_start = (char*)this->m_effectName->d + this->m_effectName->d->offset
  (gdb) set $len = this->m_effectName->d->size
  (gdb) set $i = 0
  (gdb) while $i < 2*$len
           >printf "%c", *($data_start + $i)
           >set $i = $i + 2
           >end

    kwin4_effect_fade

When the crash happens again, I'll check if it was the same effect.
If so, I'll disable it and test some more.

Some information about that EffectWindow pointer:
  (gdb) p *(entry.i->key)

    $45 = {
      <QObject> = {<No data fields>}, 
      members of KWin::EffectWindow:
      static staticMetaObject = {
        d = {
          superdata = {
            direct = 0x7f9fae97c740 <QObject::staticMetaObject>
          },
          stringdata = 0x7f9fafda5ae0 <qt_meta_stringdata_KWin__EffectWindow>,
          data = 0x7f9fafda55a0 <qt_meta_data_KWin__EffectWindow>,
          static_metacall = 0x7f9fafd8ced0 <KWin::EffectWindow::qt_static_metacall(QObject*, QMetaObject::Call, int, void**)>,
          relatedMetaObjects = 0x0,
          extradata = 0x0
        }
      },
      d = {
        d = 0x20a00000282
      }
    }

  (gdb) p &(entry.i->key->d)
    $33 = (QScopedPointer<KWin::EffectWindow::Private, QScopedPointerDeleter<KWin::EffectWindow::Private> > *) 0x55f53571a4d0

  (gdb) p entry.i->key->d.d
    $34 = (KWin::EffectWindow::Private *) 0x20a00000282

  (gdb) p *(entry.i->key->d.d)
    Cannot access memory at address 0x20a00000282
Comment 15 Ash Blake 2021-09-22 00:16:42 UTC
Additional information about that EffectWindow, as I forgot it's actually an instance of EffectWindowImpl and didn't dump members of the latter. 
Both the toplevel and sw pointers lead to inaccessible memory addresses.

(gdb) p *(KWin::EffectWindowImpl *)entry.i->key
...
  members of KWin::EffectWindowImpl:
  static staticMetaObject = {
    d = {
      superdata = {
        direct = 0x7f9fafdb2900 <KWin::EffectWindow::staticMetaObject>
      },
      stringdata = 0x7f9fb01ae880 <qt_meta_stringdata_KWin__EffectWindowImpl>,
      data = 0x7f9fb01ae840 <qt_meta_data_KWin__EffectWindowImpl>,
      static_metacall = 0x7f9faff9e1a0 <KWin::EffectWindowImpl::qt_static_metacall(QObject*, QMetaObject::Call, int, void**)>,
      relatedMetaObjects = 0x0,
      extradata = 0x0
    }
  },
  toplevel = 0x2390000043d,
  sw = 0x20a00000282,
  dataMap = {
    {
      d = 0x2390000043d,
      e = 0x2390000043d
    }
  },
  managed = false,
  waylandClient = false,
  x11Client = false
...
Comment 16 Ash Blake 2021-09-22 12:17:06 UTC
Okay, it crashed again. The backtrace looks the same as the last one.
The crash is again in kwin4_effect_fade:

  (gdb) p this
    $3 = (KWin::ScriptedEffect * const) 0x55f805fa1440
  (gdb) set $data_start = (char*)this->m_effectName->d + this->m_effectName->d->offset
  (gdb) set $len = this->m_effectName->d->size
  (gdb) set $i = 0
  (gdb) while $i < 2*$len
       >printf "%c", *($data_start + $i)
       >set $i = $i + 2
       >end
    kwin4_effect_fade


I'll test the Glide window open/close animation now, as it's
a native one and may behave differently. I suppose there's no
point in testing Scale as it's a scripted effect and it will 
probably have the same problem Fade has.
Comment 17 Ash Blake 2021-09-22 12:39:22 UTC
This time, however, it seems EffectWindow was not completely destroyed.

Unlike in the last crash, the data inside EffectWindow is not complete garbage.
For instance, the addresses in toplevel and EffectWindow::Private pointers look
more sensible, dataMap contains valid pointers to QHashData::shared_null, and
x11Client is true.

  (gdb) p *(EffectWindowImpl*)(entry.i->key)
    $10 = {
      <KWin::EffectWindow> = {
        <QObject> = {<No data fields>}, 
        members of KWin::EffectWindow:
        static staticMetaObject = {
          ...
        },
        d = {
          d = 0x55f806b0fa80
        }
      }, 
      members of KWin::EffectWindowImpl:
      static staticMetaObject = {
        ...
      },
      toplevel = 0x55f80706bcf0,
      sw = 0x0,
      dataMap = {
        {
          d = 0x7fb16aa6e5c0 <QHashData::shared_null>,
          e = 0x7fb16aa6e5c0 <QHashData::shared_null>
        }
      },
      managed = true,
      waylandClient = false,
      x11Client = true
    }

Inspection of toplevel:

  (gdb) p *(Toplevel*)$ew.toplevel
    {
      <QObject> = {<No data fields>}, 
      members of KWin::Toplevel:
      static staticMetaObject = {
        ...
      },
      m_frameGeometry = {
        x1 = 640,
        y1 = 547,
        x2 = 1030,
        y2 = 654
      },
      m_clientGeometry = {
        x1 = 640,
        y1 = 547,
        x2 = 1030,
        y2 = 654
      },
      m_bufferGeometry = {
        x1 = 640,
        y1 = 547,
        x2 = 1030,
        y2 = 654
      },
      m_visual = 59,
      bit_depth = 24,
      info = 0x55f806a48670,
      ready_for_painting = true,
      m_internalFBO = {
        value = 0x0,
        d = 0x0
      },
      m_internalImage = <incomplete type>,
      m_internalId = {
        data1 = 3760014763,
        data2 = 36097,
        data3 = 19896,
        data4 = "\246\035\236h\342\216t\223"
      },
      m_client = {
        m_window = 18875715,
        m_destroy = false,
        m_logicGeometry = {
          x1 = 0,
          y1 = 0,
          x2 = -1,
          y2 = -1
        }
      },
      is_shape = false,
      effect_window = 0x0,
      m_shadow = 0x0,
      resource_name = {
        d = 0x55f806b26260
      },
      resource_class = {
        d = 0x55f805f58430
      },
      m_clientMachine = 0x55f806442ca0,
      m_wmClientLeader = 18874376,
      opaque_region = {
        d = 0x7fb16b3071e0 <QRegion::shared_empty>
      },
      m_shapeRegion = {
        d = 0x55f806cc0a30
      },
      m_shapeRegionIsValid = true,
      m_output = 0x55f805b4d020,
      m_skipCloseAnimation = false,
      m_surfaceId = 0,
      m_surface = 0x0,
      m_screenScale = 1,
      m_opacity = 1,
      m_stackingOrder = 5
    }
Comment 18 Ash Blake 2021-09-22 14:34:27 UTC
I didn't get a crash so far, but I revisited the previous coredump, as it seems
weird that the EffectWindow was at least a bit intact this time.

Disassembly of the current instruction and the one before it:
    0x7fb16c02e4c2 <...postPaintScreenEv+1682>      mov    (%rdi),%rax
  > 0x7fb16c02e4c5 <...postPaintScreenEv+1685>      call   *0x90(%rax)
The RDI register contained the address of the EffectWindow, same as in entry.key

EffectWindow::addLayerRepaint is virtual, so the address of the actual function
EffectWindowImpl::addLayerRepaint has to be read from the vtable.

(In this case) the first few bytes of EffectWindow should contain a vtable ptr, 
which is then read into RAX by 'mov (%rdi),%rax'.

The EffectWindowImpl::addLayerRepaint function pointer is then expected to 
exist at RAX + 0x90.

It looks like the vtable pointer points to an invalid location:
  (gdb) p/x $rax
    $23 = 0x55f806f2e
  (gdb) x/x $rax
    0x55f806f2e:    Cannot access memory at address 0x55f806f2e

And vtable+0x90 is also unreadable:
  (gdb) x/x $rax+0x90
    0x55f806fbe:    Cannot access memory at address 0x55f806fbe

In the coredump from yesterday, the situation is the same:
    0x7f9fafd984c2 <...postPaintScreenEv+1682>      mov    (%rdi),%rax
  > 0x7f9fafd984c5 <...postPaintScreenEv+1685>      call   *0x90(%rax)

  (gdb) p/x $rax
    $1 = 0x55f06b86305a
  (gdb) x/x $rax
    0x55f06b86305a: Cannot access memory at address 0x55f06b86305a
  (gdb) x/x $rax+0x90
    0x55f06b8630ea: Cannot access memory at address 0x55f06b8630ea


Yesterday the EffectWindow was destroyed completely, and the member variables
looked pretty much random. 

In today's crash it seems like the EffectWindow was in the process of getting 
deleted, as some of the member variables looked intact. The vtable however 
already got corrupted, which made the call result in a segfault.
Comment 19 Ash Blake 2021-09-22 19:57:54 UTC
Created attachment 141801 [details]
Code completion popup window IDs in the debug console

Update: It appears that the crashes are highly correlated with the amount of 
popup windows opened in total during a session of coding.

Each new popup causes an increment in the hexadecimal window ID and the window's 
caption (win1, win2, win3, etc.) seen in the KWin debug console.

For me, the crash happens somewhere around win300. 

I reproduced the crash three times in a row by typing a name of some object then 
repeatedly mashing dot and backspace keys, so that the autocompletion popups 
flash rapidly. 

Because the window IDs increase so quickly and the problem happens around a similar 
amount of open popup windows, could this be some sort of overflow problem? 

Maybe this could explain the weird corruption described in my previous two comments, 
where some fields of the EffectWindow even looked sensible, but the vtable located 
at the beginning of memory allocated for the EffectWindow was completely ruined.
Comment 20 Ash Blake 2021-09-22 20:51:21 UTC
This short script can be used to trigger a crash:

#!/bin/bash
win_id=$(sort <(xdotool search --name "Content window") <(xdotool search --class "jetbrains-idea") | uniq -d)
sleep 10
while :
do
  xdotool key --window $win_id period
  sleep 0.1
  xdotool key --window $win_id BackSpace
done

After opening a project in IDEA and typing something like 'System' in a line, 
run the script and switch back to the IntelliJ IDEA window. 

For me, with automated keymashing it takes 420-450 popups to crash the Wayland
session. On X11, it's completely stable even though the window identifiers also
rapidly increase. I terminated the test after around 1000 popups.
Comment 21 Ash Blake 2021-09-22 22:47:02 UTC
This time KWin didn't crash completely, but entered a weird state. 
Plasmashell stopped working, but most of the applications were still running, 
including IDEA. I stopped the test, because if it continued it would most likely
result in a crash. 

It is not possible to restart plasmashell right now, nor launch any new program.


These errors appeared in wayland-session.log when plasmashell crashed:

  file descriptor expected, object (308), message get_icon(7h)
  error in client communication (pid 496515)
  QMetaProperty::read: Unable to handle unregistered datatype 'KWin::SessionState' 
  for property 'KWin::EffectsHandlerImpl::sessionState'
  wl_display@1: error 1: invalid arguments for org_kde_plasma_window@308.get_icon


This appeared in terminal when trying to restart plasmashell:

  wl_display@1: error 1: invalid arguments for wl_shm@81.create_pool
  The Wayland connection experienced a fatal error: Invalid argument


And around the same time, this appeared in wayland-session.log:

  file descriptor expected, object (81), message create_pool(nhi)
  error in client communication (pid 498931)
  QMetaProperty::read: Unable to handle unregistered datatype 'KWin::SessionState' 
  for property 'KWin::EffectsHandlerImpl::sessionState'


While typing this message, Firefox and a bunch of other things have crashed.
The following got logged to wayland-session.log:

  file descriptor expected, object (30), message add(huuuuu)
  error in client communication (pid 499807)
  wl_display@1: error 1: invalid arguments for zwp_linux_buffer_params_v1@30.add
  [266 00:35:02.451499] [glfw error 65544]: Wayland: fatal display error: Invalid argument
  file descriptor expected, object (30), message add(huuuuu)
  error in client communication (pid 499827)
  wl_display@1: error 1: invalid arguments for zwp_linux_buffer_params_v1@30.add
  [266 00:35:06.638735] [glfw error 65544]: Wayland: fatal display error: Invalid argument
  file descriptor expected, object (30), message add(huuuuu)
  error in client communication (pid 499851)
  file descriptor expected, object (30), message add(huuuuu)
  error in client communication (pid 499907)
  file descriptor expected, object (30), message add(huuuuu)
  error in client communication (pid 499954)
  wl_display@1: error 1: invalid arguments for zwp_linux_buffer_params_v1@30.add
  [266 00:35:18.863341] [glfw error 65544]: Wayland: fatal display error: Invalid argument
  file descriptor expected, object (63), message add(huuuuu)
  error in client communication (pid 499370)
  file descriptor expected, object (7), message create_pool(nhi)
  error in client communication (pid 500198)
  file descriptor expected, object (7), message create_pool(nhi)
  error in client communication (pid 500233)
  kwin_core: Could not trust "/usr/bin/plasma-browser-integration-host" sha "" ""


This is followed by 6649 repetitions of this line:

  failed to accept: Too many open files
Comment 22 Ash Blake 2021-09-22 23:32:44 UTC
Created attachment 141806 [details]
Screencast demonstrating the rising amount of open files by KWin

Regarding the previous error about too many open files:

This screencast shows the count of open files during the popup window test.
It is steadily rising, and does not decrease after stopping the test, nor
after quitting IntelliJ IDEA completely. 
If the test gets stopped at, say, 1000 open fds, after resuming it
a crash happens pretty quickly.

The screencast ends when KWin crashes, which happens after around 1400
file descriptors get opened.

In the following message I will attach a .tar.gz archive containing the
output of lsof -p $KWIN_PID taken every second since starting the test.

The offending descriptors are mostly pipes.
Comment 23 Ash Blake 2021-09-22 23:35:34 UTC
Created attachment 141807 [details]
Outputs of lsof -p $KWIN_PID taken every second, from the test start to KWin crash

Additional logs for #22
Comment 24 Ash Blake 2021-09-23 00:10:11 UTC
Created attachment 141808 [details]
Timelapse of lsof count, after increasing the open fd limit to 8192

This shows the test running for much longer (5 minutes), after max file descriptors 
for KWin were increased. It could run for a bit longer as fds were not exhausted yet. 
The interesting thing is how again plasmashell crashed early.

This proves that things just go crazy after the file descriptor limit is exhausted,
and there is no direct problem in AnimationEffect. KWin crashes in that place, but
the real problem is the file descriptor leak and all kinds of chaotic behaviour it 
can cause.


Are there any logs I can provide to help debug this issue?
Comment 25 Ash Blake 2021-09-23 07:37:10 UTC
This is pretty weird. I just tried running IDEA in KWin only, started from a TTY.
The file descriptor amount does not skyrocket, and everything is stable.

It also works fine in nested KWin inside a Plasma session.
Seems like some other Plasma component is causing KWin to leak file descriptors.

I tried killing various combinations of Plasma processes, and it looks like killing
plasmashell, xdg-desktop-portal, and xdg-desktop-portal-kde helps.
KWin does not leak file descriptors anymore, and it does not crash.

I guess plasmashell and xdg-desktop-portal-kde could be trying to read some information
about the popup window, but failure might not be handled correctly and new pipe descriptors 
get created over and over during each retry attempt until KWin crashes.
Comment 26 Lewis Lakerink 2021-09-23 23:29:33 UTC
Firstly can I just say Ash, thanks for your in depth analysis! It's been super educational for me to see your process and how you use GDB to debug the issue. Nice find with the fds, I can confirm that that is the same for me. Watching the fd count while opening/closing popups in pycharm shows that fds are increasing until they reach ~the soft limit of 1024 on my system before plasma goes crazy/kwin crashes. I've just set the soft/hard fd limit to a million and will see if that prevents pycharm/clion crashing kwin as regularly. I'll post my findings after some testing. (Less than ideal workaround for sure, but if that temporarily resolves this problem until it can be fixed - I can live with that!)

>   wl_display@1: error 1: invalid arguments for
> zwp_linux_buffer_params_v1@30.add
>   [266 00:35:18.863341] [glfw error 65544]: Wayland: fatal display error:
> Invalid argument

Looks exactly like the same symptoms I had for https://bugs.kde.org/show_bug.cgi?id=438381, so I'd go so far as to say that issue is related to this issue.
Comment 27 Ash Blake 2021-09-24 00:39:17 UTC
The file descriptor flood comes from PlasmaWindow::Private::iconChangedCallback
(kwayland/src/client/plasmawindowmanagement.cpp:655, master branch).
This can be easily verified by catching pipe2 in the plasmashell process - 
the backtrace will show this function. Occurences of pipe2 in the strace
output for plasmashell correlate with the amount of KWin's file descriptors.

After calling the org_kde_plasma_window_get_icon interface, the FDs get copied
over to KWin's process using Wayland's proxy magic, which seems to be using 
the sendmsg and recvmsg syscalls with SCM_RIGHTS ancillary messages, which 
enable passing open descriptors between processes.

The function which gets called after marshaling the FDs by the Wayland's proxy 
thingy is PlasmaWindowInterfacePrivate::org_kde_plasma_window_get_icon
(kwayland-server/src/server/plasmawindowmanagement_interface.cpp:437)

Things get really odd now. Both of those functions look fine to me - it doesn't 
look like either the server side or the client side would leave a descriptor 
open after finishing its work. I walked through both the icon read and write in 
GDB, and everything was seemingly handled correctly for the cases I observed.

Despite no errors, many of these supposedly closed descriptors did still appear 
in /proc/$KWIN_PID/fd. I have no idea what could be going on.  

For a test, I replaced  org_kde_plasma_window_get_icon with a function that only 
calls close(fd). Somehow, the KWin process still had rapidly increasing amounts 
of FDs, even though this time it was just supposed to close them ASAP.
Comment 28 Lewis Lakerink 2021-09-24 00:49:22 UTC
Ok, so setting the hard/soft fd limit to a high value appears to be an effective solution for me for now until this issue is resolved.

Using Ash's xdotool script modified for pycharm, I got the kwin fd count up over 50,000 (took ~35 minutes) and no kwin crash, plasmashell is also still working fine before I stopped the xdotool script.

I measured the fd's open by kwin using watch -n 1 "sudo ls -la /proc/`pidof kwin_wayland`/fd | wc -l"

$ ulimit -Sn
1048576
$ ulimit -Hn
1048576
$ sudo ls -la /proc/`pidof kwin_wayland`/fd | wc -l
50074

For posterity sake, I changed my limits by editing /etc/security/limits.conf and adding the following lines to the end and rebooting.

lewis           soft    nofile 1048576
lewis           hard    nofile 1048576
Comment 29 Ash Blake 2021-09-24 14:56:42 UTC
Created attachment 141874 [details]
A C program that reproduces the bug

I figured out which behaviour of the Jetbrains IDEs causes the bug to occur and I wrote a program that replicates it, so the bug can be reproduced reliably and without the need to install any software.

This program will repeatedly create a specified amount of windows, waiting some time before destroying the current window and creating the next one. After that, it will wait a specified amount of time and repeat the whole thing.

The amount of windows and the time constants are optional program arguments. The default settings are supposed to replicate a realistic scenario that could happen during the IDE usage. They are as follows:
  - (arg1) burst size: 2
  - (arg2) window lifetime (ms): 16
  - (arg3) post-burst wait time (ms): 3000
The short burst length of 2 was motivated by observing the popup windows flicker without any content sometimes, before the actual popup appeared.

On my machine, this burst size and window lifetime causes the KWin's file descriptor count to increase by 3 each repetition, same as when a popup is opened normally in a Jetbrains IDE. Of course, if such settings do not cause bug reproduction in your environment, start by increasing the burst size. If the bug still isn't reproduced with a high amount of created windows, start decreasing the window lifetime.

Some observations: 
1. If the window lifetime is large enough to reliably observe the window's icon appearing on the taskbar, this bug does not occur. On my machine, a window lifetime of 60ms is the minimum one that does not cause KWin's file descriptor amount to notably rise at any point in time, even with a burst size of 99999 windows.
2. Window lifetimes less or equal to 4ms will cause KWin to lag, up to the point of a complete freeze (which resolves during the wait time). Of course this is a terribly unrealistic scenario and no application would ever do that, but watch out when testing.


Compilation:
$ gcc reproduce_438097.c -lX11 -o reproduce_438097

Usage
$ ./reproduce_438097 [burst size] [window lifetime] [post-burst wait time]
Comment 30 Ash Blake 2021-09-24 15:17:50 UTC
> The short burst length of 2 was motivated by observing the popup windows
> flicker without any content sometimes, before the actual popup appeared.

Post scriptum: in fact, when GDB is attached to plasmashell in non-stop mode and a breakpoint is set in PlasmaWindow::Private::iconChangedCallback with commands specified so that it does not pause execution, for instance:

  break PlasmaWindow::Private::iconChangedCallback
  commands
  silent
  printf "hello\n"
  end

, we can observe the string "hello" appearing twice when a single popup window appears, and also twice when a single popup disappears, suggesting there was another window that appeared for several miliseconds before getting destroyed.
Comment 31 Ash Blake 2021-09-24 15:19:56 UTC
(In reply to Ash Blake from comment #30)
>   break PlasmaWindow::Private::iconChangedCallback
>   commands
>   silent
>   printf "hello\n"
>   end

Oops, forgot a continue there.
  break PlasmaWindow::Private::iconChangedCallback
  commands
  silent
  printf "hello\n"
  cont
  end
Comment 32 Ash Blake 2021-09-24 16:36:17 UTC
Created attachment 141875 [details]
A Python script that parses strace output for FD information

I wrote a script to parse strace output and abbreviate it, displaying only close() calls and recvmsg() calls, but filtered by cmsg_type=SCM_RIGHTS and abbreviated so that only the cmsg_data part containing newly received file descriptors is visible.

After terminating the script with Ctrl+C, it will display all the file descriptors that have been received by the KWin process, but not closed.

This is the output in a bug-free situation (the reproducing program was ran with a burst size of 20 and window lifetime of 100ms, which does not trigger the bug)

    $ sudo strace -e trace=recvmsg,close -p `pidof kwin_wayland` 2>&1 | python process_strace.py
    close(5670)                             = 0
    recvmsg(44,...,cmsg_data=[5670],...) = 152
    close(5670)                             = 0
    recvmsg(48,...,cmsg_data=[5670, 5675],...) = 16
    recvmsg(44,...,cmsg_data=[5670],...) = 152
    close(5670)                             = 0
    recvmsg(48,...,cmsg_data=[5670, 5675],...) = 16
    recvmsg(44,...,cmsg_data=[5670],...) = 152
    close(5670)                             = 0
    recvmsg(48,...,cmsg_data=[5670, 5675],...) = 16
    recvmsg(44,...,cmsg_data=[5670],...) = 152
    close(5670)                             = 0
    recvmsg(48,...,cmsg_data=[5670, 5675],...) = 16
    recvmsg(44,...,cmsg_data=[5670],...) = 152
    close(5670)                             = 0
    recvmsg(48,...,cmsg_data=[5670, 5675],...) = 16
    recvmsg(44,...,cmsg_data=[5670],...) = 152
    close(5670)                             = 0
    recvmsg(48,...,cmsg_data=[5670, 5675],...) = 16
    recvmsg(44,...,cmsg_data=[5670],...) = 152
    close(5670)                             = 0
    recvmsg(48,...,cmsg_data=[5670, 5675],...) = 16
    recvmsg(44,...,cmsg_data=[5670],...) = 152
    close(5670)                             = 0
    recvmsg(48,...,cmsg_data=[5670, 5675],...) = 16
    recvmsg(44,...,cmsg_data=[5670],...) = 152
    close(5670)                             = 0
    recvmsg(48,...,cmsg_data=[5670, 5675],...) = 16
    recvmsg(44,...,cmsg_data=[5670],...) = 152
    close(5670)                             = 0
    recvmsg(48,...,cmsg_data=[5670, 5675],...) = 16
    recvmsg(44,...,cmsg_data=[5670],...) = 152
    close(5670)                             = 0
    recvmsg(48,...,cmsg_data=[5670, 5675],...) = 16
    recvmsg(44,...,cmsg_data=[5670],...) = 152
    close(5670)                             = 0
    recvmsg(48,...,cmsg_data=[5670, 5675],...) = 16
    recvmsg(44,...,cmsg_data=[5670],...) = 152
    close(5670)                             = 0
    recvmsg(48,...,cmsg_data=[5670, 5675],...) = 16
    recvmsg(44,...,cmsg_data=[5670],...) = 152
    close(5670)                             = 0
    recvmsg(48,...,cmsg_data=[5670, 5675],...) = 16
    recvmsg(44,...,cmsg_data=[5670],...) = 152
    close(5670)                             = 0
    recvmsg(48,...,cmsg_data=[5670, 5675],...) = 16
    recvmsg(44,...,cmsg_data=[5670],...) = 152
    close(5670)                             = 0
    recvmsg(48,...,cmsg_data=[5670, 5675],...) = 16
    recvmsg(44,...,cmsg_data=[5670],...) = 152
    close(5670)                             = 0
    recvmsg(48,...,cmsg_data=[5670, 5675],...) = 16
    recvmsg(44,...,cmsg_data=[5670],...) = 152
    close(5670)                             = 0
    recvmsg(48,...,cmsg_data=[5670],...) = 84
    recvmsg(48,...,cmsg_data=[5670],...) = 8
    recvmsg(44,...,cmsg_data=[5670],...) = 152
    close(5670)                             = 0
    recvmsg(48,...,cmsg_data=[5670, 5675],...) = 16
    recvmsg(44,...,cmsg_data=[5670],...) = 152
    close(5670)                             = 0
    recvmsg(48,...,cmsg_data=[5670, 5675],...) = 16
    ^CThese descriptors were left open:
    5670, 5675

The mentioned descriptors have been however closed shortly after.


This is the output with also 20 windows, but a lifetime of 16ms:
    $ sudo strace -e trace=recvmsg,close -p `pidof kwin_wayland` 2>&1 | python process_strace.py
    recvmsg(44,...,cmsg_data=[5696],...) = 152
    close(5696)                             = 0
    recvmsg(48,...,cmsg_data=[5696, 5697],...) = 16
    recvmsg(44,...,cmsg_data=[5696],...) = 152
    close(5696)                             = 0
    recvmsg(48,...,cmsg_data=[5696, 5697],...) = 16
    recvmsg(44,...,cmsg_data=[5700],...) = 152
    close(5700)                             = 0
    recvmsg(48,...,cmsg_data=[5700, 5701],...) = 16
    recvmsg(44,...,cmsg_data=[5700],...) = 152
    close(5700)                             = 0
    recvmsg(44,...,cmsg_data=[5700],...) = 152
    close(5700)                             = 0
    recvmsg(48,...,cmsg_data=[5700, 5701],...) = 16
    recvmsg(44,...,cmsg_data=[5702],...) = 152
    close(5702)                             = 0
    recvmsg(44,...,cmsg_data=[5702],...) = 152
    close(5702)                             = 0
    recvmsg(48,...,cmsg_data=[5702, 5703],...) = 16
    recvmsg(44,...,cmsg_data=[5704],...) = 152
    close(5704)                             = 0
    recvmsg(44,...,cmsg_data=[5704],...) = 152
    close(5704)                             = 0
    recvmsg(44,...,cmsg_data=[5704],...) = 152
    close(5704)                             = 0
    recvmsg(44,...,cmsg_data=[5704],...) = 152
    close(5704)                             = 0
    recvmsg(44,...,cmsg_data=[5704],...) = 152
    close(5704)                             = 0
    recvmsg(44,...,cmsg_data=[5704],...) = 152
    close(5704)                             = 0
    recvmsg(44,...,cmsg_data=[5704],...) = 152
    close(5704)                             = 0
    recvmsg(44,...,cmsg_data=[5704],...) = 152
    close(5704)                             = 0
    recvmsg(44,...,cmsg_data=[5704],...) = 152
    close(5704)                             = 0
    recvmsg(44,...,cmsg_data=[5704],...) = 152
    close(5704)                             = 0
    recvmsg(44,...,cmsg_data=[5704],...) = 152
    close(5704)                             = 0
    recvmsg(44,...,cmsg_data=[5704],...) = 152
    close(5704)                             = 0
    recvmsg(44,...,cmsg_data=[5704],...) = 152
    recvmsg(48,...,cmsg_data=[5705, 5706],...) = 16
    close(5704)                             = 0
    ^CThese descriptors were left open:
    5696, 5697, 5700, 5701, 5702, 5703, 5705, 5706


All of these descriptors were open after waiting a few seconds and checking, and are still open while I'm typing this.
Comment 33 Vlad Zahorodnii 2021-09-24 16:45:47 UTC
Thanks for the great analysis, Ash! This is definitely very strange. At quick glance, I don't see how file descriptors can be leaked in kwin.
Comment 34 Ash Blake 2021-09-24 16:58:53 UTC
(In reply to Vlad Zahorodnii from comment #33)
> Thanks for the great analysis, Ash! This is definitely very strange. At
> quick glance, I don't see how file descriptors can be leaked in kwin.

Well, it doesn't seem to be KWin that leaks them. It happens along the way in libwayland.

This one is perhaps more useful - same test as above, but only one window gets created with a 16ms lifetime. It appears this is actually enough to induce the bug, and patterns are more visible.

  close(4071)                             = 0
  recvmsg(44,...,cmsg_data=[4071],...) = 152
  close(4071)                             = 0
  recvmsg(44,...,cmsg_data=[4034],...) = 152
  close(4034)                             = 0
  recvmsg(48,...,cmsg_data=[4034, 4035],...) = 16
  recvmsg(44,...,cmsg_data=[4071],...) = 152
  close(4071)                             = 0
  recvmsg(48,...,cmsg_data=[4071, 4079],...) = 16
  recvmsg(44,...,cmsg_data=[4150],...) = 152
  close(4150)                             = 0
  recvmsg(48,...,cmsg_data=[4150, 4608],...) = 16
  recvmsg(44,...,cmsg_data=[4610],...) = 152
  close(4610)                             = 0
  recvmsg(48,...,cmsg_data=[4610, 4611],...) = 16
  recvmsg(44,...,cmsg_data=[4612],...) = 152
  close(4612)                             = 0
  recvmsg(48,...,cmsg_data=[4612, 4615],...) = 16
  recvmsg(44,...,cmsg_data=[4618],...) = 152
  close(4618)                             = 0
  recvmsg(48,...,cmsg_data=[4618, 4619],...) = 16
  recvmsg(44,...,cmsg_data=[4620],...) = 152
  close(4620)                             = 0
  recvmsg(48,...,cmsg_data=[4620, 4621],...) = 16
  ^CThese descriptors were left open:
  4034, 4035, 4071, 4079, 4150, 4608, 4610, 4611, 4612, 4615, 4618, 4619, 4620, 4621

KWin does close every descriptor it receives. However, sometimes the process does not receive one descriptor, but two - and KWin itself is not aware of the second one, nor is it supposed to.


This is the bug-free scenario (100ms lifetime):
  close(4622)                             = 0
  recvmsg(44,...,cmsg_data=[4622],...) = 152
  close(4622)                             = 0
  recvmsg(48,...,cmsg_data=[4622, 5710],...) = 16
  recvmsg(44,...,cmsg_data=[4622],...) = 152
  close(4622)                             = 0
  recvmsg(48,...,cmsg_data=[4622, 5710],...) = 16
  recvmsg(44,...,cmsg_data=[4622],...) = 152
  close(4622)                             = 0
  recvmsg(48,...,cmsg_data=[4622, 5710],...) = 16
  recvmsg(44,...,cmsg_data=[4622],...) = 152
  close(4622)                             = 0
  recvmsg(48,...,cmsg_data=[4622, 5710],...) = 16
  recvmsg(44,...,cmsg_data=[4622],...) = 152
  close(4622)                             = 0
  recvmsg(48,...,cmsg_data=[4622, 5710],...) = 16
  recvmsg(44,...,cmsg_data=[4622],...) = 152
  close(4622)                             = 0
  recvmsg(48,...,cmsg_data=[4622, 5710],...) = 16
  recvmsg(44,...,cmsg_data=[4622],...) = 152
  close(4622)                             = 0
  recvmsg(48,...,cmsg_data=[4622, 5710],...) = 16
  ^CThese descriptors were left open:
  4622, 5710

(And those two descriptors got closed shortly after)

Here the same thing happens, however things happen slowly enough so that the descriptor can get reused, hence the fd amount is not rising. 

It seems like this is not KWin's fault, but it's Wayland that is doing something weird when marshaling those descriptors. org_kde_plasma_window_get_icon is supposed to accept one file descriptor, and plasmashell does give it exactly one descriptor. Things get messed up along the way, and this is not an issue anywhere in the KDE code.

While researching the topic of SCM_RIGHTS, I stumbled upon this link:
https://gist.github.com/kentonv/bc7592af98c68ba2738f4436920868dc

This part sounds like it could be a problem here:
> However, as always, recvmsg() calls on the receiving end don't necessarily 
> map 1:1 to sendmsg() calls. Messages can be coalesced or split.

Sounds like things can get mixed up when messages are getting sent fast enough.
Comment 35 Vlad Zahorodnii 2021-09-24 16:59:02 UTC
Aha, I know what it is... It's a bug in our handling of inert resources.

If a plasma window is deleted, our wrapper code won't call the handler and therefore the file descriptor will be leaked.
Comment 36 Vlad Zahorodnii 2021-09-24 17:10:51 UTC
Created attachment 141876 [details]
Potential solution (untested)

If you apply the attached patch, does the issue go away?
Comment 37 Ash Blake 2021-09-24 17:44:08 UTC
(In reply to Vlad Zahorodnii from comment #36)
> Created attachment 141876 [details]
> Potential solution (untested)
> 
> If you apply the attached patch, does the issue go away?

Yes, I just tested it with 1000 windows and there are no leaked descriptors. In the strace output, there are now close() calls for every FD. When using a Jetbrains IDE everything works fine as well.
Comment 38 Lewis Lakerink 2021-09-24 20:48:17 UTC
+1, your patch prevents the uncontrolled growth of fds for me Vlad!

Thanks!
Comment 39 Bug Janitor Service 2021-09-27 09:42:48 UTC
A possibly relevant merge request was started @ https://invent.kde.org/plasma/kwayland-server/-/merge_requests/299
Comment 40 Vlad Zahorodnii 2021-09-27 18:58:49 UTC
Git commit ca07c765f50f37bdf5d213234ef6c012af794587 by Vlad Zahorodnii.
Committed on 27/09/2021 at 18:02.
Pushed by vladz into branch 'master'.

Automatically close fds for inert resources

If the plasma window resource is inert, the associated get_icon handler
won't be invoked. This results in leaked file descriptors.

With this change, the corresponding close() function call will be
generated to avoid leaking file descriptors and potentially crashing the
session due to kwin running out of fds.

M  +7    -0    src/tools/qtwaylandscanner.cpp

https://invent.kde.org/plasma/kwayland-server/commit/ca07c765f50f37bdf5d213234ef6c012af794587
Comment 41 Vlad Zahorodnii 2021-09-27 19:23:28 UTC
Git commit 208b7bb4ed911913ba76b73370310703e0032e81 by Vlad Zahorodnii.
Committed on 27/09/2021 at 19:23.
Pushed by vladz into branch 'Plasma/5.23'.

Automatically close fds for inert resources

If the plasma window resource is inert, the associated get_icon handler
won't be invoked. This results in leaked file descriptors.

With this change, the corresponding close() function call will be
generated to avoid leaking file descriptors and potentially crashing the
session due to kwin running out of fds.


(cherry picked from commit ca07c765f50f37bdf5d213234ef6c012af794587)

M  +7    -0    src/tools/qtwaylandscanner.cpp

https://invent.kde.org/plasma/kwayland-server/commit/208b7bb4ed911913ba76b73370310703e0032e81
Comment 42 Lewis Lakerink 2021-09-27 20:16:50 UTC
*** Bug 438381 has been marked as a duplicate of this bug. ***