Bug 487043 - Extreme stutters/hangs when using certain desktop effects when "~/.cache" is on slow storage
Status: RESOLVED FIXED
Alias: None
Product: kwin
Classification: Plasma
Component: effects-various
Version: 6.0.4
Platform: Other Linux
Importance: NOR normal
Target Milestone: ---
Assignee: KWin default assignee
Reported: 2024-05-15 03:14 UTC by Ritchie Frodomar
Modified: 2024-06-08 02:11 UTC
Version Fixed In: 6.1


Attachments
Logs of different Kwin hangs happening from both Kwin's perspective and the kernel's. (791.91 KB, application/gzip)
2024-05-15 03:14 UTC, Ritchie Frodomar
Description Ritchie Frodomar 2024-05-15 03:14:19 UTC
Created attachment 169491 [details]
Logs of different Kwin hangs happening from both Kwin's perspective and the kernel's.

SUMMARY
If your "~/.cache" directory is stored on slow storage, such as a spinning disk or an LVM pool, using QML-based desktop effects like Tiling Editor and Alt+Tab causes extreme multi-second Kwin hangs.


STEPS TO REPRODUCE
1. Add some kind of slow storage to your system (spinning HDD, LVM pool made of HDDs, slow network filesystem, etc.)
2. Move "~/.cache" to the slow storage medium and symlink it. Alternatively, move your entire /home to the slow storage device.
3. Bring up Tiling Editor with Meta+T.
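
Step 2 can be sketched as a small shell function. This is only an illustration: `relocate_dir` is a hypothetical helper name, and `/mnt/slow/cache` in the usage line is an assumed location on the slow storage, not a path taken from this report.

```shell
# Hypothetical helper: move a directory to another location (possibly on a
# different filesystem) and leave a symlink behind at the old path.
relocate_dir() {
    src=$1    # e.g. "$HOME/.cache"
    dest=$2   # e.g. /mnt/slow/cache (assumed mount point)

    [ -d "$src" ] || { echo "not a directory: $src" >&2; return 1; }
    [ -L "$src" ] && { echo "already a symlink: $src" >&2; return 1; }
    [ -e "$dest" ] && { echo "destination exists: $dest" >&2; return 1; }

    mkdir -p "$(dirname "$dest")" || return 1
    mv "$src" "$dest" || return 1   # mv copies and deletes when crossing filesystems
    ln -s "$dest" "$src"            # the old path now points at the new location
}

# Usage (log out of Plasma first so nothing is writing to the cache):
#   relocate_dir "$HOME/.cache" /mnt/slow/cache
```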

OBSERVED RESULT
Depending on how slow/busy the storage medium is, Kwin will hang for at least 2 seconds, sometimes up to 15 seconds in really bad cases. During this hang, the system is completely unresponsive: there is no mouse or keyboard input whatsoever, and if the hang is long enough, Kwin will warn in the logs about DRM pageflips taking too long.

EXPECTED RESULT
The system should stay responsive and Kwin shouldn't hang, even if opening Tiling Editor takes slightly longer.

SOFTWARE/OS VERSIONS
Linux: 6.8.9-arch1-2 (64-bit)
KDE Plasma Version: 6.0.4
KDE Frameworks Version: 6.2.0
Qt Version: 6.7.0

ADDITIONAL INFORMATION
So far, known-affected effects are:

 - Tiling Editor (Meta+T)
 - Window Overview (Meta+W)
 - Alt+Tab, when Alt is held down (which brings up the window switcher menu)

It seems to be any effect that uses Qt QML, and other than those three, I haven't personally tested many.

This is also distro-independent. Users other than myself have reported the exact same hangs occurring on their systems, the common factor being that their home directory is stored on slow storage.

I have attached three logs that show the issue happening. One of them shows what Kwin sees when a hang happens. The two dmesg logs were captured with extremely verbose DRM logging enabled, one with two screens plugged in and one with one screen. These were captured with Xaver Hugl's help before I suspected the problem was disk-related, but maybe there's something useful in there.
Comment 1 Ritchie Frodomar 2024-05-15 04:07:24 UTC
Wanted to add additional info - I did mention "slow and busy" drives, but I should clarify that the busy-ness of the drive absolutely does affect the hanging. You can observe this even on a moderately-fast spinning disk by writing lots of data to it (like using dd if=/dev/urandom of=wherever bs=somethinghuge).
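
A minimal sketch of generating that kind of background write load, assuming GNU dd on Linux. `make_disk_busy` is a hypothetical helper name, and the scratch file path and size in the usage lines are placeholders.

```shell
# Hypothetical helper: keep a disk busy by writing random data to a scratch file.
# $1 = scratch file on the disk under test, $2 = amount to write in MiB.
make_disk_busy() {
    target=$1
    mib=$2
    dd if=/dev/urandom of="$target" bs=1M count="$mib" 2>/dev/null
}

# Usage: start the load in the background, trigger a desktop effect such as
# Meta+T while it runs, then stop it and clean up.
#   make_disk_busy /mnt/slow/busy.bin 20000 &
#   kill %1; rm /mnt/slow/busy.bin
```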
Comment 2 devminer 2024-05-21 21:18:26 UTC
This solved the Alt+Tab animations being slow and hanging for me, even though I already have a sort-of fast PCIe gen 3 SSD.

The data loaded from disk on every Alt+Tab seems small enough to just stay in memory, so I would appreciate it if the effects would just stay loaded in, *especially* for something like Alt+Tab.
Comment 3 Oded Arbel 2024-05-21 22:11:25 UTC
If I understand correctly, the problem is that kwin
Comment 4 Nate Graham 2024-05-21 23:56:46 UTC
This is probably an issue we should fix for the benefit of people with a single slow storage disk, but I have to ask... if you're a technical expert and have multiple storage hardware options, would it not make more sense to put the cache on your fastest one, rather than a known slow one?
Comment 5 Mithras 2024-05-22 00:13:39 UTC
Does this affect the whole .cache folder or only a specific subfolder? Is there any chance I can mount a tmpfs for just these animations?
Comment 6 Brodie Robertson 2024-05-22 02:38:28 UTC
(In reply to Nate Graham from comment #4)
> This is probably an issue we should fix for the benefit of people with a
> single slow storage disk, but I have to ask... if you're a technical expert
> and have multiple storage hardware options, would it not make more sense to
> put the cache on your fastest one, rather than a known slow one?

In my case I have my boot and application data on my NVMe drive, as this has the most noticeable effect on system speed, but my home directory is on my bulk storage hard drive. SSDs are much cheaper now, and on my next upgrade I will probably make the swap. But even if I do move to something faster, it doesn't change the fact that Kwin is constantly adding wear to the drive with little pings every time a desktop effect is used.
Comment 7 Brodie Robertson 2024-05-22 02:40:43 UTC
Additionally, Plasma seems to be the only software that is this negatively affected by the cache existing on a hard drive. GNOME, Hyprland and other desktops do not seem to exhibit similar issues.
Comment 8 Brodie Robertson 2024-05-22 02:42:30 UTC
(In reply to Mithras from comment #5)
> Does this affect the whole .cache folder or only a specific subfolder? Is
> there any chance I can mount a tmpfs for just these animations?

The folder at fault is .cache/kwin/qmlcache from what I can tell. At least in early testing, putting this on a tmpfs does not seem to cause additional issues.
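
One way to try the tmpfs approach is an fstab entry for that directory. The sketch below only prints the entry so it can be reviewed before use. `tmpfs_fstab_line` is a hypothetical helper, and the 64m size limit in the usage line is an assumption rather than a measured requirement. Anything stored on a tmpfs is lost on unmount or reboot, which should be acceptable for a cache. Comment 16 in this report points at the Qt RHI pipeline cache rather than the QML cache, so this alone may not remove the hangs.

```shell
# Hypothetical helper: print an fstab entry that puts a directory on tmpfs.
# $1 = directory, $2 = size limit (e.g. 64m), $3 = owning uid, $4 = owning gid.
tmpfs_fstab_line() {
    printf 'tmpfs %s tmpfs rw,nosuid,nodev,size=%s,uid=%s,gid=%s,mode=0700 0 0\n' \
        "$1" "$2" "$3" "$4"
}

# Usage: review the printed line, then append it to /etc/fstab and mount it.
#   tmpfs_fstab_line "$HOME/.cache/kwin/qmlcache" 64m "$(id -u)" "$(id -g)"
```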
Comment 9 Victoria 2024-05-22 05:53:08 UTC
Can this also be the reason why Plasma can sometimes completely freeze during heavy IO (like downloading a big game on Steam)?
Comment 10 Nate Graham 2024-05-22 15:45:04 UTC
(In reply to Brodie Robertson from comment #7)
> Additionally, Plasma seems to be the only software that is this negatively
> affected by the cache existing on a hard drive. GNOME, Hyprland and other
> desktops do not seem to exhibit similar issues.
For sure, we can and will fix this. People have complained about similar issues when using HDDs for years.

I'm just saying that if you have faster storage available, and you're enough of a technical expert to put things on different disks according to their performance characteristics, it would make sense to put the cache folder on a performant disk too, until this is fixed (and probably even after this is fixed).
Comment 11 Gonçalo Negrier Duarte 2024-05-22 16:16:27 UTC
(In reply to Nate Graham from comment #10)
> (In reply to Brodie Robertson from comment #7)
> > Additionally, Plasma seems to be the only software that is this negatively
> > affected by the cache existing on a hard drive. GNOME, Hyprland and other
> > desktops do not seem to exhibit similar issues.
> For sure, we can and will fix this. People have complained about similar
> issues when using HDDs for years.
> 
> I'm just saying that if you have faster storage available, and you're
> enough of a technical expert to put things on different disks according to
> their performance characteristics, it would make sense to put the cache
> folder on a performant disk too, until this is fixed (and probably even
> after this is fixed).

I was never affected by this problem since my laptop has two SSDs, but the constant reads and writes cause unnecessary wear on the drive.
I also noticed that there are other Plasma cache directories and files outside of the affected `~/.cache/kwin/qmlcache` (and I could be missing more):
```bash
~/.cache/KDE/
~/.cache/plasmashell/
~/.cache/plasma-systemmonitor/
~/.cache/plasma_theme_internal-system-colors.kcache
~/.cache/plasma_theme_default.kcache
```
These files are small, so wouldn't it be a good idea to use the /tmp directory for this type of cache instead?
I don't know if the KDE cache could grow when using other KDE plugins, but I don't really think anyone is running Plasma 5 and up on a system with less than 2 GB of RAM.
Comment 12 Ritchie Frodomar 2024-05-22 18:11:28 UTC
(In reply to Nate Graham from comment #10)
> (In reply to Brodie Robertson from comment #7)
> > Additionally, Plasma seems to be the only software that is this negatively
> > affected by the cache existing on a hard drive. GNOME, Hyprland and other
> > desktops do not seem to exhibit similar issues.
> For sure, we can and will fix this. People have complained about similar
> issues when using HDDs for years.
> 
> I'm just saying that if you have faster storage available, and you're
> enough of a technical expert to put things on different disks according to
> their performance characteristics, it would make sense to put the cache
> folder on a performant disk too, until this is fixed (and probably even
> after this is fixed).

In my case, when I first discovered this issue, my OS was on NVMe with my home on an LVM pool (two 4TB HDDs + one 1TB NVMe cache) forming an 8TB volume.

Plasma is the only thing on my system that struggles with that configuration, and I don't store anything else in /home that would benefit from an SSD more than it would the extra space. 8 TB worth of SSDs is still prohibitively expensive for me, and not worth it for what I store on them. Even with gaming, the little extra load time doesn't bother me.
Comment 13 Zix 2024-05-24 19:20:50 UTC
(In reply to Victoria from comment #9)
> Can this also be the reason why Plasma can sometimes completely freeze
> during heavy IO (like downloading a big game on Steam)?

I apologize for making a post unrelated to the bug at hand, but I believe that this kind of lockup is related to the system's dirty_bytes settings. If you are using a system that uses the default Linux kernel values, like Arch Linux, and have a lot of memory, the default values can cause huge blocking synchronous writes. Try looking at the following for guidance:

https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
https://github.com/pop-os/default-settings/blob/2d38675f21a190d6ab2794b8961339c28ce0f7dd/etc/sysctl.d/10-pop-default-settings.conf
https://wiki.archlinux.org/title/sysctl#Small_periodic_system_freezes
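
To get a feel for why a lot of memory matters here, the sketch below estimates how much dirty data may pile up before writers are throttled. It is a simplification: the kernel applies vm.dirty_ratio to available memory rather than to MemTotal, and the ratio is not used at all when vm.dirty_bytes is set to a non-zero value. `dirty_limit_mib` is a hypothetical helper name.

```shell
# Rough estimate of the dirty-page limit in MiB:
#   memory (KiB) * vm.dirty_ratio / 100, converted to MiB.
# Simplification: uses total memory, which overstates the real limit.
dirty_limit_mib() {
    mem_kib=$1
    ratio=$2
    echo $(( mem_kib * ratio / 100 / 1024 ))
}

# Usage on a live system:
#   dirty_limit_mib "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)" \
#                   "$(cat /proc/sys/vm/dirty_ratio)"
```

With 32 GiB of memory and a ratio of 20, this comes out at 6553 MiB, which a slow disk needs a long time to write back.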
Comment 14 Fabian Vogt 2024-05-28 19:22:36 UTC
I tried to simulate a very slow .cache using the delay device mapper module:

> dd if=/dev/zero of=cachelp bs=1M count=64
> /sbin/mkfs.ext4 cachelp
> sudo losetup -f cachelp
> echo "0 $(sudo blockdev --getsz /dev/loop0) delay /dev/loop0 0 500" | sudo dmsetup create delayed
> sudo mount /dev/mapper/delayed ~testuser/.cache
> sudo chown testuser /home/testuser/.cache
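
The table line passed to dmsetup for the delay target has the form "start length delay device offset delay_ms". A small sketch that builds it, plus a teardown for the whole setup, follows. `delay_table` and `teardown_delayed` are hypothetical helper names. Note that `losetup -f` picks the first free loop device, which is not necessarily /dev/loop0; `losetup --find --show` prints the device it chose.

```shell
# Print a device-mapper table for the "delay" target.
# $1 = backing device, $2 = length in 512-byte sectors, $3 = delay in ms.
delay_table() {
    printf '0 %s delay %s 0 %s\n' "$2" "$1" "$3"
}

# Undo the setup: unmount, remove the mapping, detach the loop device.
# $1 = mount point, $2 = device-mapper name, $3 = loop device.
teardown_delayed() {
    sudo umount "$1" &&
    sudo dmsetup remove "$2" &&
    sudo losetup -d "$3"
}

# Usage, mirroring the commands above:
#   delay_table /dev/loop0 "$(sudo blockdev --getsz /dev/loop0)" 500 | sudo dmsetup create delayed
#   teardown_delayed ~testuser/.cache delayed /dev/loop0
```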

Then I ran as testuser

> dbus-run-session gdb --args kwin_wayland --x11-display $DISPLAY --exit-with-session /usr/bin/konsole

Moved the window around a bit and maximised it, to make sure caches (.qmlc, ksycoca) got generated. Then I restarted kwin and maximised the konsole window and it lagged. Backtrace:

Thread 1 (Thread 0x7f515559bb00 (LWP 31711) "kwin_wayland"):
#0  0x00007f5158508daa in fdatasync () at /lib64/libc.so.6
#1  0x00007f5158d65d63 in QLockFile::tryLock(std::chrono::duration<long, std::ratio<1l, 1000l> >) () at /lib64/libQt6Core.so.6
#2  0x00007f515b167fd0 in QSGRhiSupport::preparePipelineCache(QRhi*, QQuickWindow*) () at /lib64/libQt6Quick.so.6
#3  0x00007f515b168d0c in QSGRhiSupport::createRhi(QQuickWindow*, QSurface*, bool) () at /lib64/libQt6Quick.so.6
#4  0x00007f515b14a763 in  () at /lib64/libQt6Quick.so.6
#5  0x00007f515b14c73d in  () at /lib64/libQt6Quick.so.6
#6  0x00007f515963d7e9 in QWindow::event(QEvent*) () at /lib64/libQt6Gui.so.6
#7  0x00007f5159fc2f1e in QApplicationPrivate::notify_helper(QObject*, QEvent*) () at /lib64/libQt6Widgets.so.6
#8  0x00007f5158d8f030 in QCoreApplication::notifyInternal2(QObject*, QEvent*) () at /lib64/libQt6Core.so.6
#9  0x00007f51595eda8b in QGuiApplicationPrivate::processExposeEvent(QWindowSystemInterfacePrivate::ExposeEvent*) () at /lib64/libQt6Gui.so.6
#10 0x00007f515964c05c in QWindowSystemInterface::sendWindowSystemEvents(QFlags<QEventLoop::ProcessEventsFlag>) () at /lib64/libQt6Gui.so.6
#11 0x00007f515964c1e7 in QWindowSystemInterface::flushWindowSystemEvents(QFlags<QEventLoop::ProcessEventsFlag>) () at /lib64/libQt6Gui.so.6
#12 0x00007f5159627340 in QPlatformWindow::setVisible(bool) () at /lib64/libQt6Gui.so.6
#13 0x00007f515aa839b4 in  () at /lib64/libQt6Qml.so.6
#14 0x00007f515aa97936 in  () at /lib64/libQt6Qml.so.6
#15 0x00007f515aa9613d in QQmlBinding::doUpdate(QQmlJavaScriptExpression::DeleteWatcher const&, QFlags<QQmlPropertyData::WriteFlag>, QV4::Scope&) () at /lib64/libQt6Qml.so.6
#16 0x00007f515aa94084 in QQmlBinding::update(QFlags<QQmlPropertyData::WriteFlag>) () at /lib64/libQt6Qml.so.6
#17 0x00007f515ab096c8 in QQmlNotifier::emitNotify(QQmlNotifierEndpoint*, void**) () at /lib64/libQt6Qml.so.6
#18 0x00007f5158de7e88 in  () at /lib64/libQt6Core.so.6
#19 0x00007f515ba2479d in KWin::Window::setElectricBorderMaximizing(bool) () at /lib64/libkwin.so.6

Apparently something in Qt's Rhi has a lock file in .cache which needs to be synced to disk (/home/testuser/.cache/kwin/qtpipelinecache-x86_64-little_endian-lp64/qqpc_opengl.lck). Fortunately it's only a fdatasync and not a fsync or even sync so it only waits until that specific file data has made it to disk, but it can still block for a while.

Maybe there's a way to avoid QLockFile use there.
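
To check whether a given system has such lock files, something like the following sketch could be used. `list_cache_locks` is a hypothetical helper, and the `*.lck` pattern is inferred from the single file name above, so other lock files may be named differently.

```shell
# List lock files under a cache directory, such as the qqpc_opengl.lck above.
# $1 = cache root (defaults to ~/.cache).
list_cache_locks() {
    find "${1:-$HOME/.cache}" -type f -name '*.lck' 2>/dev/null
}

# Usage:
#   list_cache_locks               # search ~/.cache
#   list_cache_locks ~/.cache/kwin # search only kwin's cache
```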
Comment 15 Méven Car 2024-05-29 09:03:04 UTC
(In reply to Fabian Vogt from comment #14)
> I tried to simulate a very slow .cache using the delay device mapper module:
> 
> > dd if=/dev/zero of=cachelp bs=1M count=64
> > /sbin/mkfs.ext4 cachelp
> > sudo losetup -f cachelp
> > echo "0 $(sudo blockdev --getsz /dev/loop0) delay /dev/loop0 0 500" | sudo dmsetup create delayed
> > sudo mount /dev/mapper/delayed ~testuser/.cache
> > sudo chown testuser /home/testuser/.cache
> 
> Then I ran as testuser
> 
> > dbus-run-session gdb --args kwin_wayland --x11-display $DISPLAY --exit-with-session /usr/bin/konsole
> 
> Moved the window around a bit and maximised it, to make sure caches (.qmlc,
> ksycoca) got generated. Then I restarted kwin and maximised the konsole
> window and it lagged. Backtrace:
> 
> Thread 1 (Thread 0x7f515559bb00 (LWP 31711) "kwin_wayland"):
> #0  0x00007f5158508daa in fdatasync () at /lib64/libc.so.6
> #1  0x00007f5158d65d63 in QLockFile::tryLock(std::chrono::duration<long,
> std::ratio<1l, 1000l> >) () at /lib64/libQt6Core.so.6
> #2  0x00007f515b167fd0 in QSGRhiSupport::preparePipelineCache(QRhi*,
> QQuickWindow*) () at /lib64/libQt6Quick.so.6
> #3  0x00007f515b168d0c in QSGRhiSupport::createRhi(QQuickWindow*, QSurface*,
> bool) () at /lib64/libQt6Quick.so.6
> #4  0x00007f515b14a763 in  () at /lib64/libQt6Quick.so.6
> #5  0x00007f515b14c73d in  () at /lib64/libQt6Quick.so.6
> #6  0x00007f515963d7e9 in QWindow::event(QEvent*) () at /lib64/libQt6Gui.so.6
> #7  0x00007f5159fc2f1e in QApplicationPrivate::notify_helper(QObject*,
> QEvent*) () at /lib64/libQt6Widgets.so.6
> #8  0x00007f5158d8f030 in QCoreApplication::notifyInternal2(QObject*,
> QEvent*) () at /lib64/libQt6Core.so.6
> #9  0x00007f51595eda8b in
> QGuiApplicationPrivate::processExposeEvent(QWindowSystemInterfacePrivate::
> ExposeEvent*) () at /lib64/libQt6Gui.so.6
> #10 0x00007f515964c05c in
> QWindowSystemInterface::sendWindowSystemEvents(QFlags<QEventLoop::
> ProcessEventsFlag>) () at /lib64/libQt6Gui.so.6
> #11 0x00007f515964c1e7 in
> QWindowSystemInterface::flushWindowSystemEvents(QFlags<QEventLoop::
> ProcessEventsFlag>) () at /lib64/libQt6Gui.so.6
> #12 0x00007f5159627340 in QPlatformWindow::setVisible(bool) () at
> /lib64/libQt6Gui.so.6
> #13 0x00007f515aa839b4 in  () at /lib64/libQt6Qml.so.6
> #14 0x00007f515aa97936 in  () at /lib64/libQt6Qml.so.6
> #15 0x00007f515aa9613d in
> QQmlBinding::doUpdate(QQmlJavaScriptExpression::DeleteWatcher const&,
> QFlags<QQmlPropertyData::WriteFlag>, QV4::Scope&) () at /lib64/libQt6Qml.so.6
> #16 0x00007f515aa94084 in
> QQmlBinding::update(QFlags<QQmlPropertyData::WriteFlag>) () at
> /lib64/libQt6Qml.so.6
> #17 0x00007f515ab096c8 in QQmlNotifier::emitNotify(QQmlNotifierEndpoint*,
> void**) () at /lib64/libQt6Qml.so.6
> #18 0x00007f5158de7e88 in  () at /lib64/libQt6Core.so.6
> #19 0x00007f515ba2479d in KWin::Window::setElectricBorderMaximizing(bool) ()
> at /lib64/libkwin.so.6
> 
> Apparently something in Qt's Rhi has a lock file in .cache which needs to be
> synced to disk
> (/home/testuser/.cache/kwin/qtpipelinecache-x86_64-little_endian-lp64/
> qqpc_opengl.lck). Fortunately it's only a fdatasync and not a fsync or even
> sync so it only waits until that specific file data has made it to disk, but
> it can still block for a while.
> 
> Maybe there's a way to avoid QLockFile use there.

Relevant documentation:
https://doc.qt.io/qt-5/qmldiskcache.html

Another workaround would be to use `export QML_DISK_CACHE_PATH=/tmp/qmlcache` where /tmp/qmlcache is a fast drive or a ramdisk.

For some reason (which I don't know, but there probably is one) we don't use `qt_add_qml_module` in Plasma and KWin, which pre-compiles the JS/QML/C++ bindings and embeds them in the executable's resources:
https://doc.qt.io/qt-6/qt-add-qml-module.html#caching-compiled-qml-sources

We do use it in applications (spectacle, neochat...).
Comment 16 Fabian Vogt 2024-05-29 09:15:25 UTC
(In reply to Méven Car from comment #15)
> (In reply to Fabian Vogt from comment #14)
> > Apparently something in Qt's Rhi has a lock file in .cache which needs to be
> > synced to disk
> > (/home/testuser/.cache/kwin/qtpipelinecache-x86_64-little_endian-lp64/
> > qqpc_opengl.lck). Fortunately it's only a fdatasync and not a fsync or even
> > sync so it only waits until that specific file data has made it to disk, but
> > it can still block for a while.
> > 
> > Maybe there's a way to avoid QLockFile use there.
> 
> Relevant documentation:
> https://doc.qt.io/qt-5/qmldiskcache.html
> 
> Another workaround would be to use `export
> QML_DISK_CACHE_PATH=/tmp/qmlcache` where /tmp/qmlcache is a fast drive or a
> ramdisk.
> 
> For some reason (which I don't know, but there probably is one) we don't use
> `qt_add_qml_module` in Plasma and KWin, which pre-compiles the JS/QML/C++
> bindings and embeds them in the executable's resources:
> https://doc.qt.io/qt-6/qt-add-qml-module.html#caching-compiled-qml-sources
> 
> We do use it in applications (spectacle, neochat...).

The QML disk cache is completely unrelated.

This is the Qt Rhi pipeline cache, which is (unconditionally) enabled by QtQuick: https://github.com/qt/qtdeclarative/blob/d9ceacd4126ff48d5f1e6d5b7b0f3be426d2cf35/src/quick/scenegraph/qsgrhisupport.cpp#L941
Comment 17 David Edmundson 2024-05-29 11:54:46 UTC
For me, strace showed that the second invocation did not touch the QML cache files, as expected.

RHI pipeline was hit a lot.

RHI pipeline cache can be disabled, see:
https://doc.qt.io/qt-6/qquickgraphicsconfiguration.html#the-automatic-pipeline-cache

There are env vars for easy profiling.
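
A sketch of a wrapper for trying this from a shell. The variable names are assumptions quoted from memory and should be checked against the Qt page linked above for the Qt version in use: `QT_DISABLE_SHADER_DISK_CACHE` is assumed to switch the automatic pipeline cache off, and `QSG_INFO` is assumed to enable the scene graph diagnostics. `run_without_pipeline_cache` is a hypothetical helper name.

```shell
# Hypothetical wrapper: run a command with the Qt Quick automatic pipeline
# cache disabled and scene graph diagnostics enabled.
# Both variable names are assumptions to verify against the Qt documentation.
run_without_pipeline_cache() {
    env QT_DISABLE_SHADER_DISK_CACHE=1 QSG_INFO=1 "$@"
}

# Usage, in a nested test session rather than the running desktop:
#   run_without_pipeline_cache kwin_wayland --exit-with-session /usr/bin/konsole
```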
Comment 18 David Edmundson 2024-05-29 11:56:59 UTC
Interestingly, there is a line saying that the automatic pipeline cache naming doesn't handle the same UI in multiple windows in the same app very well, and that's something we use in kwin. So there's definitely something to tweak.
Comment 19 David Edmundson 2024-05-29 12:16:18 UTC
I ran with

QCoreApplication::setAttribute(Qt::AA_DisableShaderDiskCache); in main.cpp and it's certainly not visibly worse. 

Our pipelines aren't exactly complicated, and mesa has a cache anyway, maybe it isn't worth it.
Comment 20 David Edmundson 2024-05-29 12:23:33 UTC
Turning this into some actual numbers, total time in pipeline creation for both the overview and cube effect with the cache off gave the result:

"Total time spent on pipeline creation during the lifetime of the QRhi 0x576293504b00 was 0 ms" on a 5 year old Intel laptop.
Comment 21 David Edmundson 2024-05-29 12:42:32 UTC
>Maybe there's a way to avoid QLockFile use there.

I think so. From my reading of the code it can all be guarded with 
#if !QT_CONFIG(temporaryfile)

The saving *sometimes* uses QSaveFile, which writes to another file and does an atomic move, depending on build flags. I think we only need the lock file when not using that.
Comment 22 Bug Janitor Service 2024-05-30 15:40:21 UTC
A possibly relevant merge request was started @ https://invent.kde.org/plasma/kwin/-/merge_requests/5802
Comment 23 Vlad Zahorodnii 2024-05-31 15:28:13 UTC
Git commit f700de56f8ae6c15c7fd7d18bdaf47bf1e21b219 by Vlad Zahorodnii, on behalf of David Edmundson.
Committed on 31/05/2024 at 15:28.
Pushed by vladz into branch 'master'.

core: Disable Qt RHI pipeline cache

The Qt pipeline cache causes a disk sync on every load and save of a
QQuickWindow. This causes a stutter under high disk usage.

The gains from this cache are minimal on our simple scenes on PC
hardware. Especially given mesa has its own cache, profiling on my
personal laptop showed the pipeline as being 0ms.

There is an upstream patch at
https://codereview.qt-project.org/c/qt/qtdeclarative/+/564411 .
QSaveFile still has a sync, but that should only be hit for the first
non-cached run. I'm also adding a flag to QSaveFile to fix the QML cache
and first run case. 

Tested via running kwin with `strace -e
inject=fdatasync:delay_enter=10000000` to simulate a slow flush.

M  +5    -0    src/main_wayland.cpp
M  +5    -0    src/main_x11.cpp

https://invent.kde.org/plasma/kwin/-/commit/f700de56f8ae6c15c7fd7d18bdaf47bf1e21b219
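
The slow-flush test from the commit message can be parameterised as in the sketch below. strace interprets a bare `delay_enter` value as microseconds, so 10000000 corresponds to 10 seconds. `slow_fdatasync_cmd` is a hypothetical helper that only prints the command line, and the `-f` and `-e trace=fdatasync` options are additions for illustration, not part of the original test.

```shell
# Hypothetical helper: print an strace command line that delays every
# fdatasync() call of the traced program by the given number of seconds.
# $1 = delay in seconds, remaining arguments = program to trace.
slow_fdatasync_cmd() {
    secs=$1
    shift
    printf 'strace -f -e trace=fdatasync -e inject=fdatasync:delay_enter=%s %s\n' \
        "$(( secs * 1000000 ))" "$*"
}

# Usage:
#   slow_fdatasync_cmd 10 kwin_wayland
```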
Comment 24 Vlad Zahorodnii 2024-05-31 18:17:12 UTC
Git commit c747f9c3a7cb9aab74ea07f38eb5de43feb06a2c by Vlad Zahorodnii.
Committed on 31/05/2024 at 18:06.
Pushed by vladz into branch 'Plasma/6.1'.

core: Disable Qt RHI pipeline cache

The Qt pipeline cache causes a disk sync on every load and save of a
QQuickWindow. This causes a stutter under high disk usage.

The gains from this cache are minimal on our simple scenes on PC
hardware. Especially given mesa has its own cache, profiling on my
personal laptop showed the pipeline as being 0ms.

There is an upstream patch at
https://codereview.qt-project.org/c/qt/qtdeclarative/+/564411 .
QSaveFile still has a sync, but that should only be hit for the first
non-cached run. I'm also adding a flag to QSaveFile to fix the QML cache
and first run case. 

Tested via running kwin with `strace -e
inject=fdatasync:delay_enter=10000000` to simulate a slow flush.


(cherry picked from commit f700de56f8ae6c15c7fd7d18bdaf47bf1e21b219)

ac5aeb67 core: Disable Qt RHI pipeline cache

Co-authored-by: David Edmundson <kde@davidedmundson.co.uk>

M  +5    -0    src/main_wayland.cpp
M  +5    -0    src/main_x11.cpp

https://invent.kde.org/plasma/kwin/-/commit/c747f9c3a7cb9aab74ea07f38eb5de43feb06a2c