Bug 440386 - High CPU use of kwin_wayland when video playing in firefox
Status: ASSIGNED
Alias: None
Product: kwin
Classification: Plasma
Component: wayland-generic
Version: 5.23.5
Platform: Other Linux
Importance: NOR normal
Target Milestone: ---
Assignee: KWin default assignee
 
Reported: 2021-07-29 14:17 UTC by Fabian Vogt
Modified: 2024-03-18 21:59 UTC
Description Fabian Vogt 2021-07-29 14:17:53 UTC
When using firefox in a wayland session and playing a video on youtube, kwin_wayland takes a constant ~75% CPU (single core) and Firefox is at ~25%.
It does not make a difference whether Firefox is running as Wayland or X11 client.

This slows applications like plasmashell down massively, to the point of being unusable.

This issue does not happen in an X11 session.
Comment 1 Fabian Vogt 2021-07-31 13:58:12 UTC
It does not happen with either "KWIN_COMPOSE=Q" or when using the framebuffer backend (which also forces QPainter).

I noticed that even though the refresh rate is set to 60Hz, kwin actually renders 100fps according to the Show FPS effect. When idle, the default latency configuration subtracts 0.5*vblankInterval (8ms) from nextRenderTimestamp, which then ends up at ~8ms delay between renders. As a test, I removed the subtraction and ended up with ~35fps in firefox. While it undershot the FPS target quite a bit (as expected), CPU use is down to ~55% that way.
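
To make the timing above concrete, here is a minimal sketch of the arithmetic. It is a simplified model and not KWin's RenderLoop code: the function names are made up for illustration, and it assumes the next render simply starts a fixed fraction of the vblank interval before the following vblank.

```python
def vblank_interval_ms(refresh_hz):
    """Time between two vblanks, in milliseconds."""
    return 1000.0 / refresh_hz


def idle_render_delay_ms(refresh_hz, latency_fraction=0.5):
    """Assumed model: the next render is scheduled
    latency_fraction * vblankInterval before the following vblank."""
    interval = vblank_interval_ms(refresh_hz)
    return interval - latency_fraction * interval


def fps_ceiling(delay_ms):
    """Upper bound on frames per second if a render starts every delay_ms."""
    return 1000.0 / delay_ms


if __name__ == "__main__":
    delay = idle_render_delay_ms(60)
    print(f"vblank interval: {vblank_interval_ms(60):.2f} ms")
    print(f"delay between renders: {delay:.2f} ms")
    print(f"fps ceiling: {fps_ceiling(delay):.0f}")
```

Under this model a 60Hz output gives a delay of roughly 8.3ms between renders, i.e. a ceiling of about 120fps, so seeing ~100fps once render time is included would be plausible.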

According to "perf", a lot of time is spent in texture format conversion functions, because Mesa defaults to using RGBA8, but kwin calls glTexImage2D/glTexSubImage2D with GL_BGRA. Sometimes there's also a mismatch between RGB8 and RGBA8, when the texture was created using a QImage without alpha, but updates come from a source with an alpha channel. For quick testing, I just hardcoded GL_RGBA in those places to avoid the conversion. With that, CPU% is down from ~75% to ~65%; most of the savings are probably eaten up by giving room for a higher frame rate.
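
As an illustration of why such a mismatch costs CPU time with a software rasterizer, the sketch below shows what a BGRA to RGBA conversion has to do: touch every pixel of the upload and swap two channels. This is not Mesa's conversion code; `bgra_to_rgba` is a hypothetical helper that assumes tightly packed 4-byte pixels.

```python
def bgra_to_rgba(pixels):
    """Swap the B and R channels of tightly packed 4-byte pixels.

    Every pixel is read and rewritten, which is the kind of per-upload
    work that matching the upload format to the internal format avoids.
    """
    if len(pixels) % 4 != 0:
        raise ValueError("expected 4 bytes per pixel")
    out = bytearray(pixels)
    out[0::4] = pixels[2::4]  # R takes the old B position
    out[2::4] = pixels[0::4]  # B takes the old R position
    return bytes(out)


if __name__ == "__main__":
    # One opaque blue pixel in BGRA byte order.
    print(bgra_to_rgba(bytes([255, 0, 0, 255])).hex())
```

When the source and destination formats already agree, the upload can be a plain copy instead.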

The next Mesa version has an optimized render path for 2d rasterization in llvmpipe, so I tried with that as well. After I confirmed that the fast path was actually used, I saw a further ~3% decrease in CPU utilization of kwin.

FWICT, the high CPU utilization is mostly from having a higher target frame rate than is actually displayed, which makes kwin_wayland fully CPU bound.
Comment 2 Fabian Vogt 2021-08-01 12:53:48 UTC
> FWICT, the high CPU utilization is mostly from having a higher target frame rate than is actually displayed, which makes kwin_wayland fully CPU bound.

I somehow totally forgot that this was with the Show FPS effect enabled, which causes kwin to render as fast as possible, even if no client actually requests a repaint. By counting how often RenderLoop::endFrame is called per second instead (not being that familiar with kwin, I'm not entirely sure that's the right place either), the real repaints per second are visible.
With a video in a single firefox window this averages at ~55, so this isn't actually the cause in this specific scenario. With multiple windows or just moving the cursor it does reach ~172 though.
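
The counting approach described above can be sketched as follows. This is only an illustration of the measurement idea, not the instrumentation that was actually added to kwin: `RateCounter` and its `tick` method are hypothetical names, and timestamps are passed in explicitly so the logic can be checked without a running compositor.

```python
class RateCounter:
    """Counts events per one-second window.

    Call tick(now_s) once per event (for example once per finished
    frame); completed windows are appended to self.rates.
    """

    def __init__(self):
        self._window_start = None
        self._count = 0
        self.rates = []

    def tick(self, now_s):
        if self._window_start is None:
            self._window_start = now_s
        # Close every full window that has elapsed before this event.
        while now_s - self._window_start >= 1.0:
            self.rates.append(self._count)
            self._count = 0
            self._window_start += 1.0
        self._count += 1


if __name__ == "__main__":
    counter = RateCounter()
    for i in range(55):
        counter.tick(i / 55.0)
    counter.tick(1.0)
    print(counter.rates)
```

Unlike the Show FPS effect, a counter of this kind does not itself trigger repaints, so it reports only the frames that were rendered anyway.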

After eliminating the hot spot in AbstractEglTexture::createTextureSubImage (apparently FF and Xwayland both use shm), perf shows that most of the time is spent in llvmpipe.

What I do not understand so far is why this is not the case in an X11 session (~24% CPU), not even inside kwin_wayland --x11-display $DISPLAY (kwin_x11 25%, kwin_wayland 9.6%). It only happens when using kwin_wayland --drm.

Maybe swrast has some issues when the render target is DRM?
Comment 3 Nate Graham 2021-08-03 21:49:31 UTC
Cannot reproduce; kwin_wayland hovers at around 12.8% (single core) while a video is playing in Firefox on my 4k screen in the Plasma Wayland session.
Comment 4 Fabian Vogt 2021-08-04 06:54:39 UTC
I also tried playing a video in FF in Weston using the DRM backend and CPU usage is ~25%. I had a look at how weston deals with received buffers and copying damaged regions into local textures, and it uses the same unpack mechanism that kwin meanwhile uses on the master branch. Even running kwin master, CPU usage is at ~75% just for showing the spinning circle loading animation.

(In reply to Nate Graham from comment #3)
> Cannot reproduce; kwin_wayland hovers at around 12.8% (single core) while a
> video is playing in Firefox on my 4k screen in the Plasma Wayland session.

Apparently only kwin_wayland running in DRM mode in VMs without 3d acceleration is affected. I suspect something related to kms_swrast.so at this point, but I'm not sure. You could try LIBGL_ALWAYS_SOFTWARE=1 startplasma-wayland; I haven't actually tried that on real HW yet.
Comment 5 Fabian Vogt 2021-08-05 13:39:33 UTC
I did some more comparisons with weston.

perf shows that while weston runs into the "fs_variant_whole" path in llvmpipe, kwin_wayland almost exclusively uses the slower "fs_variant_partial" functions.
With LP_DEBUG=counters, it's visible that weston triggers this because it renders mostly opaque triangles, while kwin doesn't have any of them. With apitrace, it's visible that Weston does rendering with GL_BLEND disabled, but kwin_wayland has it enabled. While the surface(s) used by firefox are all RGBA, it calls "set_opaque_region" to make them totally opaque:

[4161028,247]  -> xdg_surface@45.set_window_geometry(0, 0, 1920, 982)
[4161028,579]  -> wl_compositor@4.create_region(new id wl_region@54)
[4161028,740]  -> wl_region@54.add(0, 0, 1920, 982)
[4161029,064]  -> wl_surface@41.set_opaque_region(wl_region@54)
[4161029,197]  -> wl_region@54.destroy()

KWin seems to ignore this. With disabled blending, kwin uses "fs_variant_whole" as well and CPU usage goes down ~10%.
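
The decision a compositor could make from that request can be sketched like this. It is a hedged illustration and not KWin code: `region_covers` and `needs_blending` are hypothetical helpers, and rectangles are plain (x, y, width, height) tuples in surface coordinates.

```python
def region_covers(rects, surface):
    """Return True if the union of rects covers the whole surface rect."""
    sx, sy, sw, sh = surface
    if sw <= 0 or sh <= 0:
        return True  # an empty surface has nothing left to cover
    # Split the surface into cells along every rectangle edge inside it.
    xs = {sx, sx + sw}
    ys = {sy, sy + sh}
    for rx, ry, rw, rh in rects:
        for x in (rx, rx + rw):
            if sx < x < sx + sw:
                xs.add(x)
        for y in (ry, ry + rh):
            if sy < y < sy + sh:
                ys.add(y)
    xs, ys = sorted(xs), sorted(ys)
    # Each cell is either fully inside or fully outside any given rect.
    for x0, x1 in zip(xs, xs[1:]):
        for y0, y1 in zip(ys, ys[1:]):
            if not any(rx <= x0 and x1 <= rx + rw and ry <= y0 and y1 <= ry + rh
                       for rx, ry, rw, rh in rects):
                return False
    return True


def needs_blending(has_alpha, opaque_rects, surface):
    """Blend only if the buffer has alpha and is not declared fully opaque."""
    return has_alpha and not region_covers(opaque_rects, surface)


if __name__ == "__main__":
    surface = (0, 0, 1920, 982)
    print(needs_blending(True, [(0, 0, 1920, 982)], surface))
```

With the geometry from the log above, the opaque region equals the window geometry, so a check of this kind would report that blending is unnecessary even though the buffer format carries an alpha channel.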

What I also noticed is that firefox uses two surfaces for rendering. The main surface is completely empty (all transparent) and only its subsurface has content. While weston draws a single quad, kwin renders both. This doesn't seem to make a big difference, it's probably discarded early on.

My current theory about the huge difference between DRM and other platforms is that access to the mapped framebuffer is just slow (for blending that hits twice), and on top of that it's mapped and unmapped for each frame, which causes page faults.
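
A rough back-of-the-envelope estimate of what "hits twice" would mean for framebuffer traffic is sketched below. It rests on several assumptions: the whole 1920x982 surface from the log above is repainted every frame, pixels are 4 bytes, the rate is the ~55 repaints per second measured in comment 2, and blending costs one read plus one write per pixel. Caches, damage tracking and the map/unmap overhead are ignored, so the numbers only indicate scale.

```python
BYTES_PER_PIXEL = 4  # assumed 32-bit pixels such as ARGB8888


def frame_bytes(width, height):
    """Bytes touched when every pixel of the surface is written once."""
    return width * height * BYTES_PER_PIXEL


def framebuffer_traffic_bytes_per_s(width, height, fps, blending):
    """Blending reads the destination before writing it, so it is
    counted as two accesses per pixel instead of one."""
    accesses = 2 if blending else 1
    return frame_bytes(width, height) * fps * accesses


if __name__ == "__main__":
    for blending in (False, True):
        rate = framebuffer_traffic_bytes_per_s(1920, 982, 55, blending)
        print(f"blending={blending}: {rate / 1e6:.0f} MB/s")
```

If access to the mapped framebuffer is slow, doubling this traffic through blending would hurt correspondingly more than on a backend that renders into ordinary memory.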
Comment 6 Vlad Zahorodnii 2021-08-05 13:47:19 UTC
(In reply to Fabian Vogt from comment #5)
> I did some more comparisons with weston.
> 
> perf shows that while weston runs into the "fs_variant_whole" path in
> llvmpipe, kwin_wayland almost exclusively uses the slower
> "fs_variant_partial" functions.
> With LP_DEBUG=counters, it's visible that weston triggers this because it
> renders mostly opaque triangles, while kwin doesn't have any of them. With
> apitrace, it's visible that Weston does rendering with GL_BLEND disabled,
> but kwin_wayland has it enabled. While the surface(s) used by firefox are
> all RGBA, it calls "set_opaque_region" to make them totally opaque:
> 
> [4161028,247]  -> xdg_surface@45.set_window_geometry(0, 0, 1920, 982)
> [4161028,579]  -> wl_compositor@4.create_region(new id wl_region@54)
> [4161028,740]  -> wl_region@54.add(0, 0, 1920, 982)
> [4161029,064]  -> wl_surface@41.set_opaque_region(wl_region@54)
> [4161029,197]  -> wl_region@54.destroy()
> 
> KWin seems to ignore this. With disabled blending, kwin uses
> "fs_variant_whole" as well and CPU usage goes down ~10%.
> 
> What I also noticed is that firefox uses two surfaces for rendering. The
> main surface is completely empty (all transparent) and only its subsurface
> has content. While weston draws a single quad, kwin renders both. This
> doesn't seem to make a big difference, it's probably discarded early on.

The topmost subsurface contains the client-side drop-shadow, but yeah kwin's
clipping algorithm doesn't work well with sub-surfaces right now (I hope to
change that with the scene redesign).

> My current theory about the huge difference between DRM and other platforms
> is that access to the mapped framebuffer is just slow (for blending that
> hits twice), and on top of that it's mapped and unmapped for each frame,
> which causes page faults.
Comment 7 Bug Janitor Service 2021-08-07 12:56:38 UTC
A possibly relevant merge request was started @ https://invent.kde.org/plasma/kwin/-/merge_requests/1229
Comment 8 Fabian Vogt 2021-08-09 09:54:48 UTC
Git commit 11c7e7a64da3b39c89a1392f1f1112bc8c8ea8b7 by Fabian Vogt.
Committed on 09/08/2021 at 06:52.
Pushed by fvogt into branch 'master'.

scenes/opengl: Avoid blending for entirely opaque SurfaceItems

Blending is quite expensive especially with software rendering.
In the case of Firefox on Wayland, it uses a ARGB8888 buffer but marks the
entire surface as opaque, so the alpha channel can be ignored.

M  +3    -1    src/plugins/scenes/opengl/scene_opengl.cpp

https://invent.kde.org/plasma/kwin/commit/11c7e7a64da3b39c89a1392f1f1112bc8c8ea8b7
Comment 9 Sergey 2021-10-21 19:41:53 UTC
Google Chrome now suffers more from this problem, consuming almost twice as much CPU as Firefox: 14% (FF) versus 26% (GC) on my MSI GL75 9SDK laptop.

But if I monitor the CPU with htop in a semi-transparent Konsole window, CPU usage goes beyond 200%.