510734 – plasmashell and powerdevil crash in libwayland when plugging and unplugging monitor quickly

Status:	RESOLVED FIXED

Alias:	None

Product:	kwin
Classification:	Plasma
Component:	wayland-generic (other bugs)
Version First Reported In:	6.4.5
Platform:	Fedora RPMs Linux

Importance:	HI crash
Target Milestone:	---
Assignee:	KWin default assignee

URL:
Keywords:	multiscreen

Duplicates (2):	496589 510323
Depends on:
Blocks:

Reported:	2025-10-18 03:19 UTC by nyanpasu64
Modified:	2025-10-22 03:32 UTC
CC List:	3 users

See Also:
Latest Commit:	https://invent.kde.org/plasma/kwin/-/commit/655692a787261423b0a801937fb27c838ea6e314
Version Fixed/Implemented In:	6.5.1
Sentry Crash Report:

Attachments
Output of `wayland-debug -r plasmashell 2>&1 | tee plasmashell-crash-2025-10-17.txt` (914.76 KB, text/plain) 2025-10-18 03:19 UTC, nyanpasu64
Journal log of processes crashing on Plasma 6.5 Beta 2 (6.4.91) (8.89 KB, text/plain) 2025-10-19 07:19 UTC, nyanpasu64
Output of `WAYLAND_DEBUG=1 kcmshell6 kcm_kscreen` showing crashing and non-crashing runs (67.00 KB, application/zip) 2025-10-20 00:21 UTC, nyanpasu64

Description nyanpasu64 2025-10-18 03:19:17 UTC

Created attachment 185876 [details]
Output of `wayland-debug -r plasmashell 2>&1 | tee plasmashell-crash-2025-10-17.txt`

SUMMARY
When I plug and unplug a monitor quickly into my computer, plasmashell and/or powerdevil can crash with error "not a valid new object id (4278190410), message data_offer(n)".

STEPS TO REPRODUCE
1. Quickly plug and unplug a display cable before the kernel can finish reading its EDID information.

I have two monitors: a BenQ GL2760H connected over HDMI, and a VX720 CRT with overridden EDID connected through a Benfei USB-C-to-VGA adapter (from Amazon, dp/B076X2XS9R, IDK the chipset).
The CRT feeds through a VGA 4-port switch that only allows me to cycle inputs sequentially, hence when I switch inputs sometimes I connect the CRT to my computer for a split-second. I noticed this would trigger plasmashell crashes.

OBSERVED RESULT
Background: I found that Linux labels the HDMI port as DP-1, and USB-C alt mode (to VGA) as DP-2.

If I briefly plug the CRT (which is not enabled) into the DAC and then unplug the VGA side (the DAC remains in the computer), the kernel and kwin can't read the EDID, and plasmashell and/or powerdevil crash.
The exact message name (e.g. data_offer) can change; I've seen data_offer(n), mode(n), and (one time) activation(n).

The message "not a valid new object id" appears to come from wayland#src/connection.c, at https://gitlab.freedesktop.org/wayland/wayland/-/blob/main/src/connection.c?ref_type=heads#L1023-1025. This is not KDE-specific code, but I suspect either kwin is generating invalid messages out of order from a race condition, or plasmashell/powerdevil could be processing Wayland messages incorrectly? In any case I was told to post this bug to plasmashell.

I may be able to debug the code control flow further in the future, after I've cloned wayland and set up my IDE.

(Note that Firefox spews messages "Couldn't map window 0x7f758fe46980 as subsurface" every time you add or remove a screen, even if you let the computer settle.)

EXPECTED RESULT
No crash.

SOFTWARE/OS VERSIONS
Operating System: Fedora Linux 42
KDE Plasma Version: 6.4.5
KDE Frameworks Version: 6.19.0
Qt Version: 6.9.2
Kernel Version: 6.16.11-200.fc42.x86_64 (64-bit)
Graphics Platform: Wayland
Processors: 8 × Intel® Core™ i7-8559U CPU @ 2.70GHz
Memory: 16 GiB of RAM (15.5 GiB usable)
Graphics Processor: Intel® Iris® Plus Graphics 655
Manufacturer: Intel(R) Client Systems
Product Name: NUC8i7BEH
System Version: J72992-303

ADDITIONAL INFORMATION
Previously reported as Bug 508917.

Comment 1 nyanpasu64 2025-10-18 07:43:44 UTC

I looked at the Wayland source code (wl_map_reserve_new()) and https://wayland.freedesktop.org/docs/html/ch04.html.
- It appears that newly constructed object IDs (WL_ARG_NEW_ID) are required to be sent to the client in sequential order. (Presumably each client has their own namespace of IDs, otherwise two clients could allocate the same ID?)
- Application-allocated IDs range from 0 to 0xF00000 (WL_MAP_MAX_OBJECTS), and server-allocated range from 0xFF000000 (WL_SERVER_ID_START) to 0xFFF00000. (Presumably if the server presents an object created by one client to another client, it turns into a server-allocated ID, so each client has unique server-allocated IDs too?)

- In the wayland-debug log, every `=new ` argument has an object ID in the 4 billions, whereas all new IDs on → lines (created by plasmashell) have object IDs closer to zero.
  - It appears wayland-debug prints a letter after an object ID as a generation counter, rather than hexadecimal. The generations count up to z and wrap around with two characters (like Wi-Fi standards), and the largest one I've found is @557bg!
- I'm guessing that all object IDs of a connection live in a global namespace, some messages have a signature "n" for new ID (https://wayland.freedesktop.org/docs/html/apb.html#Client-structwl__message), and all "new ID"s a client receives from a server (stored in a WL_MAP_CLIENT_SIDE) have to be ≥ 0xFF000000 (4 billion). Though object IDs *created* by a client are not handled by wl_map_reserve_new on the client (only the server receiving the messages, which expects i < WL_SERVER_ID_START and stores messages in a WL_MAP_SERVER_SIDE).
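
As a rough illustration of the two ID ranges described above, here is a minimal Python sketch that classifies an object ID by which side allocated it. Only WL_SERVER_ID_START and its value come from the notes above; the helper names `id_side` and `server_index` are made up for this example and are not libwayland functions.

```python
# Value quoted in the notes above (libwayland calls it WL_SERVER_ID_START).
WL_SERVER_ID_START = 0xFF000000


def id_side(object_id: int) -> str:
    """Say which side of the connection allocated this object ID."""
    if object_id == 0:
        return "null"  # ID 0 is the null object, not a real allocation
    if object_id < WL_SERVER_ID_START:
        return "client"
    return "server"


def server_index(object_id: int) -> int:
    """Offset of a server-allocated ID from the start of the server range."""
    if object_id < WL_SERVER_ID_START:
        raise ValueError("not a server-allocated ID")
    return object_id - WL_SERVER_ID_START


# 4278190410 is the ID from the crash message in the bug description.
print(id_side(30), id_side(4278190410))             # client server
print(hex(4278190410), server_index(4278190410))    # 0xff00014a 330
```

Seen this way, the "4 billion" IDs in the log are just small offsets (here 330) into the server range.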

The message "not a valid new object id" appears if the ID is invalid in several ways. I tried using rr to trace the crash, but rr itself failed, so I tried breakpointing the failure site in gdb. Irritatingly Fedora uses LTO on libwayland-client.so.0.24.0 resulting in severe function inlining, to the point wl_display_read_events > read_events > queue_event > wl_connection_demarshal > wl_map_reserve_new is inlined 4 times to a single god function! To get around this, I fed the shared object into Ghidra, identified all jumps to "errno = 0x16" (EINVAL), converted their addresses from Ghidra's base address of 00100000 to gdb's `info proc mappings` of 0x00007ffff7c60000, and set breakpoints conditional on $eflags.
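
The address conversion is plain rebasing. A small sketch of the arithmetic, assuming the mapping start reported by `info proc mappings` corresponds to file offset 0 of the library; the Ghidra address used below is a made-up placeholder, not one of the real jump sites:

```python
GHIDRA_BASE = 0x00100000               # Ghidra's image base, as stated above
GDB_MAPPING_BASE = 0x00007FFFF7C60000  # library start from `info proc mappings`


def ghidra_to_runtime(ghidra_addr: int,
                      ghidra_base: int = GHIDRA_BASE,
                      runtime_base: int = GDB_MAPPING_BASE) -> int:
    """Rebase an address shown in Ghidra onto the library's runtime mapping."""
    offset = ghidra_addr - ghidra_base
    if offset < 0:
        raise ValueError("address is below the Ghidra image base")
    return runtime_base + offset


# Hypothetical Ghidra address of one "errno = 0x16" site (placeholder value).
runtime_addr = ghidra_to_runtime(0x0010A3F0)
print(f"break *{runtime_addr:#x}")  # break *0x7ffff7c6a3f0
```

The printed line can be pasted into gdb as a breakpoint command once the real addresses are substituted.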

I found that the crash appears to occur when `wl_map_reserve_new() -> if (count < i) {` fails because the server-side ID is greater than the first unused ID (in the log I posted, the new ID 4278190164 is more than 1 greater than the largest ID ever used = 4278190158a!) Just before the crash occurs (and at times beforehand) I see a large number of .destroy() commands. Is this from closing a window?
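
To make the failing check concrete, here is a simplified Python model of the client-side bookkeeping as I understand it. It is a sketch, not the libwayland source: `ToyClientMap` is a made-up class, it merges reserving and inserting into one step, and it leaves out the other error paths of the real function.

```python
import errno

WL_SERVER_ID_START = 0xFF000000  # first server-allocated object ID


class ToyClientMap:
    """Simplified stand-in for the server-ID half of a client-side wl_map."""

    def __init__(self):
        # Entry i belongs to object ID WL_SERVER_ID_START + i.
        self.server_entries = []

    def reserve_new(self, object_id):
        """Return 0 on success, or the errno the modelled check would set."""
        if object_id < WL_SERVER_ID_START:
            return errno.EINVAL  # clients only accept new IDs from the server range
        i = object_id - WL_SERVER_ID_START
        count = len(self.server_entries)
        if count < i:
            return errno.EINVAL  # gap: the ID skips past the first unused slot
        if count == i:
            self.server_entries.append("live")  # grow the map by exactly one slot
            return 0
        if self.server_entries[i] is not None:
            return errno.EINVAL  # slot is still occupied
        self.server_entries[i] = "live"  # reuse a freed slot
        return 0


toy_map = ToyClientMap()
# Pretend the client has seen every server ID up to 4278190158, as in the log.
for object_id in range(WL_SERVER_ID_START, 4278190158 + 1):
    assert toy_map.reserve_new(object_id) == 0

print(toy_map.reserve_new(4278190164) == errno.EINVAL)  # True: skips 5 unused IDs
print(toy_map.reserve_new(4278190159))                  # 0: the next sequential ID is fine
```

In this model the only server ID the client will accept next is 4278190159; anything larger hits the `count < i` branch, which matches the EINVAL (0x16) seen at the crash site.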

I suspect the crash happens because kwin's Wayland server allocates a large number of object IDs for a display being plugged in, the display is unplugged so the events never get sent, but the object IDs fail to be freed. Then the server attempts to send kde_output_device_v2@client.mode(mode=new kde_output_device_mode_v2@server) with an invalid object ID (https://wayland.app/protocols/kde-output-device-v2). (Why is plasmashell defining a kde_output_device_v2? it does seem to have emitted several → new kde_output_device_v2, and *received* messages for some of them.) I just wish that libwayland printed *which* object ID on the server sent an invalid message, not just the invalid new ID.

.data_offer(n) appears to come from ext_data_control_device_v1@client. https://wayland.app/protocols/wayland-protocols/336#ext_data_control_device_v1 is unclear and appears related to selections and clipboards? plasmashell might be getting it because of clipboard history?

----
How was the invalid object ID created by the server? My best guess is that kwin calls wl_resource_create() or wl_resource_post/queue_event(_array) -> wl_closure_send/queue -> serialize_closure(closure, buffer, buffer_count), libwayland extracts closure->message->signature and sees a WL_ARG_NEW_ID='n', then takes the wl_argument::n provided by kwin. And kwin screwed up its object ID accounting? I still think this is a kwin (or even kwayland?) bug rather than a plasmashell bug, but plasmashell and powerdevil are the only clients interacting with kde_output_device_v2 to trigger object ID accounting errors. And no, killing plasmashell does not kill applications, killing kwin does.

I'm not sure how kwin outputs Wayland events. Searching around, WaylandOutputDeviceV2 is only referenced from kwin's autotests. kwin#src/wayland/outputdevice_v2.cpp does call send_current_mode(), and OutputDeviceV2InterfacePrivate::kde_output_device_v2_bind_resource() perfectly matches the order seen by clients... just that send_current_mode() seems to be generated code :(

----
Sidenote: `rr plasmashell` crashed on error:
[FATAL src/record_syscall.cc:6733:rec_process_syscall_arch()] 
 (task 130517 (rec:130517) at time 596483)
 -> Assertion `t->regs().syscall_result_signed() == -syscall_state.expect_errno' failed to hold. Expected EINVAL for 'ioctl' but got result 0 (errno SUCCESS); Unknown ioctl(0x81009431): type:0x94 nr:0x31 dir:0x2 size:256 addr:0x7f73bbffe110

This is very Error: Success.
Looks like rr doesn't know how to handle plasmashell's btrfs syscalls?

Comment 2 nyanpasu64 2025-10-18 08:10:23 UTC

https://lists.freedesktop.org/archives/wayland-devel/2014-April/014121.html (2014) says that wl_display has its own queue and messages can be created (I'm unsure whether also sent) out of order. I'd need someone more familiar with Wayland's concurrency model to figure out what's going on.

Comment 3 David Edmundson 2025-10-18 09:45:40 UTC

If you can test with Plasma 6.5 that would be highly appreciated.

Comment 4 nyanpasu64 2025-10-19 01:13:44 UTC

I've been unable to make a crash happen on a different machine running Arch Linux KDE Beta 6.4.91. I tried installing https://copr.fedorainfracloud.org/coprs/g/kdesig/kde-beta/ on my USB-C machine, but dnf reported broken dependencies on KDE software.

Operating System: Arch Linux 
KDE Plasma Version: 6.4.91
KDE Frameworks Version: 6.19.0
Qt Version: 6.10.0
Kernel Version: 6.17.0-rc7-1-drm-tip-git-gaf3cdefd0a1a (64-bit)
Graphics Platform: Wayland
Processors: 12 × AMD Ryzen 5 5600X 6-Core Processor
Memory: 16 GiB of RAM (15.5 GiB usable)
Graphics Processor: Intel® Arc
Manufacturer: Gigabyte Technology Co., Ltd.
Product Name: B550M DS3H

Comment 5 nyanpasu64 2025-10-19 07:19:31 UTC

Created attachment 185895 [details]
Journal log of processes crashing on Plasma 6.5 Beta 2 (6.4.91)

By transplanting my main system's SSD to the NUC with USB-C display output, I was able to reproduce the crash on Plasma 6.5 Beta 2.
This time, *four* processes crashed, kded6, powerdevil, systemsettings (IIRC I had opened it to the Display panel to force full-range RGB?), and kded6.

Operating System: Arch Linux 
KDE Plasma Version: 6.4.91
KDE Frameworks Version: 6.19.0
Qt Version: 6.10.0
Kernel Version: 6.17.3-arch2-1 (64-bit)
Graphics Platform: Wayland
Processors: 8 × Intel® Core™ i7-8559U CPU @ 2.70GHz
Memory: 16 GiB of RAM (15.5 GiB usable)
Graphics Processor: Intel® Iris® Plus Graphics 655
Manufacturer: Intel(R) Client Systems
Product Name: NUC8i7BEH
System Version: J72992-303

Comment 6 nyanpasu64 2025-10-20 00:21:10 UTC

Created attachment 185908 [details]
Output of `WAYLAND_DEBUG=1 kcmshell6 kcm_kscreen` showing crashing and non-crashing runs

I decided to log client communications by running `bash -c 'WAYLAND_DEBUG=1 kcmshell6 kcm_kscreen & echo $!' 2>&1 | tee kcmshell6.log`.

If I rapidly switch monitors, I see many events of form `[2341899.277] discarded [unknown]#4278190117.[event 3](0 fd, 8 byte)`, followed by error `not a valid new object id (4278190164), message mode(n)` and the window closing. The funny thing is that there actually *is* a preceding ID 4278190163, except it only takes the form of a discarded message:

[2352757.126] discarded [unknown]#4278190163.[event 0](0 fd, 16 byte)
[2352757.128] discarded [unknown]#4278190163.[event 1](0 fd, 12 byte)
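
For anyone wanting to sift a large WAYLAND_DEBUG log for these two kinds of lines, a small Python sketch follows. The regular expressions are written against the line shapes quoted above only, so they are an assumption about the format rather than a complete parser; `summarize` is a name invented for this example.

```python
import re

# Shapes taken from the log lines quoted above.
DISCARDED = re.compile(
    r"\[\s*[\d.]+\]\s+discarded \[unknown\]#(?P<id>\d+)"
    r"\.\[event (?P<event>\d+)\]\(\d+ fd, (?P<size>\d+) byte\)"
)
INVALID_NEW_ID = re.compile(
    r"not a valid new object id \((?P<id>\d+)\), message (?P<msg>\w+)\("
)


def summarize(lines):
    """Collect discarded events per object ID and the fatal new-ID error, if any."""
    discarded = {}
    invalid = None
    for line in lines:
        match = DISCARDED.search(line)
        if match:
            entry = (int(match["event"]), int(match["size"]))
            discarded.setdefault(int(match["id"]), []).append(entry)
            continue
        match = INVALID_NEW_ID.search(line)
        if match:
            invalid = (int(match["id"]), match["msg"])
    return discarded, invalid


sample = [
    "[2341899.277] discarded [unknown]#4278190117.[event 3](0 fd, 8 byte)",
    "[2352757.126] discarded [unknown]#4278190163.[event 0](0 fd, 16 byte)",
    "[2352757.128] discarded [unknown]#4278190163.[event 1](0 fd, 12 byte)",
    "not a valid new object id (4278190164), message mode(n)",
]
discarded, invalid = summarize(sample)
print(discarded)  # {4278190117: [(3, 8)], 4278190163: [(0, 16), (1, 12)]}
print(invalid)    # (4278190164, 'mode')
```

Comparing the largest discarded ID with the ID in the error makes the "one past a discarded ID" pattern easy to spot.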

If I switch monitors slowly (kcmshell6-normal.log), I *do* see discarded messages, but with 8-byte payloads, and not directly following a global/bind of "kde_output_device_v2". Oddly, if I slowly connect, disconnect, and reconnect the CRT, the modes never reuse an ID; the server keeps sending continually increasing ones to kcmshell6.

## Analysis

By comparing kcmshell6*-fmt.log, I think I have a lead on what's going on. Of course it's a race condition...
I suggest opening the two formatted files in separate editor panes (eg. VS Code) and enabling synchronized/locked scrolling. I'll be commenting on the messages received.

// plugged in a new display!
- The server's registries inform the client we have a new kde_output_device_v2 global, and the client binds it to a new ID. (!!!)
- The server's kde_output_order_v1 tells the client we now have named outputs DP-1 and DP-2. (This is unrelated to the crash.)
- The server's registries inform the client we have a new wl_output global, the client binds it to a new ID, and asks the zxdg_output_manager_v1 to create a new zxdg_output_v1 for the wl_output. (This is unrelated to the crash.)
- Why do we have two wl_registry and we bind the kde_output_device_v2 and wl_output globals to a different one? I don't know.

// remove global kde_output_device_v2 89, global wl_output 90
// destroy id zxdg_output_v1#61, id wl_output#65
// don't destroy id kde_output_device_v2#30.
[2352754.786] {Default Queue} wl_registry#71.global_remove(89)
[2352754.794] {Default Queue} wl_registry#2.global_remove(89)
...
If we unplug the display quickly, we have an unscheduled interruption:
- The server's wl_registries remove the global kde_output_device_v2 and wl_output. (!!!)
- The server's kde_output_order_v1 tells the client we have named output DP-1 only. (This is unrelated to the crash.)
- The server destroys the object IDs for the zxdg_output_v1 and wl_output, but *not* the kde_output_device_v2 (even though the corresponding global is gone!).

// the global kde_output_device_v2 is deleted, but we keep receiving events from its binding,
// and the client doesn't recognize them.
In the normal display connection, the server proceeds to send the client the kde_output_device_v2's resolutions/properties. If we've unplugged the display, the server deletes the kde_output_device_v2 global but keeps sending the client messages from the binding. WAYLAND_DEBUG can't understand them (and libwayland/wl_map later crashes when reserving IDs after the new mode IDs).

// this is supposed to create new id kde_output_device_mode_v2#4278190159, but we don't understand it.
[2352757.103] discarded [unknown]#30.[event 2](0 fd, 12 byte)

// now we receive events from an ID we've never seen before.
// but this isn't an error. *creating* invalid ids is.
- We get unknown messages from kde_output_device_v2 attempting (and failing) to create kde_output_device_mode_v2#4278190159 through 4278190163, along with unknown messages from the modes.

// set current mode
- The disconnected display receives a flurry of distinct event IDs from the kde_output_device_v2 that shouldn't exist.

// delete_id zxdg_output_v1, wl_output (both previously destroyed/released)
[2352757.178] {Display Queue} wl_display#1.delete_id(61)
[2352757.184] {Display Queue} wl_display#1.delete_id(65)
- It's strange that even when *normally* unplugging a monitor, we fail to delete the kde_output_device_v2 and kde_output_device_mode_v2, and can never reuse their IDs. If you open `kcmshell6-normal.log` and search for 4278190082, it's allocated for the LCD's (1920, 1080) mode, never destroyed, and the ID is never reused but instead keeps incrementing. I suspect this ID leak bug could cause issues for long-lived processes like plasmashell.

// a second passes, plug the monitor back in.
not a valid new object id (4278190164), message mode(n)
The Wayland connection experienced a fatal error: Invalid argument

The server created modes with IDs up to 4278190163 and never deleted them. The client never saw those modes. The next time the Wayland server needs to send an object to the client (eg. interactions, monitor changes, clipboard events), it thinks the next unused ID is 4278190164, but the client thinks this ID is invalid because of the gap from the last ID it saw.
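
The sequence can be replayed with a toy client in Python. This is a sketch under assumptions: `ToyClient` is invented for illustration, object 29 is a hypothetical still-live object standing in for whatever the server talks to next, and "the client no longer recognizes object 30" is modelled by simply dropping it from a set, since why the client forgets it is still an open question here.

```python
WL_SERVER_ID_START = 0xFF000000


class ToyClient:
    """Toy model: events on unknown objects are dropped without parsing new IDs."""

    def __init__(self, live_objects, server_ids_seen):
        self.live = set(live_objects)           # object IDs the client recognizes
        self.server_ids_seen = server_ids_seen  # how many server-range IDs were reserved

    def handle_event(self, sender_id, new_id=None):
        if sender_id not in self.live:
            return "discarded"  # any new ID carried by the event is never reserved
        if new_id is not None:
            index = new_id - WL_SERVER_ID_START
            if index > self.server_ids_seen:
                return f"not a valid new object id ({new_id})"
            if index == self.server_ids_seen:
                self.server_ids_seen += 1
            self.live.add(new_id)
        return "ok"


# 30 is the kde_output_device_v2 binding from the log; 29 is hypothetical.
# 79 server IDs seen means everything up to 4278190158 is known to the client.
client = ToyClient(live_objects={29, 30}, server_ids_seen=79)
client.live.discard(30)  # the client stops recognizing the binding

# The server still creates modes 4278190159..4278190163 on object 30.
results = [client.handle_event(30, new_id) for new_id in range(4278190159, 4278190164)]
# Later, the server sends a new object on something the client does know.
results.append(client.handle_event(29, 4278190164))
print(results)
```

The first five events come back as "discarded" and the last one as "not a valid new object id (4278190164)", which is the same shape as the log above.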

## Next Steps

- I'm guessing the bug is that kwin_wayland keeps sending messages from a kde_output_device_v2 binding *after* it's issued a global_remove() to the source object.
    - Why can't the client parse the messages? Did it see the global_remove() and invalidate (forget the interface/type of) all object IDs based on those globals, or were the messages sent from the server corrupted in some way?
    - Is this a bug and how should the client handle it? https://wayland.freedesktop.org/docs/html/apa.html#protocol-spec-wl_registry-event-global_remove says the object IDs remain valid for the *client* to send messages, until the client sees the global_remove and replies by destroying the object. It doesn't say how the server should act.
- I think you should find a way to destroy the IDs allocated to displays and modes so they don't leak infinitely in the client (right now it happens even if you don't crash).

Comment 7 Bug Janitor Service 2025-10-20 05:04:04 UTC

A possibly relevant merge request was started @ https://invent.kde.org/plasma/kwin/-/merge_requests/8274

Comment 8 nyanpasu64 2025-10-20 10:58:12 UTC

Git commit be787d3ac4820e5059164c0436e765935b9c2654 by Tabby Kitten.
Committed on 20/10/2025 at 10:01.
Pushed by meven into branch 'master'.

OutputDeviceV2Interface: guard for global removed in bind

When we plug in a monitor, we register a global kde_output_device_v2 to
clients. If we unplug the monitor and delete the global while a client
is trying to bind it, we would send a client a global_remove message
*before* messages to the bound object ID. This caused Wayland clients to
not recognize the object IDs the server tried sending to the client. The
next time the server tried sending the client an object ID, it would be
greater than the last object ID the client saw + 1, causing the client
app to exit with errors like:
	not a valid new object id (4278190164), message mode(n)
	The Wayland connection experienced a fatal error: Invalid argument

Fix this bug by not sending messages from objects belonging to globals
we've already removed.

This is safe because both binding (from wl_event_loop_dispatch) and
removing outputs (from KWin::DrmBackend::handleUdevEvent) run in the
same main thread.

M  +4    -0    src/wayland/outputdevice_v2.cpp

https://invent.kde.org/plasma/kwin/-/commit/be787d3ac4820e5059164c0436e765935b9c2654
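
For readers who don't want to open the diff, the idea of the guard can be modelled in a few lines of Python. This is a toy, not the KWin patch: `ToyOutputDeviceGlobal` and its fields are invented for illustration, and the real change lives in C++ in src/wayland/outputdevice_v2.cpp.

```python
class ToyOutputDeviceGlobal:
    """Toy server-side global that announces its modes to each binding client."""

    def __init__(self, modes):
        self.modes = list(modes)
        self.removed = False
        self.sent = []  # messages "sent" to the client, in order

    def remove(self):
        """Monitor unplugged: announce removal of the global."""
        self.removed = True
        self.sent.append("global_remove")

    def bind(self, client_object_id):
        """A (possibly late) bind request from the client arrives."""
        if self.removed:
            return  # the guard: send nothing for a global that is already gone
        for mode in self.modes:
            self.sent.append(f"#{client_object_id}.mode({mode})")


device = ToyOutputDeviceGlobal(modes=["1920x1080", "1280x720"])
device.remove()   # unplug happens first...
device.bind(30)   # ...then the client's bind request is processed
print(device.sent)  # ['global_remove']
```

Without the early return, the two mode messages would follow the global_remove, which is the ordering described in the commit message.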

Comment 9 Zamundaaa 2025-10-20 12:21:22 UTC

*** Bug 510323 has been marked as a duplicate of this bug. ***

Comment 10 Zamundaaa 2025-10-20 12:21:29 UTC

*** Bug 496589 has been marked as a duplicate of this bug. ***

Comment 11 Méven 2025-10-20 18:46:50 UTC

Git commit 655692a787261423b0a801937fb27c838ea6e314 by Méven Car.
Committed on 20/10/2025 at 11:22.
Pushed by meven into branch 'Plasma/6.5'.

OutputDeviceV2Interface: guard for global removed in bind

When we plug in a monitor, we register a global kde_output_device_v2 to
clients. If we unplug the monitor and delete the global while a client
is trying to bind it, we would send a client a global_remove message
*before* messages to the bound object ID. This caused Wayland clients to
not recognize the object IDs the server tried sending to the client. The
next time the server tried sending the client an object ID, it would be
greater than the last object ID the client saw + 1, causing the client
app to exit with errors like:
	not a valid new object id (4278190164), message mode(n)
	The Wayland connection experienced a fatal error: Invalid argument

Fix this bug by not sending messages from objects belonging to globals
we've already removed.

This is safe because both binding (from wl_event_loop_dispatch) and
removing outputs (from KWin::DrmBackend::handleUdevEvent) run in the
same main thread.


(cherry picked from commit be787d3ac4820e5059164c0436e765935b9c2654)

Co-authored-by: Tabby Kitten <nyanpasu64@tuta.io>

M  +4    -0    src/wayland/outputdevice_v2.cpp

https://invent.kde.org/plasma/kwin/-/commit/655692a787261423b0a801937fb27c838ea6e314