Bug 372991 - Terminal gets stuck on interrupting a program that is outputting, preventing further output from being shown
Summary: Terminal gets stuck on interrupting a program that is outputting, preventing ...
Status: RESOLVED UPSTREAM
Alias: None
Product: konsole
Classification: Applications
Component: general (show other bugs)
Version: master
Platform: Arch Linux Linux
: NOR normal
Target Milestone: ---
Assignee: Konsole Developer
URL:
Keywords:
: 376462 (view as bug list)
Depends on:
Blocks:
 
Reported: 2016-11-27 11:31 UTC by Peter Wu
Modified: 2017-06-17 12:07 UTC (History)
5 users (show)

See Also:
Latest Commit:
Version Fixed In:


Attachments
perf trace log (executing xxd, interrupt, xxd, interrupt (=hang), ls) (392.60 KB, application/x-gzip)
2016-11-27 11:31 UTC, Peter Wu
Details
debug patch (1.65 KB, patch)
2017-02-18 01:43 UTC, Peter Wu
Details
smaller kernel config that is able to boot archiso (53.46 KB, text/x-mpsub)
2017-03-26 23:12 UTC, Peter Wu
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Peter Wu 2016-11-27 11:31:01 UTC
Created attachment 102469 [details]
perf trace log (executing xxd, interrupt, xxd, interrupt (=hang), ls)

When interrupting a program (^C), sometimes the output in Konsole is no longer being updated.

Steps to reproduce:
1. Open Konsole (a new window, tab, etc.)
2. Execute "xxd < /dev/urandom"
3. Press ^C after a few seconds.
4. Repeat 2-3 a few times until the output is no longer updating.

Expected results:
Terminal should always keep the output updated, showing the prompt when the program is interrupted. Example output:

00011310: e015 2350 cb8e dba9 00f2 8b36 a030 4b6a  ..#P.......6.0Kj
00011320: 1748 a332 bb9f 07d2 dffe 0098 a236 ad69  ^C
peter@al:/tmp$ [cursor]


Actual results:
Terminal output stops updating. Example output:

000125a0: ab6d dbb7 f02f 8288 2f94 e893 1469 8190  .m.../../....i..
000125b0: 7413 52ce c39e 2843 aeb9 d423 57c3 413b  t.R...(C...#W.A;
000125c0: 3e6a 4f8d 5c88 d34f e68[cursor]

Input is still accepted and the terminal in fact still functions, except that the output is not being updated (the cursor position does not advance).
If you execute "touch x" for example, the file "x" will be created.
The tab title is also updated (execute "sleep 10" for example).
The output is still not updated in all these cases:
* "Saving Output As..." also shows the same.
* "Clear Scrollback" removes scrollback, but that's it.
* "Clear Scrollback and Reset" wipes a lot and moves the output, but that's it.


Other details:
As developer I have this issue for some months now and it does not always require a lot of output (it also happened when interrupting the "ninja" build command).

Reproduction under strace failed, but "perf trace -o st.txt src/konsole" was successful to reproduce it.

Environment details:
- Hardware: laptop (CPU i7-6700HQ, 32GB RAM).
- konsole 16.08.3-1 (Arch Linux packages)
- konsole v16.08.2-83-g378518e (built with ASAN)
- qt5-base v5.7.0-6 (also reproduced when built with ASAN)
Comment 1 Peter Wu 2016-11-27 11:35:20 UTC
Other reproducers:

xxd /dev/urandom
yes | nl
Comment 2 Patrick Rudolph 2017-01-11 08:15:46 UTC
I can easily reproduce this issue as descripted by Peter.
It got stuck 3 times out of 10 test runs.
The problem only affects a single tab. All other tabs in the same konsole remain working.

* konsole-16.11.90-2.1 (SUSE backport)
* qtbase-5.6.1-8.1 (SLES12)
* plasma5-desktop-5.8.4-1.1 (SUSE backport)
Comment 3 Kurt Hindenburg 2017-02-12 15:48:07 UTC
well that's interesting... I'll try to research
Comment 4 Hello71 2017-02-15 00:59:57 UTC
*** Bug 376462 has been marked as a duplicate of this bug. ***
Comment 5 Hello71 2017-02-15 01:06:37 UTC
this is clearly some type of race condition, since it only breaks sometimes.

you can try "taskset -c 0 konsole", which for me fixes the problem.
Comment 6 Kurt Hindenburg 2017-02-16 15:48:40 UTC
From what I can tell, KDE4 Konsole doesn't have this issue.  It appears the conversion to KF5 introduced this.
Comment 7 Egmont Koblinger 2017-02-17 23:34:37 UTC
I can second Hello71's observation: "taskset -c 0 konsole" works perfectly for me. Looks like a race condition around multiple threads, or something like this.
Comment 8 Peter Wu 2017-02-18 01:43:16 UTC
Created attachment 104086 [details]
debug patch

I have traced the issue into kpty, KPtyDevicePrivate::_k_canRead.

The ioctl(q->masterFd, FIONREAD, &available) returns 0 available data. Subsequently reading 0 bytes also returns 0 (with errno=0). This is considered EOF which then prevents further read events from being reported.

tty_ioctl(4) documents FIONREAD as: "Get the number of bytes in the input buffer."
Not sure if "0 available bytes" should be interpreted as EOF though for ptmx devices.

Attached is a debug patch. I did not observe ioctl failing in the output. Put a gdb breakpoint on the "!readBytes" block (line with "readNotifier->setEnabled(false);") and invoke "call fflush(0)". If you "return", the output is processed normally.

available 4095
fd=12 readBytes=4095 errno=0
available 4095
fd=12 readBytes=4095 errno=0
available 0
fd=12 readBytes=0 errno=0

Would you see any issues with returning false (without logging other warnings nor emitting signals) when available==0? Surely if there is really an EOF condition, read would return 0 which actually detects the EOF?
Comment 9 Egmont Koblinger 2017-02-18 10:02:43 UTC
Without having looked at the code, I'm wondering how FIONREAD can return 0... I mean I guess it's preceded by a select() or similar that waits for data to appear, and the reason for reaching FIONREAD is that it signaled some data available. But maybe this is where somehow (how?) multithreaded race condition-y stuff kicks in.

As far as I know, terminal devices don't have the concept of EOF, 0 should mean "oops someone just grabbed that data before us". (There's a concept of EOF, as in "stty eof" in the other part, from the keyboard to the kernel driver.)
Comment 10 Peter Wu 2017-02-18 12:54:49 UTC
(In reply to Hello71 from comment #5)
> this is clearly some type of race condition, since it only breaks sometimes.
> 
> you can try "taskset -c 0 konsole", which for me fixes the problem.

Even with "tasksel -c 0 konsole" I can reproduce the problem if I repeat the experiment often enough.

(In reply to Egmont Koblinger from comment #9)
> Without having looked at the code, I'm wondering how FIONREAD can return
> 0... I mean I guess it's preceded by a select() or similar that waits for
> data to appear, and the reason for reaching FIONREAD is that it signaled
> some data available. But maybe this is where somehow (how?) multithreaded
> race condition-y stuff kicks in.

Comparing the "perf trace" output you can basically observe the following pattern:
GOOD:
xxd: write(</dev/pts/7>, ..., count: 68) ...
konsole: write(</dev/ptmx>, ..., count: 1) = 1    (is this "^C"?)
konsole: poll() = 1
xxd: ... [continued] write = 67
konsole: ioctl(</dev/ptmx>, FIONREAD) = 0
konsole: read(</dev/ptmx>, ..., count: 2) = 2
bash: ... [continued] wait4() = (xxd)

BAD:
xxd: write(</dev/pts/7>, ..., count: 68) ...
konsole: write(</dev/ptmx>, ..., count: 1) = 1
konsole: poll() = 1
konsole: ioctl(</dev/ptmx>, FIONREAD) = 0
konsole: read(</dev/ptmx>, ..., count: 0) = 0
xxd: ... [continued] write = 67
bash: ... [continued] wait4() = (xxd)


In another (newer) trace I found this:
GOOD:
xxd: write(</dev/pts/60>, ..., count: 68) ...
konsole: write(</dev/ptmx>, ..., count: 1) = 1
konsole: poll() = 1
konsole: ioctl(</dev/ptmx>, FIONREAD) = 0
konsole: read(</dev/ptmx>, ..., count: 2) = 2
xxd: ... [continued] write = 67
bash: ... [continued] wait4() = (xxd)

BAD:
xxd: write(</dev/pts/60>, ..., count: 68) = 67
konsole: write(</dev/ptmx>, ..., count: 1) = 1
konsole: poll() = 1
konsole: ioctl(</dev/ptmx>, FIONREAD) = 0
konsole: read(</dev/ptmx>, ..., count: 0) = 0
bash: ... [continued] wait4() = (xxd)


So apparently there is a pending write from the slave device which seems to trigger poll (I guess?). But when the input buffer of the master device is checked, the data is not present (FIONREAD.available=0, hence read count 0). Maybe it is a Linux kernel issue (currently 4.9.5-1-ARCH) or reliance on invalid assumptions?

It shouldn't hurt to not disable the file watcher? If the class is destroyed, then the watcher is killed anyway?
Comment 11 Peter Wu 2017-03-04 21:50:08 UTC
Looks like a race condition. Reproduced in QEMU 2.8.0 with Arch Linux and Konsole 16.12.2 with plasma-desktop 5.9.3 (and also openbox). Cannot reproduce with just one CPU (need -smp 2 or more). Command to start QEMU:

qemu-system-x86_64 -M pc,accel=kvm -m 2G -vga qxl -net user -net nic,model=virtio -drive if=virtio,file=plasma.qcow2 -initrd initrd.img -kernel arch/x86/boot/bzImage -append "rw root=/dev/vda1" -smp 4

A normal user was created with autologin (agetty --autologin). In openbox (where no background activity exists), it was necessary to run "while ls&ls;do :;done" in a separate xterm process (which would display just the directory "x" in the home directory on an 8GB ext4 filesystem).

Tested kernels (BAD = got hang, OK = could not reproduce hang within a few minutes):
BAD v4.9.11
BAD v4.10.1
BAD v4.1.38
OK v3.12.70 (Note: to update display, had to switch between QEMU monitor and back)
OK v3.18.48
OK v4.0.9 (needed "ln -s compiler-gcc5.h ../include/linux/compiler-gcc6.h" to fix GCC6 build)
BAD v4.1.38 (yes, still bad, just to be sure)
BAD v4.1.15
OK v4.1.1 (could not reproduce)
OK v4.1.7-44-g49b85054a83d (git bisect starts)
BAD v4.1.13-91-g8d0fe5721d27
[OK v4.1.10-89-g5eb491ba5d06 (actually, this is also BAD! see below)]
BAD v4.1.12-7-g4508582e6a83
[OK v4.1.10-174-gc1d40e01ad8c (actually, this is also BAD! see below)]
BAD v4.1.11-13-g7b61554c25cb
BAD v4.1.10-195-g614ea4ea2c3f (hmm, I suspect that this commit is bad)
BAD v4.1.10-194-ga0533fb8cf60 (wait, I can still reproduce? did not expect that, bisect probably wrong)
BAD v4.1.10-184-g0cf68c236f11
BAD v4.1.10-179-g249af812dcf3
BAD v4.1.10-174-gc1d40e01ad8c (not good, bad! Can also reproduce issue)
BAD v4.1.10-89-g5eb491ba5d06 (finally got hang after trying for 5 minutes)
OK v4.1.1 (can still not reproduce after 8 minutes)
[v4.1.10-89-g5eb491ba5d06 (can still not reproduce after 6 minutes)]
[BAD v4.1.10-194-ga0533fb8cf60 (can still reproduce after 2 minutes)]

Gave up after 2-3 hours of trying... The first known bad version is v4.1.10-89-g5eb491ba5d06. The first *possibly* good version is v4.1.7-44-g49b85054a83d, but a quick scan through the commits (keywords, "tty", "console", "lock") does not yield suspicuous commits.

I think I'll just try to patch this without knowing the root cause.
Comment 12 Peter Wu 2017-03-04 22:57:44 UTC
Proposed patch: https://git.reviewboard.kde.org/r/129984/
Comment 13 Martin Sandsmark 2017-03-26 15:04:07 UTC
Are you sure that 4.1 was okay?

There is only a single commit to any relevant tty code that I can spot between v4.1.1 and v4.1.10-89-g5eb491ba5d06: https://www.spinics.net/lists/stable-commits/msg47515.html (3b19e032295647b7be2aa3be62510db4aaeda759).

There's of course a bunch of commits that might be the source in e. g. kernel/ as well, though.

I'll see if I can try to narrow it down further.
Comment 14 Peter Wu 2017-03-26 20:08:12 UTC
(In reply to Martin Sandsmark from comment #13)
> Are you sure that 4.1 was okay?

Not sure if it was okay, but I wasn't able to reproduce the issue. The difficult/annoying thing is that some versions easily trigger the problem while others only show up after trying for some time.

> There is only a single commit to any relevant tty code that I can spot
> between v4.1.1 and v4.1.10-89-g5eb491ba5d06:
> https://www.spinics.net/lists/stable-commits/msg47515.html
> (3b19e032295647b7be2aa3be62510db4aaeda759).

which first ended up in v4.1.5 as ba3961ad681981dc74fcd519b8f98be8bc3ac381
Comment 15 Martin Sandsmark 2017-03-26 22:42:09 UTC
Well, it wasn't that commit, managed to reproduce it with that reverted.

Running an arch install pretty similar to your setup but with virtualbox. Also made an unholy mess of shell scripts that seems to trigger it more consistently, only problem is that it takes forever to build the kernel (I'm too lazy to modify the config to trim it down).
Comment 16 Martin Sandsmark 2017-03-26 22:45:40 UTC
also, I'm not sure if I'm somehow triggering this bug instead sometimes (or if these are related): https://bugs.kde.org/show_bug.cgi?id=230184
Comment 17 Peter Wu 2017-03-26 23:12:31 UTC
Created attachment 104751 [details]
smaller kernel config that is able to boot archiso

(In reply to Martin Sandsmark from comment #15)
> Well, it wasn't that commit, managed to reproduce it with that reverted.

What about testing the previous commit? Simply reverting might not work as there can be other patches that cause this issue.

> Running an arch install pretty similar to your setup but with virtualbox.
> Also made an unholy mess of shell scripts that seems to trigger it more
> consistently, only problem is that it takes forever to build the kernel (I'm
> too lazy to modify the config to trim it down).

I am attaching a kernel config from which I based my test kernels (the kernel config I used for bisection got lost). It also has ACPI debugging features enabled (remainder of some other bughunt), you could disable that if you want. What do you use for "quickly" triggering the issue? Something automated would be awesome :)

(In reply to Martin Sandsmark from comment #16)
> also, I'm not sure if I'm somehow triggering this bug instead sometimes (or
> if these are related): https://bugs.kde.org/show_bug.cgi?id=230184

It looks unrelated, that seems to be an issue with firing the timer, the issue in this bug is caused by a watcher that is removed.
Comment 18 Martin Sandsmark 2017-03-27 07:11:49 UTC
> What do you use for "quickly" triggering the issue? Something automated would be awesome :)

I started by trying to do that by automatically launching xxd in the background and then sleeping and killall xxd, but it didn't seem to work. but it since it did this in a loop, when I tried to ctrl+c it it just continued the next iteration, which meant I could quickly ctrl+c a bunch to trigger it.

> It looks unrelated, that seems to be an issue with firing the timer, the issue in this bug is caused by a watcher that is removed.

the cause is probably different, but the result is more or less the same (spamming a ton of output -> konsole seemingly hangs).
Comment 19 Peter Wu 2017-05-21 10:20:14 UTC
This bug is possibly resolved in the Linux kernel with this commit:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?id=77dae6134440420bac334581a3ccee94cee1c054

This change (v4.11-rc2-16-g77dae6134440) has been backported as:
v4.11.1-99-g34e01f920739
v4.10.16-82-g58d479441029
v4.9.28-70-g89c91ea37581
v4.4.68-46-g9bd2cc56a089
Comment 20 Martin Sandsmark 2017-06-17 12:07:55 UTC
That explains why I haven't seen that in a while. Feel free to re-open if you manage to reproduce it with a kernel version with that patch in, I can't do that here.