Bug 452274 - memcheck crashes with Assertion 'sci->status.what == SsIdle' failed
Summary: memcheck crashes with Assertion 'sci->status.what == SsIdle' failed
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: memcheck (other bugs)
Version First Reported In: 3.19 GIT
Platform: FreeBSD Ports FreeBSD
: NOR crash
Target Milestone: ---
Assignee: Paul Floyd
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-04-04 20:09 UTC by Nick Briggs
Modified: 2022-04-13 07:14 UTC (History)
1 user (show)

See Also:
Latest Commit:
Version Fixed/Implemented In:
Sentry Crash Report:


Attachments
C source test case (689 bytes, text/x-csrc)
2022-04-07 21:13 UTC, Nick Briggs
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Nick Briggs 2022-04-04 20:09:16 UTC
SUMMARY

Running the latest valgrind from git, valgrind-3.19.0.RC1-615731617b-20220404 on FreeBSD 13.0-RELEASE-p8 on i386 hardware
generated an assertion failure:

==56733== Memcheck, a memory error detector
==56733== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==56733== Using Valgrind-3.19.0.RC1 and LibVEX; rerun with -h for copyright info
==56733== Command: /home/briggs/maiko/freebsd.386/ldex -timer 1000000 -g 1440x900 -sc 1440x900 -m 256 -t Medley_Interlisp /usr/home/briggs/medley/tmp/full.sysout
==56733== Parent PID: 56731
==56733== 

valgrind: m_syswrap/syswrap-main.c:2130 (void vgPlain_client_syscall(ThreadId, UInt)): Assertion 'sci->status.what == SsIdle' failed.

host stacktrace:
==56733==    at 0x38116969: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd)
==56733==    by 0x38116C46: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable syscall 209 (lwpid 100366)
==56733==    at 0x754AD3F: _sigprocmask (in /lib/libc.so.7)
==56733==    by 0x77A32F9: ??? (src/lib/libthr/thread/thr_sig.c:288)
==56733==    by 0x77A28BD: ??? (src/lib/libthr/thread/thr_sig.c:246)
==56733==    by 0x381A05A3: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd)
==56733==    by 0x7548F3E: poll (in /lib/libc.so.7)
==56733==    by 0x7768659: ??? (in /usr/local/lib/libxcb.so.1.1.0)
==56733==    by 0x776993E: xcb_writev (in /usr/local/lib/libxcb.so.1.1.0)
==56733==    by 0x73B6452: _XSend (in /usr/local/lib/libX11.so.6.4.0)
==56733==    by 0x73B6CE3: _XFlush (in /usr/local/lib/libX11.so.6.4.0)
==56733==    by 0x7399439: XFlush (in /usr/local/lib/libX11.so.6.4.0)
==56733==    by 0x446379: clipping_Xbitblt (xbbt.c:55)
==56733==    by 0x436ECD: flush_display_lineregion (initdsp.c:346)
==56733==    by 0x4194B0: newbltchar (bbtsub.c:1424)
==56733==    by 0x429F67: OP_subrcall (subr.c:374)
==56733==    by 0x41FE75: dispatch (xc.c:653)
==56733==    by 0x43468B: start_lisp (main.c:612)
==56733==    by 0x43428D: main (main.c:559)
client stack range: [0xFBBFB000 0xFBBFEFFF] client SP: 0xFBBFD8D0
valgrind stack range: [0x52C6000 0x53C5FFF] top usage: 6488 of 1048576
Comment 1 Paul Floyd 2022-04-06 15:02:35 UTC
Hi Nick

Is this a regression compared to a previous version?
Comment 2 Paul Floyd 2022-04-06 15:35:10 UTC
Also, though this looks like it could be good old syscall sigreturn, could you confirm that by running with

--trace-syscalls=yes

and post the last syscall before the assert? It looks like it is SYS_poll
Comment 3 Paul Floyd 2022-04-06 15:54:41 UTC
Last think for the moment

This change

commit bbc3bcab0ae7aa01a116c05c52c66a6714a7df12
Author: Paul Floyd <pjfloyd@wanadoo.fr>
Date:   Sun Apr 3 15:50:38 2022 +0200

    Bug 451843 - valgrind fails to start on a FreeBSD system which enforces W^X

    Also add FreeBSD 13.1 to configure.ac

requires re-running autogen/configure

Could you confirm that you did that?

(actually I missed one occurrence, I'll push that tonight, affects stack traces in messages)
Comment 4 Nick Briggs 2022-04-06 15:58:00 UTC
It's different from valgrind-3.18.1-42b08ed5bd-20211015, where it typically dies with

==73265== Memcheck, a memory error detector
==73265== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==73265== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==73265== Command: /home/briggs/maiko/freebsd.386/ldex -timer 1000000 -g 1440x900 -sc 1440x900 -m 256 -t Medley_Interlisp /usr/home/briggs/medley/tmp/full.sysout
==73265== Parent PID: 73264
==73265== 
==73265== Syscall param sigprocmask(set) points to uninitialised byte(s)
==73265==    at 0x7548D3F: _sigprocmask (in /lib/libc.so.7)
==73265==    by 0x76992F9: ??? (src/lib/libthr/thread/thr_sig.c:288)
==73265==    by 0x76988BD: ??? (src/lib/libthr/thread/thr_sig.c:246)
==73265==    by 0x3819FB73: ??? (in /usr/local/libexec/valgrind/memcheck-x86-freebsd)
==73265==    by 0x754709E: recvmsg (in /lib/libc.so.7)
==73265==    by 0x765C3DE: ??? (in /usr/local/lib/libxcb.so.1.1.0)
==73265==    by 0x765CF39: xcb_poll_for_event (in /usr/local/lib/libxcb.so.1.1.0)
==73265==    by 0x73BD47F: ??? (in /usr/local/lib/libX11.so.6.4.0)
==73265==    by 0x73BC69C: ??? (in /usr/local/lib/libX11.so.6.4.0)
==73265==    by 0x73BC294: _XEventsQueued (in /usr/local/lib/libX11.so.6.4.0)
==73265==    by 0x73BCCEE: _XFlush (in /usr/local/lib/libX11.so.6.4.0)
==73265==    by 0x739F439: XFlush (in /usr/local/lib/libX11.so.6.4.0)
==73265==  Address 0xfbbfdcec is on thread 1's stack
==73265==  in frame #2, created by ??? (src/lib/libthr/thread/thr_sig.c:)
==73265== 
==73265== Syscall param sigreturn(ucp) points to uninitialised byte(s)
==73265==    at 0x75473FD: syscall (in /lib/libc.so.7)
==73265==    by 0x76993AA: ??? (src/lib/libthr/thread/thr_sig.c:319)
==73265==    by 0x76988BD: ??? (src/lib/libthr/thread/thr_sig.c:246)
==73265==    by 0x3819FB73: ??? (in /usr/local/libexec/valgrind/memcheck-x86-freebsd)
==73265==    by 0x754709E: recvmsg (in /lib/libc.so.7)
==73265==    by 0x765C3DE: ??? (in /usr/local/lib/libxcb.so.1.1.0)
==73265==    by 0x765CF39: xcb_poll_for_event (in /usr/local/lib/libxcb.so.1.1.0)
==73265==    by 0x73BD47F: ??? (in /usr/local/lib/libX11.so.6.4.0)
==73265==    by 0x73BC69C: ??? (in /usr/local/lib/libX11.so.6.4.0)
==73265==    by 0x73BC294: _XEventsQueued (in /usr/local/lib/libX11.so.6.4.0)
==73265==    by 0x73BCCEE: _XFlush (in /usr/local/lib/libX11.so.6.4.0)
==73265==    by 0x739F439: XFlush (in /usr/local/lib/libX11.so.6.4.0)
==73265==  Address 0xfbbfd9e0 is on thread 1's stack
==73265==  in frame #1, created by ??? (src/lib/libthr/thread/thr_sig.c:)
==73265== 
==73265== Invalid read of size 4
==73265==    at 0x769926B: ??? (src/lib/libthr/thread/thr_sig.c:262)
==73265==    by 0x76988BD: ??? (src/lib/libthr/thread/thr_sig.c:246)
==73265==    by 0x3819FB73: ??? (in /usr/local/libexec/valgrind/memcheck-x86-freebsd)
==73265==    by 0x75471AE: sigprocmask (in /lib/libc.so.7)
==73265==    by 0x74CEA84: setjmp (in /lib/libc.so.7)
==73265==  Address 0x1bd97094 is not stack'd, malloc'd or (recently) free'd
==73265== 
==73265== 
==73265== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==73265==  Access not within mapped region at address 0x1BD97094
==73265==    at 0x769926B: ??? (src/lib/libthr/thread/thr_sig.c:262)
==73265==    by 0x76988BD: ??? (src/lib/libthr/thread/thr_sig.c:246)
==73265==    by 0x3819FB73: ??? (in /usr/local/libexec/valgrind/memcheck-x86-freebsd)
==73265==    by 0x75471AE: sigprocmask (in /lib/libc.so.7)
==73265==    by 0x74CEA84: setjmp (in /lib/libc.so.7)
==73265==  If you believe this happened as a result of a stack
==73265==  overflow in your program's main thread (unlikely but
==73265==  possible), you can try to increase the size of the
==73265==  main thread stack using the --main-stacksize= flag.
==73265==  The main thread stack size used in this run was 16777216.
==73265== 
==73265== HEAP SUMMARY:
==73265==     in use at exit: 269,922,233 bytes in 692 blocks
==73265==   total heap usage: 1,699 allocs, 1,007 frees, 270,522,703 bytes allocated
==73265== 
 along with
pid 73265 (memcheck-x86-freebs): sigreturn mc_flags fbbfe270
pid 73265 (memcheck-x86-freebs): sigreturn mc_flags fbbfe270
pid 73265 (memcheck-x86-freebs): sigreturn mc_flags 90909090
pid 73265 (memcheck-x86-freebs): sigreturn mc_flags 90909090
pid 73265 (memcheck-x86-freebs): sigreturn mc_flags 90909090
pid 73265 (memcheck-x86-freebs): sigreturn eflags = 0x0
pid 73265 (memcheck-x86-freebs): sigreturn mc_flags 1000
pid 73265 (memcheck-x86-freebs): sigreturn mc_flags 1000
Segmentation fault

-- but per your request for trace syscalls, using the most recent version, yes it's using sys_poll, and yes, it's spewing:
pid 73380 (memcheck-x86-freebs): sigreturn eflags = 0x200090
pid 73380 (memcheck-x86-freebs): sigreturn eflags = 0x200090
pid 73380 (memcheck-x86-freebs): sigreturn eflags = 0x200090


SYSCALL[73380,1](116) sys_gettimeofday ( 0xfbbfe82c, 0x0 )[sync] --> Success(0x0) 
SYSCALL[73380,1](209) sys_poll ( 0xfbbfe5e0, 1, -1 )
 --> [async] ... 

valgrind: m_syswrap/syswrap-main.c:2130 (void vgPlain_client_syscall(ThreadId, UInt)): Assertion 'sci->status.what == SsIdle' failed.

host stacktrace:
==73380==    at 0x38116969: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd)
==73380==    by 0x38116C46: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable syscall 209 (lwpid 101239)
==73380==    at 0x754AD3F: _sigprocmask (in /lib/libc.so.7)
==73380==    by 0x77A32F9: ??? (src/lib/libthr/thread/thr_sig.c:288)
==73380==    by 0x77A28BD: ??? (src/lib/libthr/thread/thr_sig.c:246)
==73380==    by 0x381A05A3: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd)
==73380==    by 0x7548F3E: poll (in /lib/libc.so.7)
==73380==    by 0x7768659: ??? (in /usr/local/lib/libxcb.so.1.1.0)
==73380==    by 0x776993E: xcb_writev (in /usr/local/lib/libxcb.so.1.1.0)
==73380==    by 0x73B6452: _XSend (in /usr/local/lib/libX11.so.6.4.0)
==73380==    by 0x73B6CE3: _XFlush (in /usr/local/lib/libX11.so.6.4.0)
==73380==    by 0x7399439: XFlush (in /usr/local/lib/libX11.so.6.4.0)
==73380==    by 0x446379: clipping_Xbitblt (xbbt.c:55)
==73380==    by 0x436E2D: flush_display_region (initdsp.c:303)
==73380==    by 0x417682: bitblt_bitmap (bbtsub.c:834)
==73380==    by 0x429787: OP_subrcall (subr.c:385)
==73380==    by 0x41FE75: dispatch (xc.c:653)
==73380==    by 0x43468B: start_lisp (main.c:612)
==73380==    by 0x43428D: main (main.c:559)
client stack range: [0xFBBFB000 0xFBBFEFFF] client SP: 0xFBBFD8D0
valgrind stack range: [0x52C6000 0x53C5FFF] top usage: 6488 of 1048576


Note: see also the FAQ in the source distribution.
It contains workarounds to several common problems.
In particular, if Valgrind aborted or crashed after
identifying problems in your program, there's a good chance
that fixing those problems will prevent Valgrind aborting or
crashing, especially if it happened in m_mallocfree.c.

If that doesn't help, please report this bug to: www.valgrind.org

In the bug report, send all the above text, the valgrind
version, and what OS and version you are using.  Thanks.

and yes, I re-ran autogen.sh, configure, (compiling with clang) and rebuilt it all.
-rwxr-xr-x  1 briggs  briggs  562723 Apr  4 12:36 configure
-rwxr-xr-x  1 briggs  briggs  170925 Apr  4 12:03 configure.ac
Comment 5 Paul Floyd 2022-04-06 16:20:32 UTC
OK thanks.

In that case I have an idea.

Back to https://github.com/paulfloyd/freebsd_valgrind/issues/137, the flow is

signal raised -> pth_sighandler -> user sighandler -> return to pth_sighandler -> call sigreturn -> sigreturn fails -> annoying kernel message -> return to Valgrind trampiline -> syscall fake_sigreturn -> back to VEX CPU

but what would happen if "sigreturn fails" doesn't fail? That's really bad news and will almost certainly result in a crash. We need it to fail in order to get back to fake_sigreturn.

I'll try to see what happens if I do something like set bit 4 of mc_flags in the synthesized ucontext to ensure that sigreturn fails in the mc_flags check (which is the least-worst option at the moment).
Comment 6 Paul Floyd 2022-04-06 19:54:03 UTC
Nick, could you try applying the following diff

diff --git a/coregrind/m_syswrap/syswrap-x86-freebsd.c b/coregrind/m_syswrap/syswrap-x86-freebsd.c
index 0ba304952..1283f637b 100644
--- a/coregrind/m_syswrap/syswrap-x86-freebsd.c
+++ b/coregrind/m_syswrap/syswrap-x86-freebsd.c
@@ -863,7 +863,7 @@ static void fill_mcontext(ThreadState *tst, struct vki_mcontext *sc)
    sc->es = tst->arch.vex.guest_ES;
    sc->fs = tst->arch.vex.guest_FS;
    sc->gs = tst->arch.vex.guest_GS;
-   sc->eflags = LibVEX_GuestX86_get_eflags(&tst->arch.vex);
+   sc->eflags = LibVEX_GuestX86_get_eflags(&tst->arch.vex) | 0x400000;
    /*
       not yet.
       VG_(memcpy)(&sc->fpstate, fpstate, sizeof(*fpstate));

0x400000 is not used, but it is also not an allowed user flag that can change.

Does this prevent the crash?

Running x86 on amd64 I get
pid 3067 (memcheck-x86-freebs): freebsd32_sigreturn eflags = 0x200091
Comment 7 Nick Briggs 2022-04-06 20:09:58 UTC
It generates plenty of

pid 79880 (memcheck-x86-freebs): sigreturn eflags = 0x200004
pid 79880 (memcheck-x86-freebs): sigreturn eflags = 0x200094
pid 79880 (memcheck-x86-freebs): sigreturn eflags = 0x200094
pid 79880 (memcheck-x86-freebs): sigreturn eflags = 0x200084
pid 79880 (memcheck-x86-freebs): sigreturn eflags = 0x200084
pid 79880 (memcheck-x86-freebs): sigreturn eflags = 0x200084

but it still crashes with

SYSCALL[79880,1]( 27) sys_recvmsg ( 7, 0xfbbfe5ec, 0 ) --> [async] ... 
SYSCALL[79880,1]( 27) ... [async] --> Failure(0x23) 
SYSCALL[79880,1](209) sys_poll ( 0xfbbfe5e0, 1, -1 )
 --> [async] ... 
SYSCALL[79880,1](209) ... [async] --> Success(0x1) 
SYSCALL[79880,1](121) sys_writev ( 7, 0xfbbfe6dc, 3 ) --> [async] ... 
SYSCALL[79880,1](121) ... [async] --> Success(0x48) 
SYSCALL[79880,1]( 27) sys_recvmsg ( 7, 0xfbbfe5ec, 0 ) --> [async] ... 
SYSCALL[79880,1]( 27) ... [async] --> Failure(0x23) 
SYSCALL[79880,1](209) sys_poll ( 0xfbbfe5e0, 1, -1 )
 --> [async] ... 
SYSCALL[79880,1](209) ... [async] --> Success(0x1) 
SYSCALL[79880,1](121) sys_writev ( 7, 0xfbbfe6dc, 3 ) --> [async] ... 
SYSCALL[79880,1](121) ... [async] --> Success(0x48) 
SYSCALL[79880,1]( 27) sys_recvmsg ( 7, 0xfbbfe5ec, 0 ) --> [async] ... 
SYSCALL[79880,1]( 27) ... [async] --> Failure(0x23) 
SYSCALL[79880,1](209) sys_poll ( 0xfbbfe5e0, 1, -1 )
 --> [async] ... 

valgrind: m_syswrap/syswrap-main.c:2130 (void vgPlain_client_syscall(ThreadId, UInt)): Assertion 'sci->status.what == SsIdle' failed.

host stacktrace:
==79880==    at 0x38116979: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd)
==79880==    by 0x38116C56: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable syscall 209 (lwpid 100159)
==79880==    at 0x754AD3F: _sigprocmask (in /lib/libc.so.7)
==79880==    by 0x77A32F9: ??? (src/lib/libthr/thread/thr_sig.c:288)
==79880==    by 0x77A28BD: ??? (src/lib/libthr/thread/thr_sig.c:246)
==79880==    by 0x381A05B3: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd)
==79880==    by 0x7548F3E: poll (in /lib/libc.so.7)
==79880==    by 0x7768659: ??? (in /usr/local/lib/libxcb.so.1.1.0)
==79880==    by 0x776993E: xcb_writev (in /usr/local/lib/libxcb.so.1.1.0)
==79880==    by 0x73B6452: _XSend (in /usr/local/lib/libX11.so.6.4.0)
==79880==    by 0x73B6CE3: _XFlush (in /usr/local/lib/libX11.so.6.4.0)
==79880==    by 0x7399439: XFlush (in /usr/local/lib/libX11.so.6.4.0)
==79880==    by 0x446379: clipping_Xbitblt (xbbt.c:55)
==79880==    by 0x436ECD: flush_display_lineregion (initdsp.c:346)
==79880==    by 0x4194B0: newbltchar (bbtsub.c:1424)
==79880==    by 0x429F67: OP_subrcall (subr.c:374)
==79880==    by 0x41FE75: dispatch (xc.c:653)
==79880==    by 0x43468B: start_lisp (main.c:612)
==79880==    by 0x43428D: main (main.c:559)
client stack range: [0xFBBFB000 0xFBBFEFFF] client SP: 0xFBBFD8D0
valgrind stack range: [0x52C6000 0x53C5FFF] top usage: 6488 of 1048576
Comment 8 Paul Floyd 2022-04-06 20:41:21 UTC
I was beginning to suspect so.

The crash site seems to be

	__sys_sigprocmask(SIG_SETMASK, &actp->sa_mask, NULL);

So I assume that is somehow getting a junk pointer for the mask. Need to look at how __libc_sigaction_slot works.
Comment 9 Paul Floyd 2022-04-07 09:04:13 UTC
Or more accurately, the guest seems to be performing (or just completed) sigprocmask but the host is starting to make a new syscall.

The assert triggers because the host is expecting the syscall state to be idle but it mist be something else (complete or hand to kernel, or a corrupt value).

Could you add something like

VG_(printf)("DEBUG: %s %d\n", __func__, sci->status.what);

just before the assert on line 2130 of m_syswrap/syswrap-main.c and report back as to whether 'what' is 1, 2 or junk?
Comment 10 Nick Briggs 2022-04-07 17:20:33 UTC
It's 2 (SsHandToKernel) when it crashes.
Comment 11 Paul Floyd 2022-04-07 19:21:36 UTC
That means that it is trying to start a new syscall whilst another is in flight.

In the logs I see poll() then sigprocmask(). But that doesn't mean that the new syscall is sigprocmask, it could be some other syscall caused by an async signal.

How easy would it be for me to build maiko? I mainly use VirtualBox and just a shell on platforms other than 13.0-RELEASE amd64.

Otherwise, could you modify syswrap-main.c again

   vg_assert(sci->status.what == SsIdle);

   getSyscallArgsFromGuestState( &sci->orig_args, &tst->arch.vex, trc );

   /* Copy .orig_args to .args.  The pre-handler may modify .args, but
      we want to keep the originals too, just in case. */
   sci->args = sci->orig_args;

   /* Save the syscall number in the thread state in case the syscall 
      is interrupted by a signal. */
   sysno = sci->orig_args.sysno;

-> put a VG_(printf) of sysno here, and move the assert after the printf
Comment 12 Nick Briggs 2022-04-07 19:28:40 UTC
It's easier for you to build maiko now than it used to be. See https://github.com/Interlisp/maiko but you also need a medley Interlisp release, https://github.com/interlisp/medley -- and it gets a bit tricky to get to the right place to cause this trouble.

I'll go ahead and make the changes you asked for and report back.
Comment 13 Nick Briggs 2022-04-07 19:38:42 UTC
DEBUG: vgPlain_client_syscall sysno 121
DEBUG: vgPlain_client_syscall sysno 27
DEBUG: vgPlain_client_syscall sysno 116
DEBUG: vgPlain_client_syscall sysno 116
DEBUG: vgPlain_client_syscall sysno 116
DEBUG: vgPlain_client_syscall sysno 240
DEBUG: vgPlain_client_syscall sysno 27
DEBUG: vgPlain_client_syscall sysno 116
DEBUG: vgPlain_client_syscall sysno 209
DEBUG: vgPlain_client_syscall sysno 121
DEBUG: vgPlain_client_syscall sysno 27
DEBUG: vgPlain_client_syscall sysno 209
DEBUG: vgPlain_client_syscall sysno 121
DEBUG: vgPlain_client_syscall sysno 27
DEBUG: vgPlain_client_syscall sysno 209
DEBUG: vgPlain_client_syscall sysno 121
DEBUG: vgPlain_client_syscall sysno 27
DEBUG: vgPlain_client_syscall sysno 209
DEBUG: vgPlain_client_syscall sysno 121
DEBUG: vgPlain_client_syscall sysno 27
DEBUG: vgPlain_client_syscall sysno 209
DEBUG: vgPlain_client_syscall 2
DEBUG: vgPlain_client_syscall sysno 340

valgrind: m_syswrap/syswrap-main.c:2142 (void vgPlain_client_syscall(ThreadId, UInt)): Assertion 'sci->status.what == SsIdle' failed.

host stacktrace:
==94241==    at 0x381169A9: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd)
==94241==    by 0x38116C86: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable syscall 340 (lwpid 100133)
==94241==    at 0x754AD3F: _sigprocmask (in /lib/libc.so.7)
==94241==    by 0x77A32F9: ??? (src/lib/libthr/thread/thr_sig.c:288)
==94241==    by 0x77A28BD: ??? (src/lib/libthr/thread/thr_sig.c:246)
==94241==    by 0x381A05E3: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd)
==94241==    by 0x7548F3E: poll (in /lib/libc.so.7)
==94241==    by 0x7768659: ??? (in /usr/local/lib/libxcb.so.1.1.0)
Comment 14 Nick Briggs 2022-04-07 19:54:06 UTC
I turned back on the trace syscalls and ran it again (also fixed up the msg about the state) --

DEBUG: vgPlain_client_syscall sysno 209
SYSCALL[94860,1](209) sys_poll ( 0xfbbfe5c0, 1, -1 )
 --> [async] ... 
SYSCALL[94860,1](209) ... [async] --> Success(0x1) 
DEBUG: vgPlain_client_syscall sysno 121
SYSCALL[94860,1](121) sys_writev ( 7, 0xfbbfe6bc, 3 ) --> [async] ... 
SYSCALL[94860,1](121) ... [async] --> Success(0x48) 
DEBUG: vgPlain_client_syscall sysno 27
SYSCALL[94860,1]( 27) sys_recvmsg ( 7, 0xfbbfe5cc, 0 ) --> [async] ... 
SYSCALL[94860,1]( 27) ... [async] --> Failure(0x23) 
DEBUG: vgPlain_client_syscall sysno 209
SYSCALL[94860,1](209) sys_poll ( 0xfbbfe5c0, 1, -1 )
 --> [async] ... 
SYSCALL[94860,1](209) ... [async] --> Success(0x1) 
DEBUG: vgPlain_client_syscall sysno 121
SYSCALL[94860,1](121) sys_writev ( 7, 0xfbbfe6bc, 3 ) --> [async] ... 
DEBUG: vgPlain_client_syscall sci->status.what 2
DEBUG: vgPlain_client_syscall sysno 340

valgrind: m_syswrap/syswrap-main.c:2142 (void vgPlain_client_syscall(ThreadId, UInt)): Assertion 'sci->status.what == SsIdle' failed.

host stacktrace:
==94860==    at 0x381169B9: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd)
==94860==    by 0x38116C96: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable syscall 340 (lwpid 100133)
==94860==    at 0x754AD3F: _sigprocmask (in /lib/libc.so.7)
==94860==    by 0x77A32F9: ??? (src/lib/libthr/thread/thr_sig.c:288)
==94860==    by 0x77A28BD: ??? (src/lib/libthr/thread/thr_sig.c:246)
==94860==    by 0x381A05F3: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd)
==94860==    by 0x754939E: writev (in /lib/libc.so.7)
==94860==    by 0x7768808: ??? (in /usr/local/lib/libxcb.so.1.1.0)
==94860==    by 0x776993E: xcb_writev (in /usr/local/lib/libxcb.so.1.1.0)
==94860==    by 0x73B6452: _XSend (in /usr/local/lib/libX11.so.6.4.0)
Comment 15 Nick Briggs 2022-04-07 20:00:26 UTC
In case it matters... I just looked at the maiko code that sets up the VTALRM signal, notice the SA_RESTART in the flags.

  struct itimerval timert;
  struct sigaction timer_action;

  timer_action.sa_handler = int_timer_service;
  sigemptyset(&timer_action.sa_mask);
  timer_action.sa_flags = SA_RESTART;

  if (sigaction(SIGVTALRM, &timer_action, NULL) == -1) {
    perror("sigaction: SIGVTALRM");
  }

  /* then attach a timer to it and turn it loose */
  timert.it_interval.tv_sec = timert.it_value.tv_sec = 0;
  timert.it_interval.tv_usec = timert.it_value.tv_usec = TIMER_INTERVAL;
  setitimer(ITIMER_VIRTUAL, &timert, NULL);
Comment 16 Nick Briggs 2022-04-07 20:09:07 UTC
Red herring.  Removing the SA_RESTART doesn't change anything.
Comment 17 Nick Briggs 2022-04-07 21:13:17 UTC
Created attachment 148033 [details]
C source test case

This test case provokes the problem.
cc -O2 -pipe  vgtest2.c  -o vgtest2

Running under valgrind:
[...]
 DEBUG: vgPlain_client_syscall sysno 4
 DEBUG: vgPlain_client_syscall sysno 4
 DEBUG: vgPlain_client_syscall sysno 4
 DEBUG: vgPlain_client_syscall sci->status.what 2
DEBUG: vgPlain_client_syscall sysno 1000

valgrind: m_syswrap/syswrap-main.c:2142 (void vgPlain_client_syscall(ThreadId, UInt)): Assertion 'sci->status.what == SsIdle' failed.

host stacktrace:
==95237==    at 0x381169B9: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd)
==95237==    by 0x38116C96: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable syscall 1000 (lwpid 100137)
==95237==    at 0x381A0601: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd)
==95237==    by 0x4018FD: main (in /var/tmp/vgtest2)
client stack range: [0xFBBFC000 0xFBBFEFFF] client SP: 0xFBBFE1E8
valgrind stack range: [0x51BE000 0x52BDFFF] top usage: 5792 of 1048576
Comment 18 Paul Floyd 2022-04-08 11:33:32 UTC
I just double checked that there are no problems with https://bugs.kde.org/show_bug.cgi?id=445032 which is fairly similar. Works OK with i386/ amd64, no pthread and with pthread.

This testcase _fails_ on all 4 combinations.

Here, amd64 no pthread

valgrind: m_syswrap/syswrap-main.c:2130 (void vgPlain_client_syscall(ThreadId, UInt)): Assertion 'sci->status.what == SsIdle' failed.

host stacktrace:
==39165==    at 0x38106032: ??? (in /usr/home/paulf/scratch/valgrind/memcheck/memcheck-amd64-freebsd)
==39165==    by 0x402AA9FDF: ???
==39165==    by 0x7C: ???
==39165==    by 0x38106031: ??? (in /usr/home/paulf/scratch/valgrind/memcheck/memcheck-amd64-freebsd)
==39165==    by 0x402AA984F: ???
==39165==    by 0x40200838F: ???

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable syscall   4 (lwpid 102465)
==39165==    at 0x3819A5D8: ??? (in /usr/home/paulf/scratch/valgrind/memcheck/memcheck-amd64-freebsd)
==39165==    by 0x2017AF: (below main) (src/lib/csu/amd64/crt1_c.c:75)
client stack range: [0x7FBFFC000 0x7FC000FFF] client SP: 0x7FBFFF120
valgrind stack range: [0x4029AA000 0x402AA9FFF] top usage: 8312 of 1048576

With ktrace

 39225 memcheck-amd64-free RET   read 1
 39225 memcheck-amd64-free CALL  thr_self(0x402aa97c8)
 39225 memcheck-amd64-free RET   thr_self 0
 39225 memcheck-amd64-free CALL  sigprocmask(SIG_SETMASK,0x402aa9e70,0)
 39225 memcheck-amd64-free RET   sigprocmask 0
 39225 memcheck-amd64-free CALL  sigprocmask(SIG_SETMASK,0x402aa9e08,0x402aa9df8)
 39225 memcheck-amd64-free RET   sigprocmask 0
 39225 memcheck-amd64-free CALL  sigtimedwait(0x402aa9de8,0x402aa9e08,0x38064a58)
 39225 memcheck-amd64-free RET   sigtimedwait -1 errno 35 Resource temporarily unavailable
 39225 memcheck-amd64-free CALL  sigprocmask(SIG_SETMASK,0x402aa9df8,0)
 39225 memcheck-amd64-free RET   sigprocmask 0
 39225 memcheck-amd64-free CALL  write(0x72c98,0x402aa99f8,0x87)
 39225 memcheck-amd64-free GIO   fd 470168 wrote 135 bytes
       "
        valgrind: m_syswrap/syswrap-main.c:2130 (void vgPlain_client_syscall(ThreadId, UInt)): Assertion 'sci->status.what == SsIdle' failed.
Comment 19 Paul Floyd 2022-04-08 18:27:36 UTC
So what I think is happening is that a non-blocking write is occurring which results in an immediate return with EAGAIN (Failure(0x23)  in the logs). Nevertheless the syscall is being marked as SfMayBlock.

m_signals.c has this comment

   So what we have to do, when doing any syscall which SfMayBlock, is to
   quickly switch in the SCSS-specified signal mask just before the
   syscall, and switch it back just afterwards, and hope that we don't
   get caught up in some weird race condition.  This is the primary
   purpose of the ultra-magical pieces of assembly code in
   coregrind/m_syswrap/syscall-<plat>.S

I'm not sure if this is such a weird race condition, or whether it was just wrong to set SfMayBlock.

I tried commenting out 
   *flags |= SfMayBlock;

in PRE(sys_write), syswrap-generic.c line 4280. With the the small testcase works. 

I doubt that this is a real solution, it may well cause hangs in other situations.

I then tried to be a bit more clever with

   int fd_flags = VG_(fcntl(ARG1, VKI_F_GETFL, 0));
   if ((fd_flags & VKI_O_NONBLOCK) == 0)
      *flags |= SfMayBlock;

but for fd 2 I just see flags of 0xa (write and append).
Comment 20 Nick Briggs 2022-04-08 18:54:43 UTC
I'm not sure why the write() would be flagged non-blocking.
You can replace the write(...) in the test with read(0, NULL, 0) and get the same result.
I also tried
    int fd = open("/tmp/dummy", O_WRONLY, O_CREAT);
    if (fd != -1) close(fd);
and got the failure.   I wonder if it's possible that it fails with any of the syscalls that are in the list in sigaction's SA_RESTART (although actually *using* SA_RESTART doesn't seem to affect the test case)

%%%%%
The affected system calls include open(2), read(2), write(2),
     sendto(2), recvfrom(2), sendmsg(2) and recvmsg(2) on a communications
     channel or a slow device (such as a terminal, but not a regular file) and
     during a wait(2) or ioctl(2).  However, calls that have already committed
     are not restarted, but instead return a partial success (for example, a
     short read count).
%%%%%

I've made it fail with read(), write(), open(), wait(), but neither stat() nor usleep().
Comment 21 Nick Briggs 2022-04-08 19:22:57 UTC
Here's what happens with --trace-signals=yes and using the write(...) as the thing to be interrupted:

DEBUG: vgPlain_client_syscall sysno 4
DEBUG: vgPlain_client_syscall sysno 4
DEBUG: vgPlain_client_syscall sysno 4
DEBUG: vgPlain_client_syscall sysno 4
DEBUG: vgPlain_client_syscall sysno 4
DEBUG: vgPlain_client_syscall sysno 4
--12231-- async signal handler: signal=26, vgtid=100118, tid=1, si_code=65542, exitreason VgSrc_None
--12231-- interrupted_syscall: tid=1, ip=0x381a1e02, restart=False, sres.isErr=False, sres.val=0
--12231--   not started: restarting
--12231-- delivering signal 26 (SIGVTALRM):65542 to thread 1
--12231-- push_signal_frame (thread 1): signal 26
==12231==    at 0x7303DBD: _write (in /lib/libc.so.7)
==12231==    by 0x730236E: write (in /lib/libc.so.7)
==12231==    by 0x4018ED: main (in /var/tmp/vgtest2)
DEBUG: vgPlain_client_syscall sysno 1000
--12231-- VG_(signal_return) (thread 1): EIP=0x7303dbd
DEBUG: vgPlain_client_syscall sysno 4
DEBUG: vgPlain_client_syscall sysno 4
DEBUG: vgPlain_client_syscall sysno 4
--12231-- async signal handler: signal=26, vgtid=100118, tid=1, si_code=65542, exitreason VgSrc_None
--12231-- interrupted_syscall: tid=1, ip=0x381e4921, restart=False, sres.isErr=False, sres.val=0
--12231--   not in syscall at all: hmm, very suspicious
--12231-- delivering signal 26 (SIGVTALRM):65542 to thread 1
--12231-- push_signal_frame (thread 1): signal 26
==12231==    at 0x7303DBF: _write (in /lib/libc.so.7)
==12231==    by 0x730236E: write (in /lib/libc.so.7)
==12231==    by 0x4018ED: main (in /var/tmp/vgtest2)
DEBUG: vgPlain_client_syscall sci->status.what 2
DEBUG: vgPlain_client_syscall sysno 1000

valgrind: m_syswrap/syswrap-main.c:2142 (void vgPlain_client_syscall(ThreadId, UInt)): Assertion 'sci->status.what == SsIdle' failed.

host stacktrace:
==12231==    at 0x38116B19: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd)
==12231==    by 0x38116DF6: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable syscall 1000 (lwpid 100118)
==12231==    at 0x381A0821: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd)
==12231==    by 0x4018ED: main (in /var/tmp/vgtest2)
client stack range: [0xFBBFC000 0xFBBFEFFF] client SP: 0xFBBFE1E8
valgrind stack range: [0x51BE000 0x52BDFFF] top usage: 5792 of 1048576

Where the "very suspicious" message is from coregrind/m_syswrap/syswrap-main.c
Comment 22 Nick Briggs 2022-04-08 19:49:27 UTC
I threw in a VG_(printf):

@@ -3117,6 +3119,7 @@ VG_(fixup_guest_state_after_syscall_interrupted)( ThreadId tid,
       if (VG_(clo_trace_signals))
          VG_(message)( Vg_DebugMsg,
                        "  not in syscall at all: hmm, very suspicious\n" );
+      VG_(printf)("DEBUG: blksys_setup %ld ip %ld blksys_finished %ld\n", ML_(blksys_setup), ip, ML_(blksys_finished));
       /* Looks like we weren't in a syscall at all.  Hmm. */
       vg_assert(sci->status.what != SsIdle);
       return;

and see:

DEBUG: vgPlain_client_syscall sysno 4
DEBUG: vgPlain_client_syscall sysno 4
DEBUG: vgPlain_client_syscall sysno 4
DEBUG: vgPlain_client_syscall sysno 4
DEBUG: blksys_setup 941235726 ip 941508929 blksys_finished 941235863
DEBUG: vgPlain_client_syscall sci->status.what 2
DEBUG: vgPlain_client_syscall sysno 1000

and that ip is in x86g_calculate_eflags_all_WRK !
Comment 23 Paul Floyd 2022-04-08 20:13:37 UTC
The write man page says

     [EAGAIN]           The file was marked for non-blocking I/O, and no data
                        could be written immediately.

So I was assuming that fd 2 to the terminal was nonblocking.
Comment 24 Paul Floyd 2022-04-08 20:14:41 UTC
(In reply to Nick Briggs from comment #22)
> I threw in a VG_(printf):
> 
> @@ -3117,6 +3119,7 @@ VG_(fixup_guest_state_after_syscall_interrupted)(
> ThreadId tid,
>        if (VG_(clo_trace_signals))
>           VG_(message)( Vg_DebugMsg,
>                         "  not in syscall at all: hmm, very suspicious\n" );
> +      VG_(printf)("DEBUG: blksys_setup %ld ip %ld blksys_finished %ld\n",
> ML_(blksys_setup), ip, ML_(blksys_finished));
>        /* Looks like we weren't in a syscall at all.  Hmm. */
>        vg_assert(sci->status.what != SsIdle);
>        return;
> 
> and see:
> 
> DEBUG: vgPlain_client_syscall sysno 4
> DEBUG: vgPlain_client_syscall sysno 4
> DEBUG: vgPlain_client_syscall sysno 4
> DEBUG: vgPlain_client_syscall sysno 4
> DEBUG: blksys_setup 941235726 ip 941508929 blksys_finished 941235863
> DEBUG: vgPlain_client_syscall sci->status.what 2
> DEBUG: vgPlain_client_syscall sysno 1000
> 
> and that ip is in x86g_calculate_eflags_all_WRK !

Very interesting.
Comment 25 Paul Floyd 2022-04-08 20:54:54 UTC
The flags seem to be used.

Now how to determine if we are in either of those functions?

As a quick hack for testing

putting 

#if defined(VGO_freebsd)
  if (outside_range)
  {
     outside_range = False;
     in_complete_to_committed = True;
  }
#endif

just before the check for outside_range seems to make things work.
Comment 26 Nick Briggs 2022-04-08 23:10:12 UTC
> Now how to determine if we are in either of those functions?

I think it's actually any of 4 functions: 

3805cd30 R vgModuleLocal_blksys_setup
3805cd34 R vgModuleLocal_blksys_restart
3805cd38 R vgModuleLocal_blksys_complete
3805cd3c R vgModuleLocal_blksys_committed
3805cd40 R vgModuleLocal_blksys_finished

It's checking
     outside_range = ip < ML_(blksys_setup) || ip >= ML_(blksys_finished)
so being in any of blksys_setup,  blksys_restart, blksys_complete, blksys_committed is supposed to be OK.

40 years ago I had to debug this sort of junk in my own assembler code with an in-circuit emulator.  Things haven't improved much, have they!
Comment 27 Paul Floyd 2022-04-12 21:52:23 UTC
Didn't make it into 3.19.0, but this should now be OK

paulf> git push
Enumerating objects: 40, done.
Counting objects: 100% (40/40), done.
Delta compression using up to 4 threads
Compressing objects: 100% (22/22), done.
Writing objects: 100% (24/24), 3.95 KiB | 3.95 MiB/s, done.
Total 24 (delta 17), reused 0 (delta 0), pack-reused 0
To ssh://sourceware.org/git/valgrind.git
   68ac5ca0e..3e7774ea5  master -> master
Comment 28 Nick Briggs 2022-04-12 22:48:01 UTC
Looks good.  Thanks!  I'd suggest, though, using something other than " " in the test (write(2, " ", 1);) so that it's more obvious if the output doesn't match, rather than trying to count spaces ;-)

Now on to the next problem where the X library chokes sometimes when SIGVTALRMs go off -- not sure that it's just valgrind though.
Comment 29 Paul Floyd 2022-04-13 07:14:35 UTC
I made 2 changes to the testcase. Reduced the number of ticks to make the runtime a lot shorter. And added a stderr filter to remove all those spaces. Without the stderr filter, the filter is unnecessarily large, and the testcase also non-determinisitic.

I think that the important thing is to be spamming a fairly slow syscall like `write` whilst having the timer firing ang generating interrupts.