SUMMARY Running the latest valgrind from git, valgrind-3.19.0.RC1-615731617b-20220404 on FreeBSD 13.0-RELEASE-p8 on i386 hardware generated an assertion failure: ==56733== Memcheck, a memory error detector ==56733== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al. ==56733== Using Valgrind-3.19.0.RC1 and LibVEX; rerun with -h for copyright info ==56733== Command: /home/briggs/maiko/freebsd.386/ldex -timer 1000000 -g 1440x900 -sc 1440x900 -m 256 -t Medley_Interlisp /usr/home/briggs/medley/tmp/full.sysout ==56733== Parent PID: 56731 ==56733== valgrind: m_syswrap/syswrap-main.c:2130 (void vgPlain_client_syscall(ThreadId, UInt)): Assertion 'sci->status.what == SsIdle' failed. host stacktrace: ==56733== at 0x38116969: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd) ==56733== by 0x38116C46: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd) sched status: running_tid=1 Thread 1: status = VgTs_Runnable syscall 209 (lwpid 100366) ==56733== at 0x754AD3F: _sigprocmask (in /lib/libc.so.7) ==56733== by 0x77A32F9: ??? (src/lib/libthr/thread/thr_sig.c:288) ==56733== by 0x77A28BD: ??? (src/lib/libthr/thread/thr_sig.c:246) ==56733== by 0x381A05A3: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd) ==56733== by 0x7548F3E: poll (in /lib/libc.so.7) ==56733== by 0x7768659: ??? (in /usr/local/lib/libxcb.so.1.1.0) ==56733== by 0x776993E: xcb_writev (in /usr/local/lib/libxcb.so.1.1.0) ==56733== by 0x73B6452: _XSend (in /usr/local/lib/libX11.so.6.4.0) ==56733== by 0x73B6CE3: _XFlush (in /usr/local/lib/libX11.so.6.4.0) ==56733== by 0x7399439: XFlush (in /usr/local/lib/libX11.so.6.4.0) ==56733== by 0x446379: clipping_Xbitblt (xbbt.c:55) ==56733== by 0x436ECD: flush_display_lineregion (initdsp.c:346) ==56733== by 0x4194B0: newbltchar (bbtsub.c:1424) ==56733== by 0x429F67: OP_subrcall (subr.c:374) ==56733== by 0x41FE75: dispatch (xc.c:653) ==56733== by 0x43468B: start_lisp (main.c:612) ==56733== by 0x43428D: main (main.c:559) client stack range: [0xFBBFB000 0xFBBFEFFF] client SP: 0xFBBFD8D0 valgrind stack range: [0x52C6000 0x53C5FFF] top usage: 6488 of 1048576
Hi Nick Is this a regression compared to a previous version?
Also, though this looks like it could be good old syscall sigreturn, could you confirm that by running with --trace-syscalls=yes and post the last syscall before the assert? It looks like it is SYS_poll
Last think for the moment This change commit bbc3bcab0ae7aa01a116c05c52c66a6714a7df12 Author: Paul Floyd <pjfloyd@wanadoo.fr> Date: Sun Apr 3 15:50:38 2022 +0200 Bug 451843 - valgrind fails to start on a FreeBSD system which enforces W^X Also add FreeBSD 13.1 to configure.ac requires re-running autogen/configure Could you confirm that you did that? (actually I missed one occurrence, I'll push that tonight, affects stack traces in messages)
It's different from valgrind-3.18.1-42b08ed5bd-20211015, where it typically dies with ==73265== Memcheck, a memory error detector ==73265== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al. ==73265== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info ==73265== Command: /home/briggs/maiko/freebsd.386/ldex -timer 1000000 -g 1440x900 -sc 1440x900 -m 256 -t Medley_Interlisp /usr/home/briggs/medley/tmp/full.sysout ==73265== Parent PID: 73264 ==73265== ==73265== Syscall param sigprocmask(set) points to uninitialised byte(s) ==73265== at 0x7548D3F: _sigprocmask (in /lib/libc.so.7) ==73265== by 0x76992F9: ??? (src/lib/libthr/thread/thr_sig.c:288) ==73265== by 0x76988BD: ??? (src/lib/libthr/thread/thr_sig.c:246) ==73265== by 0x3819FB73: ??? (in /usr/local/libexec/valgrind/memcheck-x86-freebsd) ==73265== by 0x754709E: recvmsg (in /lib/libc.so.7) ==73265== by 0x765C3DE: ??? (in /usr/local/lib/libxcb.so.1.1.0) ==73265== by 0x765CF39: xcb_poll_for_event (in /usr/local/lib/libxcb.so.1.1.0) ==73265== by 0x73BD47F: ??? (in /usr/local/lib/libX11.so.6.4.0) ==73265== by 0x73BC69C: ??? (in /usr/local/lib/libX11.so.6.4.0) ==73265== by 0x73BC294: _XEventsQueued (in /usr/local/lib/libX11.so.6.4.0) ==73265== by 0x73BCCEE: _XFlush (in /usr/local/lib/libX11.so.6.4.0) ==73265== by 0x739F439: XFlush (in /usr/local/lib/libX11.so.6.4.0) ==73265== Address 0xfbbfdcec is on thread 1's stack ==73265== in frame #2, created by ??? (src/lib/libthr/thread/thr_sig.c:) ==73265== ==73265== Syscall param sigreturn(ucp) points to uninitialised byte(s) ==73265== at 0x75473FD: syscall (in /lib/libc.so.7) ==73265== by 0x76993AA: ??? (src/lib/libthr/thread/thr_sig.c:319) ==73265== by 0x76988BD: ??? (src/lib/libthr/thread/thr_sig.c:246) ==73265== by 0x3819FB73: ??? (in /usr/local/libexec/valgrind/memcheck-x86-freebsd) ==73265== by 0x754709E: recvmsg (in /lib/libc.so.7) ==73265== by 0x765C3DE: ??? (in /usr/local/lib/libxcb.so.1.1.0) ==73265== by 0x765CF39: xcb_poll_for_event (in /usr/local/lib/libxcb.so.1.1.0) ==73265== by 0x73BD47F: ??? (in /usr/local/lib/libX11.so.6.4.0) ==73265== by 0x73BC69C: ??? (in /usr/local/lib/libX11.so.6.4.0) ==73265== by 0x73BC294: _XEventsQueued (in /usr/local/lib/libX11.so.6.4.0) ==73265== by 0x73BCCEE: _XFlush (in /usr/local/lib/libX11.so.6.4.0) ==73265== by 0x739F439: XFlush (in /usr/local/lib/libX11.so.6.4.0) ==73265== Address 0xfbbfd9e0 is on thread 1's stack ==73265== in frame #1, created by ??? (src/lib/libthr/thread/thr_sig.c:) ==73265== ==73265== Invalid read of size 4 ==73265== at 0x769926B: ??? (src/lib/libthr/thread/thr_sig.c:262) ==73265== by 0x76988BD: ??? (src/lib/libthr/thread/thr_sig.c:246) ==73265== by 0x3819FB73: ??? (in /usr/local/libexec/valgrind/memcheck-x86-freebsd) ==73265== by 0x75471AE: sigprocmask (in /lib/libc.so.7) ==73265== by 0x74CEA84: setjmp (in /lib/libc.so.7) ==73265== Address 0x1bd97094 is not stack'd, malloc'd or (recently) free'd ==73265== ==73265== ==73265== Process terminating with default action of signal 11 (SIGSEGV): dumping core ==73265== Access not within mapped region at address 0x1BD97094 ==73265== at 0x769926B: ??? (src/lib/libthr/thread/thr_sig.c:262) ==73265== by 0x76988BD: ??? (src/lib/libthr/thread/thr_sig.c:246) ==73265== by 0x3819FB73: ??? (in /usr/local/libexec/valgrind/memcheck-x86-freebsd) ==73265== by 0x75471AE: sigprocmask (in /lib/libc.so.7) ==73265== by 0x74CEA84: setjmp (in /lib/libc.so.7) ==73265== If you believe this happened as a result of a stack ==73265== overflow in your program's main thread (unlikely but ==73265== possible), you can try to increase the size of the ==73265== main thread stack using the --main-stacksize= flag. ==73265== The main thread stack size used in this run was 16777216. ==73265== ==73265== HEAP SUMMARY: ==73265== in use at exit: 269,922,233 bytes in 692 blocks ==73265== total heap usage: 1,699 allocs, 1,007 frees, 270,522,703 bytes allocated ==73265== along with pid 73265 (memcheck-x86-freebs): sigreturn mc_flags fbbfe270 pid 73265 (memcheck-x86-freebs): sigreturn mc_flags fbbfe270 pid 73265 (memcheck-x86-freebs): sigreturn mc_flags 90909090 pid 73265 (memcheck-x86-freebs): sigreturn mc_flags 90909090 pid 73265 (memcheck-x86-freebs): sigreturn mc_flags 90909090 pid 73265 (memcheck-x86-freebs): sigreturn eflags = 0x0 pid 73265 (memcheck-x86-freebs): sigreturn mc_flags 1000 pid 73265 (memcheck-x86-freebs): sigreturn mc_flags 1000 Segmentation fault -- but per your request for trace syscalls, using the most recent version, yes it's using sys_poll, and yes, it's spewing: pid 73380 (memcheck-x86-freebs): sigreturn eflags = 0x200090 pid 73380 (memcheck-x86-freebs): sigreturn eflags = 0x200090 pid 73380 (memcheck-x86-freebs): sigreturn eflags = 0x200090 SYSCALL[73380,1](116) sys_gettimeofday ( 0xfbbfe82c, 0x0 )[sync] --> Success(0x0) SYSCALL[73380,1](209) sys_poll ( 0xfbbfe5e0, 1, -1 ) --> [async] ... valgrind: m_syswrap/syswrap-main.c:2130 (void vgPlain_client_syscall(ThreadId, UInt)): Assertion 'sci->status.what == SsIdle' failed. host stacktrace: ==73380== at 0x38116969: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd) ==73380== by 0x38116C46: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd) sched status: running_tid=1 Thread 1: status = VgTs_Runnable syscall 209 (lwpid 101239) ==73380== at 0x754AD3F: _sigprocmask (in /lib/libc.so.7) ==73380== by 0x77A32F9: ??? (src/lib/libthr/thread/thr_sig.c:288) ==73380== by 0x77A28BD: ??? (src/lib/libthr/thread/thr_sig.c:246) ==73380== by 0x381A05A3: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd) ==73380== by 0x7548F3E: poll (in /lib/libc.so.7) ==73380== by 0x7768659: ??? (in /usr/local/lib/libxcb.so.1.1.0) ==73380== by 0x776993E: xcb_writev (in /usr/local/lib/libxcb.so.1.1.0) ==73380== by 0x73B6452: _XSend (in /usr/local/lib/libX11.so.6.4.0) ==73380== by 0x73B6CE3: _XFlush (in /usr/local/lib/libX11.so.6.4.0) ==73380== by 0x7399439: XFlush (in /usr/local/lib/libX11.so.6.4.0) ==73380== by 0x446379: clipping_Xbitblt (xbbt.c:55) ==73380== by 0x436E2D: flush_display_region (initdsp.c:303) ==73380== by 0x417682: bitblt_bitmap (bbtsub.c:834) ==73380== by 0x429787: OP_subrcall (subr.c:385) ==73380== by 0x41FE75: dispatch (xc.c:653) ==73380== by 0x43468B: start_lisp (main.c:612) ==73380== by 0x43428D: main (main.c:559) client stack range: [0xFBBFB000 0xFBBFEFFF] client SP: 0xFBBFD8D0 valgrind stack range: [0x52C6000 0x53C5FFF] top usage: 6488 of 1048576 Note: see also the FAQ in the source distribution. It contains workarounds to several common problems. In particular, if Valgrind aborted or crashed after identifying problems in your program, there's a good chance that fixing those problems will prevent Valgrind aborting or crashing, especially if it happened in m_mallocfree.c. If that doesn't help, please report this bug to: www.valgrind.org In the bug report, send all the above text, the valgrind version, and what OS and version you are using. Thanks. and yes, I re-ran autogen.sh, configure, (compiling with clang) and rebuilt it all. -rwxr-xr-x 1 briggs briggs 562723 Apr 4 12:36 configure -rwxr-xr-x 1 briggs briggs 170925 Apr 4 12:03 configure.ac
OK thanks. In that case I have an idea. Back to https://github.com/paulfloyd/freebsd_valgrind/issues/137, the flow is signal raised -> pth_sighandler -> user sighandler -> return to pth_sighandler -> call sigreturn -> sigreturn fails -> annoying kernel message -> return to Valgrind trampiline -> syscall fake_sigreturn -> back to VEX CPU but what would happen if "sigreturn fails" doesn't fail? That's really bad news and will almost certainly result in a crash. We need it to fail in order to get back to fake_sigreturn. I'll try to see what happens if I do something like set bit 4 of mc_flags in the synthesized ucontext to ensure that sigreturn fails in the mc_flags check (which is the least-worst option at the moment).
Nick, could you try applying the following diff diff --git a/coregrind/m_syswrap/syswrap-x86-freebsd.c b/coregrind/m_syswrap/syswrap-x86-freebsd.c index 0ba304952..1283f637b 100644 --- a/coregrind/m_syswrap/syswrap-x86-freebsd.c +++ b/coregrind/m_syswrap/syswrap-x86-freebsd.c @@ -863,7 +863,7 @@ static void fill_mcontext(ThreadState *tst, struct vki_mcontext *sc) sc->es = tst->arch.vex.guest_ES; sc->fs = tst->arch.vex.guest_FS; sc->gs = tst->arch.vex.guest_GS; - sc->eflags = LibVEX_GuestX86_get_eflags(&tst->arch.vex); + sc->eflags = LibVEX_GuestX86_get_eflags(&tst->arch.vex) | 0x400000; /* not yet. VG_(memcpy)(&sc->fpstate, fpstate, sizeof(*fpstate)); 0x400000 is not used, but it is also not an allowed user flag that can change. Does this prevent the crash? Running x86 on amd64 I get pid 3067 (memcheck-x86-freebs): freebsd32_sigreturn eflags = 0x200091
It generates plenty of pid 79880 (memcheck-x86-freebs): sigreturn eflags = 0x200004 pid 79880 (memcheck-x86-freebs): sigreturn eflags = 0x200094 pid 79880 (memcheck-x86-freebs): sigreturn eflags = 0x200094 pid 79880 (memcheck-x86-freebs): sigreturn eflags = 0x200084 pid 79880 (memcheck-x86-freebs): sigreturn eflags = 0x200084 pid 79880 (memcheck-x86-freebs): sigreturn eflags = 0x200084 but it still crashes with SYSCALL[79880,1]( 27) sys_recvmsg ( 7, 0xfbbfe5ec, 0 ) --> [async] ... SYSCALL[79880,1]( 27) ... [async] --> Failure(0x23) SYSCALL[79880,1](209) sys_poll ( 0xfbbfe5e0, 1, -1 ) --> [async] ... SYSCALL[79880,1](209) ... [async] --> Success(0x1) SYSCALL[79880,1](121) sys_writev ( 7, 0xfbbfe6dc, 3 ) --> [async] ... SYSCALL[79880,1](121) ... [async] --> Success(0x48) SYSCALL[79880,1]( 27) sys_recvmsg ( 7, 0xfbbfe5ec, 0 ) --> [async] ... SYSCALL[79880,1]( 27) ... [async] --> Failure(0x23) SYSCALL[79880,1](209) sys_poll ( 0xfbbfe5e0, 1, -1 ) --> [async] ... SYSCALL[79880,1](209) ... [async] --> Success(0x1) SYSCALL[79880,1](121) sys_writev ( 7, 0xfbbfe6dc, 3 ) --> [async] ... SYSCALL[79880,1](121) ... [async] --> Success(0x48) SYSCALL[79880,1]( 27) sys_recvmsg ( 7, 0xfbbfe5ec, 0 ) --> [async] ... SYSCALL[79880,1]( 27) ... [async] --> Failure(0x23) SYSCALL[79880,1](209) sys_poll ( 0xfbbfe5e0, 1, -1 ) --> [async] ... valgrind: m_syswrap/syswrap-main.c:2130 (void vgPlain_client_syscall(ThreadId, UInt)): Assertion 'sci->status.what == SsIdle' failed. host stacktrace: ==79880== at 0x38116979: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd) ==79880== by 0x38116C56: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd) sched status: running_tid=1 Thread 1: status = VgTs_Runnable syscall 209 (lwpid 100159) ==79880== at 0x754AD3F: _sigprocmask (in /lib/libc.so.7) ==79880== by 0x77A32F9: ??? (src/lib/libthr/thread/thr_sig.c:288) ==79880== by 0x77A28BD: ??? (src/lib/libthr/thread/thr_sig.c:246) ==79880== by 0x381A05B3: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd) ==79880== by 0x7548F3E: poll (in /lib/libc.so.7) ==79880== by 0x7768659: ??? (in /usr/local/lib/libxcb.so.1.1.0) ==79880== by 0x776993E: xcb_writev (in /usr/local/lib/libxcb.so.1.1.0) ==79880== by 0x73B6452: _XSend (in /usr/local/lib/libX11.so.6.4.0) ==79880== by 0x73B6CE3: _XFlush (in /usr/local/lib/libX11.so.6.4.0) ==79880== by 0x7399439: XFlush (in /usr/local/lib/libX11.so.6.4.0) ==79880== by 0x446379: clipping_Xbitblt (xbbt.c:55) ==79880== by 0x436ECD: flush_display_lineregion (initdsp.c:346) ==79880== by 0x4194B0: newbltchar (bbtsub.c:1424) ==79880== by 0x429F67: OP_subrcall (subr.c:374) ==79880== by 0x41FE75: dispatch (xc.c:653) ==79880== by 0x43468B: start_lisp (main.c:612) ==79880== by 0x43428D: main (main.c:559) client stack range: [0xFBBFB000 0xFBBFEFFF] client SP: 0xFBBFD8D0 valgrind stack range: [0x52C6000 0x53C5FFF] top usage: 6488 of 1048576
I was beginning to suspect so. The crash site seems to be __sys_sigprocmask(SIG_SETMASK, &actp->sa_mask, NULL); So I assume that is somehow getting a junk pointer for the mask. Need to look at how __libc_sigaction_slot works.
Or more accurately, the guest seems to be performing (or just completed) sigprocmask but the host is starting to make a new syscall. The assert triggers because the host is expecting the syscall state to be idle but it mist be something else (complete or hand to kernel, or a corrupt value). Could you add something like VG_(printf)("DEBUG: %s %d\n", __func__, sci->status.what); just before the assert on line 2130 of m_syswrap/syswrap-main.c and report back as to whether 'what' is 1, 2 or junk?
It's 2 (SsHandToKernel) when it crashes.
That means that it is trying to start a new syscall whilst another is in flight. In the logs I see poll() then sigprocmask(). But that doesn't mean that the new syscall is sigprocmask, it could be some other syscall caused by an async signal. How easy would it be for me to build maiko? I mainly use VirtualBox and just a shell on platforms other than 13.0-RELEASE amd64. Otherwise, could you modify syswrap-main.c again vg_assert(sci->status.what == SsIdle); getSyscallArgsFromGuestState( &sci->orig_args, &tst->arch.vex, trc ); /* Copy .orig_args to .args. The pre-handler may modify .args, but we want to keep the originals too, just in case. */ sci->args = sci->orig_args; /* Save the syscall number in the thread state in case the syscall is interrupted by a signal. */ sysno = sci->orig_args.sysno; -> put a VG_(printf) of sysno here, and move the assert after the printf
It's easier for you to build maiko now than it used to be. See https://github.com/Interlisp/maiko but you also need a medley Interlisp release, https://github.com/interlisp/medley -- and it gets a bit tricky to get to the right place to cause this trouble. I'll go ahead and make the changes you asked for and report back.
DEBUG: vgPlain_client_syscall sysno 121 DEBUG: vgPlain_client_syscall sysno 27 DEBUG: vgPlain_client_syscall sysno 116 DEBUG: vgPlain_client_syscall sysno 116 DEBUG: vgPlain_client_syscall sysno 116 DEBUG: vgPlain_client_syscall sysno 240 DEBUG: vgPlain_client_syscall sysno 27 DEBUG: vgPlain_client_syscall sysno 116 DEBUG: vgPlain_client_syscall sysno 209 DEBUG: vgPlain_client_syscall sysno 121 DEBUG: vgPlain_client_syscall sysno 27 DEBUG: vgPlain_client_syscall sysno 209 DEBUG: vgPlain_client_syscall sysno 121 DEBUG: vgPlain_client_syscall sysno 27 DEBUG: vgPlain_client_syscall sysno 209 DEBUG: vgPlain_client_syscall sysno 121 DEBUG: vgPlain_client_syscall sysno 27 DEBUG: vgPlain_client_syscall sysno 209 DEBUG: vgPlain_client_syscall sysno 121 DEBUG: vgPlain_client_syscall sysno 27 DEBUG: vgPlain_client_syscall sysno 209 DEBUG: vgPlain_client_syscall 2 DEBUG: vgPlain_client_syscall sysno 340 valgrind: m_syswrap/syswrap-main.c:2142 (void vgPlain_client_syscall(ThreadId, UInt)): Assertion 'sci->status.what == SsIdle' failed. host stacktrace: ==94241== at 0x381169A9: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd) ==94241== by 0x38116C86: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd) sched status: running_tid=1 Thread 1: status = VgTs_Runnable syscall 340 (lwpid 100133) ==94241== at 0x754AD3F: _sigprocmask (in /lib/libc.so.7) ==94241== by 0x77A32F9: ??? (src/lib/libthr/thread/thr_sig.c:288) ==94241== by 0x77A28BD: ??? (src/lib/libthr/thread/thr_sig.c:246) ==94241== by 0x381A05E3: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd) ==94241== by 0x7548F3E: poll (in /lib/libc.so.7) ==94241== by 0x7768659: ??? (in /usr/local/lib/libxcb.so.1.1.0)
I turned back on the trace syscalls and ran it again (also fixed up the msg about the state) -- DEBUG: vgPlain_client_syscall sysno 209 SYSCALL[94860,1](209) sys_poll ( 0xfbbfe5c0, 1, -1 ) --> [async] ... SYSCALL[94860,1](209) ... [async] --> Success(0x1) DEBUG: vgPlain_client_syscall sysno 121 SYSCALL[94860,1](121) sys_writev ( 7, 0xfbbfe6bc, 3 ) --> [async] ... SYSCALL[94860,1](121) ... [async] --> Success(0x48) DEBUG: vgPlain_client_syscall sysno 27 SYSCALL[94860,1]( 27) sys_recvmsg ( 7, 0xfbbfe5cc, 0 ) --> [async] ... SYSCALL[94860,1]( 27) ... [async] --> Failure(0x23) DEBUG: vgPlain_client_syscall sysno 209 SYSCALL[94860,1](209) sys_poll ( 0xfbbfe5c0, 1, -1 ) --> [async] ... SYSCALL[94860,1](209) ... [async] --> Success(0x1) DEBUG: vgPlain_client_syscall sysno 121 SYSCALL[94860,1](121) sys_writev ( 7, 0xfbbfe6bc, 3 ) --> [async] ... DEBUG: vgPlain_client_syscall sci->status.what 2 DEBUG: vgPlain_client_syscall sysno 340 valgrind: m_syswrap/syswrap-main.c:2142 (void vgPlain_client_syscall(ThreadId, UInt)): Assertion 'sci->status.what == SsIdle' failed. host stacktrace: ==94860== at 0x381169B9: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd) ==94860== by 0x38116C96: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd) sched status: running_tid=1 Thread 1: status = VgTs_Runnable syscall 340 (lwpid 100133) ==94860== at 0x754AD3F: _sigprocmask (in /lib/libc.so.7) ==94860== by 0x77A32F9: ??? (src/lib/libthr/thread/thr_sig.c:288) ==94860== by 0x77A28BD: ??? (src/lib/libthr/thread/thr_sig.c:246) ==94860== by 0x381A05F3: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd) ==94860== by 0x754939E: writev (in /lib/libc.so.7) ==94860== by 0x7768808: ??? (in /usr/local/lib/libxcb.so.1.1.0) ==94860== by 0x776993E: xcb_writev (in /usr/local/lib/libxcb.so.1.1.0) ==94860== by 0x73B6452: _XSend (in /usr/local/lib/libX11.so.6.4.0)
In case it matters... I just looked at the maiko code that sets up the VTALRM signal, notice the SA_RESTART in the flags. struct itimerval timert; struct sigaction timer_action; timer_action.sa_handler = int_timer_service; sigemptyset(&timer_action.sa_mask); timer_action.sa_flags = SA_RESTART; if (sigaction(SIGVTALRM, &timer_action, NULL) == -1) { perror("sigaction: SIGVTALRM"); } /* then attach a timer to it and turn it loose */ timert.it_interval.tv_sec = timert.it_value.tv_sec = 0; timert.it_interval.tv_usec = timert.it_value.tv_usec = TIMER_INTERVAL; setitimer(ITIMER_VIRTUAL, &timert, NULL);
Red herring. Removing the SA_RESTART doesn't change anything.
Created attachment 148033 [details] C source test case This test case provokes the problem. cc -O2 -pipe vgtest2.c -o vgtest2 Running under valgrind: [...] DEBUG: vgPlain_client_syscall sysno 4 DEBUG: vgPlain_client_syscall sysno 4 DEBUG: vgPlain_client_syscall sysno 4 DEBUG: vgPlain_client_syscall sci->status.what 2 DEBUG: vgPlain_client_syscall sysno 1000 valgrind: m_syswrap/syswrap-main.c:2142 (void vgPlain_client_syscall(ThreadId, UInt)): Assertion 'sci->status.what == SsIdle' failed. host stacktrace: ==95237== at 0x381169B9: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd) ==95237== by 0x38116C96: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd) sched status: running_tid=1 Thread 1: status = VgTs_Runnable syscall 1000 (lwpid 100137) ==95237== at 0x381A0601: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd) ==95237== by 0x4018FD: main (in /var/tmp/vgtest2) client stack range: [0xFBBFC000 0xFBBFEFFF] client SP: 0xFBBFE1E8 valgrind stack range: [0x51BE000 0x52BDFFF] top usage: 5792 of 1048576
I just double checked that there are no problems with https://bugs.kde.org/show_bug.cgi?id=445032 which is fairly similar. Works OK with i386/ amd64, no pthread and with pthread. This testcase _fails_ on all 4 combinations. Here, amd64 no pthread valgrind: m_syswrap/syswrap-main.c:2130 (void vgPlain_client_syscall(ThreadId, UInt)): Assertion 'sci->status.what == SsIdle' failed. host stacktrace: ==39165== at 0x38106032: ??? (in /usr/home/paulf/scratch/valgrind/memcheck/memcheck-amd64-freebsd) ==39165== by 0x402AA9FDF: ??? ==39165== by 0x7C: ??? ==39165== by 0x38106031: ??? (in /usr/home/paulf/scratch/valgrind/memcheck/memcheck-amd64-freebsd) ==39165== by 0x402AA984F: ??? ==39165== by 0x40200838F: ??? sched status: running_tid=1 Thread 1: status = VgTs_Runnable syscall 4 (lwpid 102465) ==39165== at 0x3819A5D8: ??? (in /usr/home/paulf/scratch/valgrind/memcheck/memcheck-amd64-freebsd) ==39165== by 0x2017AF: (below main) (src/lib/csu/amd64/crt1_c.c:75) client stack range: [0x7FBFFC000 0x7FC000FFF] client SP: 0x7FBFFF120 valgrind stack range: [0x4029AA000 0x402AA9FFF] top usage: 8312 of 1048576 With ktrace 39225 memcheck-amd64-free RET read 1 39225 memcheck-amd64-free CALL thr_self(0x402aa97c8) 39225 memcheck-amd64-free RET thr_self 0 39225 memcheck-amd64-free CALL sigprocmask(SIG_SETMASK,0x402aa9e70,0) 39225 memcheck-amd64-free RET sigprocmask 0 39225 memcheck-amd64-free CALL sigprocmask(SIG_SETMASK,0x402aa9e08,0x402aa9df8) 39225 memcheck-amd64-free RET sigprocmask 0 39225 memcheck-amd64-free CALL sigtimedwait(0x402aa9de8,0x402aa9e08,0x38064a58) 39225 memcheck-amd64-free RET sigtimedwait -1 errno 35 Resource temporarily unavailable 39225 memcheck-amd64-free CALL sigprocmask(SIG_SETMASK,0x402aa9df8,0) 39225 memcheck-amd64-free RET sigprocmask 0 39225 memcheck-amd64-free CALL write(0x72c98,0x402aa99f8,0x87) 39225 memcheck-amd64-free GIO fd 470168 wrote 135 bytes " valgrind: m_syswrap/syswrap-main.c:2130 (void vgPlain_client_syscall(ThreadId, UInt)): Assertion 'sci->status.what == SsIdle' failed.
So what I think is happening is that a non-blocking write is occurring which results in an immediate return with EAGAIN (Failure(0x23) in the logs). Nevertheless the syscall is being marked as SfMayBlock. m_signals.c has this comment So what we have to do, when doing any syscall which SfMayBlock, is to quickly switch in the SCSS-specified signal mask just before the syscall, and switch it back just afterwards, and hope that we don't get caught up in some weird race condition. This is the primary purpose of the ultra-magical pieces of assembly code in coregrind/m_syswrap/syscall-<plat>.S I'm not sure if this is such a weird race condition, or whether it was just wrong to set SfMayBlock. I tried commenting out *flags |= SfMayBlock; in PRE(sys_write), syswrap-generic.c line 4280. With the the small testcase works. I doubt that this is a real solution, it may well cause hangs in other situations. I then tried to be a bit more clever with int fd_flags = VG_(fcntl(ARG1, VKI_F_GETFL, 0)); if ((fd_flags & VKI_O_NONBLOCK) == 0) *flags |= SfMayBlock; but for fd 2 I just see flags of 0xa (write and append).
I'm not sure why the write() would be flagged non-blocking. You can replace the write(...) in the test with read(0, NULL, 0) and get the same result. I also tried int fd = open("/tmp/dummy", O_WRONLY, O_CREAT); if (fd != -1) close(fd); and got the failure. I wonder if it's possible that it fails with any of the syscalls that are in the list in sigaction's SA_RESTART (although actually *using* SA_RESTART doesn't seem to affect the test case) %%%%% The affected system calls include open(2), read(2), write(2), sendto(2), recvfrom(2), sendmsg(2) and recvmsg(2) on a communications channel or a slow device (such as a terminal, but not a regular file) and during a wait(2) or ioctl(2). However, calls that have already committed are not restarted, but instead return a partial success (for example, a short read count). %%%%% I've made it fail with read(), write(), open(), wait(), but neither stat() nor usleep().
Here's what happens with --trace-signals=yes and using the write(...) as the thing to be interrupted: DEBUG: vgPlain_client_syscall sysno 4 DEBUG: vgPlain_client_syscall sysno 4 DEBUG: vgPlain_client_syscall sysno 4 DEBUG: vgPlain_client_syscall sysno 4 DEBUG: vgPlain_client_syscall sysno 4 DEBUG: vgPlain_client_syscall sysno 4 --12231-- async signal handler: signal=26, vgtid=100118, tid=1, si_code=65542, exitreason VgSrc_None --12231-- interrupted_syscall: tid=1, ip=0x381a1e02, restart=False, sres.isErr=False, sres.val=0 --12231-- not started: restarting --12231-- delivering signal 26 (SIGVTALRM):65542 to thread 1 --12231-- push_signal_frame (thread 1): signal 26 ==12231== at 0x7303DBD: _write (in /lib/libc.so.7) ==12231== by 0x730236E: write (in /lib/libc.so.7) ==12231== by 0x4018ED: main (in /var/tmp/vgtest2) DEBUG: vgPlain_client_syscall sysno 1000 --12231-- VG_(signal_return) (thread 1): EIP=0x7303dbd DEBUG: vgPlain_client_syscall sysno 4 DEBUG: vgPlain_client_syscall sysno 4 DEBUG: vgPlain_client_syscall sysno 4 --12231-- async signal handler: signal=26, vgtid=100118, tid=1, si_code=65542, exitreason VgSrc_None --12231-- interrupted_syscall: tid=1, ip=0x381e4921, restart=False, sres.isErr=False, sres.val=0 --12231-- not in syscall at all: hmm, very suspicious --12231-- delivering signal 26 (SIGVTALRM):65542 to thread 1 --12231-- push_signal_frame (thread 1): signal 26 ==12231== at 0x7303DBF: _write (in /lib/libc.so.7) ==12231== by 0x730236E: write (in /lib/libc.so.7) ==12231== by 0x4018ED: main (in /var/tmp/vgtest2) DEBUG: vgPlain_client_syscall sci->status.what 2 DEBUG: vgPlain_client_syscall sysno 1000 valgrind: m_syswrap/syswrap-main.c:2142 (void vgPlain_client_syscall(ThreadId, UInt)): Assertion 'sci->status.what == SsIdle' failed. host stacktrace: ==12231== at 0x38116B19: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd) ==12231== by 0x38116DF6: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd) sched status: running_tid=1 Thread 1: status = VgTs_Runnable syscall 1000 (lwpid 100118) ==12231== at 0x381A0821: ??? (in /usr/home/briggs/valgrind/memcheck/memcheck-x86-freebsd) ==12231== by 0x4018ED: main (in /var/tmp/vgtest2) client stack range: [0xFBBFC000 0xFBBFEFFF] client SP: 0xFBBFE1E8 valgrind stack range: [0x51BE000 0x52BDFFF] top usage: 5792 of 1048576 Where the "very suspicious" message is from coregrind/m_syswrap/syswrap-main.c
I threw in a VG_(printf): @@ -3117,6 +3119,7 @@ VG_(fixup_guest_state_after_syscall_interrupted)( ThreadId tid, if (VG_(clo_trace_signals)) VG_(message)( Vg_DebugMsg, " not in syscall at all: hmm, very suspicious\n" ); + VG_(printf)("DEBUG: blksys_setup %ld ip %ld blksys_finished %ld\n", ML_(blksys_setup), ip, ML_(blksys_finished)); /* Looks like we weren't in a syscall at all. Hmm. */ vg_assert(sci->status.what != SsIdle); return; and see: DEBUG: vgPlain_client_syscall sysno 4 DEBUG: vgPlain_client_syscall sysno 4 DEBUG: vgPlain_client_syscall sysno 4 DEBUG: vgPlain_client_syscall sysno 4 DEBUG: blksys_setup 941235726 ip 941508929 blksys_finished 941235863 DEBUG: vgPlain_client_syscall sci->status.what 2 DEBUG: vgPlain_client_syscall sysno 1000 and that ip is in x86g_calculate_eflags_all_WRK !
The write man page says [EAGAIN] The file was marked for non-blocking I/O, and no data could be written immediately. So I was assuming that fd 2 to the terminal was nonblocking.
(In reply to Nick Briggs from comment #22) > I threw in a VG_(printf): > > @@ -3117,6 +3119,7 @@ VG_(fixup_guest_state_after_syscall_interrupted)( > ThreadId tid, > if (VG_(clo_trace_signals)) > VG_(message)( Vg_DebugMsg, > " not in syscall at all: hmm, very suspicious\n" ); > + VG_(printf)("DEBUG: blksys_setup %ld ip %ld blksys_finished %ld\n", > ML_(blksys_setup), ip, ML_(blksys_finished)); > /* Looks like we weren't in a syscall at all. Hmm. */ > vg_assert(sci->status.what != SsIdle); > return; > > and see: > > DEBUG: vgPlain_client_syscall sysno 4 > DEBUG: vgPlain_client_syscall sysno 4 > DEBUG: vgPlain_client_syscall sysno 4 > DEBUG: vgPlain_client_syscall sysno 4 > DEBUG: blksys_setup 941235726 ip 941508929 blksys_finished 941235863 > DEBUG: vgPlain_client_syscall sci->status.what 2 > DEBUG: vgPlain_client_syscall sysno 1000 > > and that ip is in x86g_calculate_eflags_all_WRK ! Very interesting.
The flags seem to be used. Now how to determine if we are in either of those functions? As a quick hack for testing putting #if defined(VGO_freebsd) if (outside_range) { outside_range = False; in_complete_to_committed = True; } #endif just before the check for outside_range seems to make things work.
> Now how to determine if we are in either of those functions? I think it's actually any of 4 functions: 3805cd30 R vgModuleLocal_blksys_setup 3805cd34 R vgModuleLocal_blksys_restart 3805cd38 R vgModuleLocal_blksys_complete 3805cd3c R vgModuleLocal_blksys_committed 3805cd40 R vgModuleLocal_blksys_finished It's checking outside_range = ip < ML_(blksys_setup) || ip >= ML_(blksys_finished) so being in any of blksys_setup, blksys_restart, blksys_complete, blksys_committed is supposed to be OK. 40 years ago I had to debug this sort of junk in my own assembler code with an in-circuit emulator. Things haven't improved much, have they!
Didn't make it into 3.19.0, but this should now be OK paulf> git push Enumerating objects: 40, done. Counting objects: 100% (40/40), done. Delta compression using up to 4 threads Compressing objects: 100% (22/22), done. Writing objects: 100% (24/24), 3.95 KiB | 3.95 MiB/s, done. Total 24 (delta 17), reused 0 (delta 0), pack-reused 0 To ssh://sourceware.org/git/valgrind.git 68ac5ca0e..3e7774ea5 master -> master
Looks good. Thanks! I'd suggest, though, using something other than " " in the test (write(2, " ", 1);) so that it's more obvious if the output doesn't match, rather than trying to count spaces ;-) Now on to the next problem where the X library chokes sometimes when SIGVTALRMs go off -- not sure that it's just valgrind though.
I made 2 changes to the testcase. Reduced the number of ticks to make the runtime a lot shorter. And added a stderr filter to remove all those spaces. Without the stderr filter, the filter is unnecessarily large, and the testcase also non-determinisitic. I think that the important thing is to be spamming a fairly slow syscall like `write` whilst having the timer firing ang generating interrupts.