Created attachment 143251
C test case for valgrind segmentation fault

SUMMARY

A program which uses an interval timer (e.g., ITIMER_VIRTUAL w/ SIGVTALRM handler), linked with libthr.so, will take a SIGSEGV when run under valgrind

STEPS TO REPRODUCE

1. Compile provided sample, vgtest.c -- cc is FreeBSD clang version 11.0.1
   cc -pthread /var/tmp/vgtest.c -o /var/tmp/vgtest
2. Run it under valgrind
   valgrind /var/tmp/vgtest
3. Compiling the same test case without -pthread option runs without error.

OBSERVED RESULT

$ ldd /var/tmp/vgtest
/var/tmp/vgtest:
        libthr.so.3 => /lib/libthr.so.3 (0x20442000)
        libc.so.7 => /lib/libc.so.7 (0x2046b000)
$ valgrind /var/tmp/vgtest
==28547== Memcheck, a memory error detector
==28547== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==28547== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==28547== Command: /var/tmp/vgtest
==28547==
==28547== Invalid read of size 4
==28547==    at 0x71FFB9B: ??? (in /lib/libthr.so.3)
==28547==    by 0x71FF16F: ??? (in /lib/libthr.so.3)
==28547==    by 0x3819FB73: ??? (in /usr/local/libexec/valgrind/memcheck-x86-freebsd)
==28547==    by 0x72B973E: sleep (in /lib/libc.so.7)
==28547==    by 0x4018F2: main (in /var/tmp/vgtest)
==28547==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==28547==
==28547==
==28547== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==28547==  Access not within mapped region at address 0x0
==28547==    at 0x71FFB9B: ??? (in /lib/libthr.so.3)
==28547==    by 0x71FF16F: ??? (in /lib/libthr.so.3)
==28547==    by 0x3819FB73: ??? (in /usr/local/libexec/valgrind/memcheck-x86-freebsd)
==28547==    by 0x72B973E: sleep (in /lib/libc.so.7)
==28547==    by 0x4018F2: main (in /var/tmp/vgtest)
==28547==  If you believe this happened as a result of a stack
==28547==  overflow in your program's main thread (unlikely but
==28547==  possible), you can try to increase the size of the
==28547==  main thread stack using the --main-stacksize= flag.
==28547==  The main thread stack size used in this run was 16777216.
==28547==
==28547== HEAP SUMMARY:
==28547==     in use at exit: 724 bytes in 2 blocks
==28547==   total heap usage: 2 allocs, 0 frees, 724 bytes allocated
==28547==
==28547== LEAK SUMMARY:
==28547==    definitely lost: 0 bytes in 0 blocks
==28547==    indirectly lost: 0 bytes in 0 blocks
==28547==      possibly lost: 0 bytes in 0 blocks
==28547==    still reachable: 724 bytes in 2 blocks
==28547==         suppressed: 0 bytes in 0 blocks
==28547== Rerun with --leak-check=full to see details of leaked memory
==28547==
==28547== For lists of detected and suppressed errors, rerun with: -s
==28547== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 1 from 1)
Segmentation fault
$

EXPECTED RESULT

$ valgrind /var/tmp/vgtest
==28579== Memcheck, a memory error detector
==28579== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==28579== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==28579== Command: /var/tmp/vgtest
==28579==
==28579==
==28579== HEAP SUMMARY:
==28579==     in use at exit: 0 bytes in 0 blocks
==28579==   total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==28579==
==28579== All heap blocks were freed -- no leaks are possible
==28579==
==28579== For lists of detected and suppressed errors, rerun with: -s
==28579== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 1 from 1)

SOFTWARE/OS VERSIONS

$ uname -a
FreeBSD flap.gateway.sonic.net 13.0-RELEASE-p4 FreeBSD 13.0-RELEASE-p4 #0: Tue Aug 24 18:58:48 UTC 2021 root@amd64-builder.daemonology.net:/usr/obj/usr/src/i386.i386/sys/GENERIC i386
$ valgrind --version
valgrind-3.18.1
(also occurs with valgrind compiled from the latest git sources, commit 3950c5d661ee09526cddcf24daf5fc22bc83f70c)

ADDITIONAL INFORMATION

Your software version fields could do with an update -- most recent listed is 3.15 but valgrind is at 3.18.1 released, and 3.19.0 in git.

Might be related to https://github.com/paulfloyd/freebsd_valgrind/issues/137
I can't reproduce with this testcase using FreeBSD 13.0 in a VirtualBox vm.
I was running this on

CPU: Intel(R) Pentium(R) M processor 1.70GHz (966.40-MHz 686-class CPU)
  Origin="GenuineIntel"  Id=0x6d6  Family=0x6  Model=0xd  Stepping=6
  Features=0xafe9f9bf<FPU,VME,DE,PSE,TSC,MSR,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV,PAT,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,TM,PBE>
  Features2=0x180<EST,TM2>

I can try the VirtualBox route also and see if it's reproducible here, on different Intel hardware -- was your test case on AMD or Intel?
It doesn't fail in a VirtualBox running FreeBSD 13.0 on hardware that reports out as:

CPU: Intel(R) Core(TM)2 Duo CPU T9800 @ 2.93GHz (2918.76-MHz 686-class CPU)
  Origin="GenuineIntel"  Id=0x1067a  Family=0x6  Model=0x17  Stepping=10
  Features=0x1783fbbf<FPU,VME,DE,PSE,TSC,MSR,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x82209<SSE3,MON,SSSE3,CX16,SSE4.1>
  AMD Features2=0x1<LAHF>
Looks like we are both using fairly old kit. My VM has

CPU: Intel(R) Xeon(R) CPU W3520 @ 2.67GHz (2666.79-MHz 686-class CPU)
  Origin="GenuineIntel"  Id=0x106a5  Family=0x6  Model=0x1a  Stepping=5
  Features=0x1783fbbf<FPU,VME,DE,PSE,TSC,MSR,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x182209<SSE3,MON,SSSE3,CX16,SSE4.1,SSE4.2>
  AMD Features=0x8000000<RDTSCP>
  AMD Features2=0x1<LAHF>
  TSC: P-state invariant
paulf@freebsd:~/scratch/sigreturn $ clang -o pthread_sigreturn_clang -pthread sigreturn.c
paulf@freebsd:~/scratch/sigreturn $ valgrind ./pthread_sigreturn_clang
==1910== Memcheck, a memory error detector
==1910== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==1910== Using Valgrind-3.17.0 and LibVEX; rerun with -h for copyright info
==1910== Command: ./pthread_sigreturn_clang
==1910==
==1910==
==1910== HEAP SUMMARY:
==1910==     in use at exit: 728 bytes in 2 blocks
==1910==   total heap usage: 2 allocs, 0 frees, 728 bytes allocated
==1910==
==1910== LEAK SUMMARY:
==1910==    definitely lost: 0 bytes in 0 blocks
==1910==    indirectly lost: 0 bytes in 0 blocks
==1910==      possibly lost: 0 bytes in 0 blocks
==1910==    still reachable: 728 bytes in 2 blocks
==1910==         suppressed: 0 bytes in 0 blocks
==1910== Rerun with --leak-check=full to see details of leaked memory
==1910==
==1910== For lists of detected and suppressed errors, rerun with: -s
==1910== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Time to build myself an i386 kernel
paulf@freebsd:~/scratch/sigreturn $ valgrind ./pthread_sigreturn_clang
==866== Memcheck, a memory error detector
==866== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==866== Using Valgrind-3.17.0 and LibVEX; rerun with -h for copyright info
==866== Command: ./pthread_sigreturn_clang
==866==
==866== Invalid read of size 4
==866==    at 0x720526B: ??? (in /lib/libthr.so.3)
==866==    by 0x72048BD: ??? (in /lib/libthr.so.3)
==866==    by 0x381A64F3: ??? (in /usr/local/libexec/valgrind/memcheck-x86-freebsd)
==866==    by 0x72B973E: sleep (in /lib/libc.so.7)
==866==    by 0x4018F2: main (in /usr/home/paulf/scratch/sigreturn/pthread_sigreturn_clang)
==866==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==866==
==866==
==866== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==866==  Access not within mapped region at address 0x0
==866==    at 0x720526B: ??? (in /lib/libthr.so.3)
==866==    by 0x72048BD: ??? (in /lib/libthr.so.3)
==866==    by 0x381A64F3: ??? (in /usr/local/libexec/valgrind/memcheck-x86-freebsd)
==866==    by 0x72B973E: sleep (in /lib/libc.so.7)
==866==    by 0x4018F2: main (in /usr/home/paulf/scratch/sigreturn/pthread_sigreturn_clang)
==866==  If you believe this happened as a result of a stack
==866==  overflow in your program's main thread (unlikely but
==866==  possible), you can try to increase the size of the
==866==  main thread stack using the --main-stacksize= flag.
==866==  The main thread stack size used in this run was 16777216.

To get this I changed the ASLR sysctls

paulf@freebsd:~/scratch/sigreturn $ sysctl -a | grep -i aslr
kern.elf32.aslr.stack_gap: 0 (default is 3)
kern.elf32.aslr.honor_sbrk: 1
kern.elf32.aslr.pie_enable: 1 (default is 0)
kern.elf32.aslr.enable: 1 (default is 0)
vm.aslr_restarts: 0

Are you using ASLR as above?
No ASLR in use here -- I've got the default settings:

```
$ sysctl -a | grep aslr
kern.elf32.aslr.stack_gap: 3
kern.elf32.aslr.honor_sbrk: 1
kern.elf32.aslr.pie_enable: 0
kern.elf32.aslr.enable: 0
vm.aslr_restarts: 0
```
I compiled a debug version of libthr.so.3 which produces this stack trace on failure:

```
$ valgrind ./vgtest-thr
==56575== Memcheck, a memory error detector
==56575== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==56575== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==56575== Command: ./vgtest-thr
==56575==
==56575== Invalid read of size 4
==56575==    at 0x720526B: ??? (src/lib/libthr/thread/thr_sig.c:262)
==56575==    by 0x72048BD: ??? (src/lib/libthr/thread/thr_sig.c:246)
==56575==    by 0x3819FB73: ??? (in /usr/local/libexec/valgrind/memcheck-x86-freebsd)
==56575==    by 0x73B973E: sleep (in /lib/libc.so.7)
==56575==    by 0x4018F2: main (in /var/tmp/vgtest-thr)
==56575==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==56575==
```

thr_sig.c:246 is within `thr_sighandler(int sig, siginfo_t *info, void *_ucp)` where it calls `handle_signal(&act, sig, info, ucp);` which immediately dies on

```
	/* add previous level mask */
	SIGSETOR(actp->sa_mask, ucp->uc_sigmask);
```
I'm not sure why, but I only intermittently get VTALRM signals when running under gdb.

thr_sig.c:246 is in

static void
thr_sighandler(int sig, siginfo_t *info, void *_ucp)

Looking at the Valgrind source, this is "called" here (in m_signals.c)

   /* Create a signal delivery frame, and set the client's %ESP and
      %EIP so that when execution continues, we will enter the
      signal handler with the frame on top of the client's stack,
      as it expects.

      Signal delivery can fail if the client stack is too small or
      missing, and we can't push the frame.  If that happens,
      push_signal_frame will cause the whole process to exit when
      we next hit the scheduler.
   */
   vg_assert(VG_(is_valid_tid)(tid));

   push_signal_frame ( tid, info, uc );

So rather than calling the signal handler, Valgrind is directly manipulating the client stack and then just resuming execution.

In gdb, I can get some useful debugging done with the separate debuginfo:

(gdb) set directories /usr/src/lib/libthr/thread

and

(gdb) add-symbol-file /usr/lib/debug/lib/libthr.so.3.debug

(they don't seem to be searched for by default on my amd64 system)

Here's the trampoline for the syscall to sigreturn

  > 0xffffe194  lea    0x20(%esp),%eax
    0xffffe198  push   %eax
    0xffffe199  mov    $0x1a1,%eax
    0xffffe19e  push   %eax
    0xffffe19f  int    $0x80

I'm a bit confused as to why the above (calling syscall 417 sigreturn) then calls thr_sighandler. I would have thought that the handler would be called first and then sigreturn after the user code. Maybe it's used for both.

I see that the arguments are passed to thr_sighandler in ebx/edi/esi

Something similar can be seen in i386/i386/sigtramp.s

/*
 * Signal trampoline, copied to top of user stack
 */
NON_GPROF_ENTRY(sigcode)
        calll   *SIGF_HANDLER(%esp)
        leal    SIGF_UC(%esp),%eax      /* get ucontext */
        pushl   %eax
        testl   $PSL_VM,UC_EFLAGS(%eax)
        jne     1f
        mov     UC_GS(%eax),%gs         /* restore %gs */
1:
        movl    $SYS_sigreturn,%eax
        pushl   %eax                    /* junk to fake return addr. */
        int     $0x80                   /* enter kernel with args */
                                        /* on stack */

where

ASSYM(SIGF_UC, offsetof(struct sigframe, sf_uc));

SIGF_UC will be 20 (0x14)

For comparison in Valgrind

        lea     0x1c(%esp), %eax        /* args to sigreturn(ucontext_t *) */
        pushl   %eax
        pushl   %eax                    /* fake return addr */
        movl    $__NR_fake_sigreturn, %eax
        int     $0x80

The Valgrind version of 'struct sigframe' has an extra Address member at the beginning. That gives us an offset of 24 (0x18), but there is also all of the fiddling with esp that I do in build_sigframe, which is where another 4 gets added [28, 0x1c, matching the lea above]. I think.
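The offset arithmetic in this comment can be checked with `offsetof` on a simplified model. Everything below is an assumption for illustration: the struct and field names are invented, 32-bit words stand in for i386 registers and pointers, five words before the context are implied by the quoted SIGF_UC value of 20, and any alignment padding a real `ucontext_t` might need is ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder for the saved context; only its position matters here. */
struct model_ucontext {
    uint32_t words[4];
};

/* Model of the kernel frame: five 32-bit words, then the context. */
struct model_kernel_sigframe {
    uint32_t signum;
    uint32_t siginfo_ptr;
    uint32_t ucontext_ptr;
    uint32_t addr;
    uint32_t handler;
    struct model_ucontext uc;
};

/* Model of the Valgrind frame: the same layout with one extra leading word. */
struct model_valgrind_sigframe {
    uint32_t extra_addr;        /* the extra Address member at the beginning */
    uint32_t signum;
    uint32_t siginfo_ptr;
    uint32_t ucontext_ptr;
    uint32_t addr;
    uint32_t handler;
    struct model_ucontext uc;
};

/* Offset of the context in the kernel-style model: 5 * 4 = 20 (0x14). */
size_t kernel_uc_offset(void)
{
    return offsetof(struct model_kernel_sigframe, uc);
}

/* Offset of the context in the Valgrind-style model: 6 * 4 = 24 (0x18). */
size_t valgrind_uc_offset(void)
{
    return offsetof(struct model_valgrind_sigframe, uc);
}

/* Adding the 4 bytes of %esp adjustment mentioned above gives 28 (0x1c). */
size_t valgrind_lea_displacement(void)
{
    return valgrind_uc_offset() + 4;
}
```

Under these assumptions the last value, 28, is the `0x1c` displacement used by the `lea` in the Valgrind trampoline quoted above.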
And now I think that I have a fix, at least for the crash

diff --git a/coregrind/m_sigframe/sigframe-x86-freebsd.c b/coregrind/m_sigframe/sigframe-x86-freebsd.c
index a1d8638e5..12f51e385 100644
--- a/coregrind/m_sigframe/sigframe-x86-freebsd.c
+++ b/coregrind/m_sigframe/sigframe-x86-freebsd.c
@@ -304,6 +304,8 @@ static Addr build_sigframe(ThreadState *tst,
       err = 0;
    }
 
+   frame->puContext = (Addr)&frame->uContext;
+
    synth_ucontext(tst->tid, siginfo, trapno, err, mask,
                   &frame->uContext, &frame->fpstate);
Fixed with

commit 4b8eddfde14291e288a7017edce5c7225e1533d6
Author: Paul Floyd <pjfloyd@wanadoo.fr>
Date:   Tue Nov 9 23:11:15 2021 +0100

However I do now get similar kernel messages to those mentioned in the linked github issue. That's another problem though.
I've rebuilt based on the latest git version, and concur that the SIGSEGV is fixed. Now it just offers up a continuous stream of "sigreturn eflags..."

% grep memcheck-x86 /tmp/out | sort | uniq -c
  93 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200000
  77 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200004
  46 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200010
   4 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200011
  31 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200014
  43 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200044
   4 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200055
 224 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200080
  12 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200081
 240 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200084
  52 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200085
  20 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200090
  12 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200091
  20 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200094
  82 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200095
  70 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200801
 128 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200805
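To make the logged values easier to read, here is a small decoder for the x86 EFLAGS bits that appear in them. The bit positions are the architectural ones; the helper name `decode_eflags` and the table are made up for this sketch, and it says nothing about which bits the kernel's sigreturn check actually compares.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct flag_name {
    uint32_t bit;
    const char *name;
};

/* Architectural x86 EFLAGS bits, lowest first. */
static const struct flag_name eflags_names[] = {
    { 0x000001u, "CF" }, { 0x000004u, "PF" }, { 0x000010u, "AF" },
    { 0x000040u, "ZF" }, { 0x000080u, "SF" }, { 0x000100u, "TF" },
    { 0x000200u, "IF" }, { 0x000400u, "DF" }, { 0x000800u, "OF" },
    { 0x200000u, "ID" },
};

/*
 * Hypothetical helper: write the space-separated names of the flags set in
 * `eflags` into `buf` (always NUL-terminated when buflen > 0) and return buf.
 */
char *decode_eflags(uint32_t eflags, char *buf, size_t buflen)
{
    size_t used = 0;
    size_t i;

    if (buflen == 0)
        return buf;
    buf[0] = '\0';

    for (i = 0; i < sizeof eflags_names / sizeof eflags_names[0]; i++) {
        int n;

        if ((eflags & eflags_names[i].bit) == 0)
            continue;
        n = snprintf(buf + used, buflen - used, "%s%s",
                     used ? " " : "", eflags_names[i].name);
        if (n < 0 || (size_t)n >= buflen - used)
            break;              /* out of room: stop with what fits */
        used += (size_t)n;
    }
    return buf;
}
```

For example, `decode_eflags(0x200805, buf, sizeof buf)` yields "CF PF OF ID". Decoded this way, every value in the list has ID (0x200000) set and none has IF (0x200) set.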
The kernel messages are because now it 'works as unintended'. It's a bit annoying as this is the kernel writing to the console so you can't redirect it to /dev/null. It's also a bit delicate because if the eflags test gets fixed then the following CS test is do-or-die and will cause a SIGBUS if it fails rather than just failing with a kernel message.
(In reply to Paul Floyd from comment #13)
> The kernel messages are because now it 'works as unintended'.
>
> It's a bit annoying as this is the kernel writing to the console so you
> can't redirect it to /dev/null. It's also a bit delicate because if the
> eflags test gets fixed then the following CS test is do-or-die and will
> cause a SIGBUS if it fails rather than just failing with a kernel message.

Perhaps I should put in a request for a FreeBSD kernel option...

sysctl kern.eflags.stfu=1

so one could acknowledge that one knows, but the kernel should stop whinging about it. ;-)