445032 – valgrind/memcheck crash with SIGSEGV when SIGVTALRM timer used and libthr.so associated

Status:	RESOLVED FIXED

Alias:	None

Product:	valgrind
Classification:	Developer tools
Component:	memcheck (other bugs)
Version First Reported In:	unspecified
Platform:	FreeBSD Ports FreeBSD

Importance:	NOR crash
Target Milestone:	---
Assignee:	Paul Floyd

Reported:	2021-11-05 17:11 UTC by Nick Briggs
Modified:	2021-11-10 16:10 UTC
CC List:	1 user

Attachments
C test case for valgrind segmentation fault (648 bytes, text/x-csrc) 2021-11-05 17:11 UTC, Nick Briggs

Description Nick Briggs 2021-11-05 17:11:58 UTC

Created attachment 143251
C test case for valgrind segmentation fault

SUMMARY
A program that uses an interval timer (e.g., ITIMER_VIRTUAL with a SIGVTALRM handler) and is linked with libthr.so takes a SIGSEGV when run under valgrind.

STEPS TO REPRODUCE
1. Compile provided sample, vgtest.c -- cc is FreeBSD clang version 11.0.1
     cc -pthread /var/tmp/vgtest.c  -o /var/tmp/vgtest
2. Run it under valgrind
     valgrind /var/tmp/vgtest
3. For comparison, the same test case compiled without the -pthread option runs under valgrind without error.
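
The attached vgtest.c is not reproduced in this report. As an illustration only, a minimal sketch of the kind of program described (an ITIMER_VIRTUAL timer with a SIGVTALRM handler) might look like the following. The names `on_vtalrm` and `run_vtalrm_test` are hypothetical and this is not the attachment. Unlike the backtrace below, where main is inside sleep(), the sketch spins in user mode so that the virtual timer is certain to advance.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>

/* Number of SIGVTALRM deliveries seen so far. */
static volatile sig_atomic_t vtalrm_ticks = 0;

/* Three-argument handler, as installed with SA_SIGINFO. */
static void on_vtalrm(int sig, siginfo_t *info, void *ucp)
{
    (void)sig; (void)info; (void)ucp;
    vtalrm_ticks++;
}

/* Hypothetical helper: arm a 10 ms virtual timer, wait for the
 * requested number of ticks, disarm, and return the tick count.
 * Returns -1 if the handler or timer cannot be installed. */
int run_vtalrm_test(int wanted_ticks)
{
    struct sigaction sa;
    struct itimerval it;

    assert(wanted_ticks > 0);

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_vtalrm;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGVTALRM, &sa, NULL) != 0)
        return -1;

    it.it_interval.tv_sec = 0;
    it.it_interval.tv_usec = 10000;
    it.it_value = it.it_interval;
    if (setitimer(ITIMER_VIRTUAL, &it, NULL) != 0)
        return -1;

    while (vtalrm_ticks < wanted_ticks)
        ;   /* burn user CPU time so the virtual timer runs */

    memset(&it, 0, sizeof it);      /* zero value disarms the timer */
    setitimer(ITIMER_VIRTUAL, &it, NULL);
    return (int)vtalrm_ticks;
}
```

Calling `run_vtalrm_test(3)` from `main` and building with `cc -pthread` should link libthr.so on FreeBSD, which is the configuration reported as crashing.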

OBSERVED RESULT
$ ldd /var/tmp/vgtest
/var/tmp/vgtest:
	libthr.so.3 => /lib/libthr.so.3 (0x20442000)
	libc.so.7 => /lib/libc.so.7 (0x2046b000)
$ valgrind /var/tmp/vgtest
==28547== Memcheck, a memory error detector
==28547== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==28547== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==28547== Command: /var/tmp/vgtest
==28547== 
==28547== Invalid read of size 4
==28547==    at 0x71FFB9B: ??? (in /lib/libthr.so.3)
==28547==    by 0x71FF16F: ??? (in /lib/libthr.so.3)
==28547==    by 0x3819FB73: ??? (in /usr/local/libexec/valgrind/memcheck-x86-freebsd)
==28547==    by 0x72B973E: sleep (in /lib/libc.so.7)
==28547==    by 0x4018F2: main (in /var/tmp/vgtest)
==28547==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==28547== 
==28547== 
==28547== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==28547==  Access not within mapped region at address 0x0
==28547==    at 0x71FFB9B: ??? (in /lib/libthr.so.3)
==28547==    by 0x71FF16F: ??? (in /lib/libthr.so.3)
==28547==    by 0x3819FB73: ??? (in /usr/local/libexec/valgrind/memcheck-x86-freebsd)
==28547==    by 0x72B973E: sleep (in /lib/libc.so.7)
==28547==    by 0x4018F2: main (in /var/tmp/vgtest)
==28547==  If you believe this happened as a result of a stack
==28547==  overflow in your program's main thread (unlikely but
==28547==  possible), you can try to increase the size of the
==28547==  main thread stack using the --main-stacksize= flag.
==28547==  The main thread stack size used in this run was 16777216.
==28547== 
==28547== HEAP SUMMARY:
==28547==     in use at exit: 724 bytes in 2 blocks
==28547==   total heap usage: 2 allocs, 0 frees, 724 bytes allocated
==28547== 
==28547== LEAK SUMMARY:
==28547==    definitely lost: 0 bytes in 0 blocks
==28547==    indirectly lost: 0 bytes in 0 blocks
==28547==      possibly lost: 0 bytes in 0 blocks
==28547==    still reachable: 724 bytes in 2 blocks
==28547==         suppressed: 0 bytes in 0 blocks
==28547== Rerun with --leak-check=full to see details of leaked memory
==28547== 
==28547== For lists of detected and suppressed errors, rerun with: -s
==28547== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 1 from 1)
Segmentation fault
$ 

EXPECTED RESULT
$ valgrind /var/tmp/vgtest
==28579== Memcheck, a memory error detector
==28579== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==28579== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==28579== Command: /var/tmp/vgtest
==28579== 
==28579== 
==28579== HEAP SUMMARY:
==28579==     in use at exit: 0 bytes in 0 blocks
==28579==   total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==28579== 
==28579== All heap blocks were freed -- no leaks are possible
==28579== 
==28579== For lists of detected and suppressed errors, rerun with: -s
==28579== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 1 from 1)

SOFTWARE/OS VERSIONS
$ uname -a
FreeBSD flap.gateway.sonic.net 13.0-RELEASE-p4 FreeBSD 13.0-RELEASE-p4 #0: Tue Aug 24 18:58:48 UTC 2021     root@amd64-builder.daemonology.net:/usr/obj/usr/src/i386.i386/sys/GENERIC  i386
$ valgrind --version
valgrind-3.18.1

(also occurs with valgrind compiled from the latest git sources, commit 3950c5d661ee09526cddcf24daf5fc22bc83f70c)

ADDITIONAL INFORMATION
Your software version fields could do with an update -- the most recent version listed is 3.15, but valgrind 3.18.1 has been released and git is at 3.19.0.

Might be related to https://github.com/paulfloyd/freebsd_valgrind/issues/137

Comment 1 Paul Floyd 2021-11-07 16:14:30 UTC

I can't reproduce with this testcase using FreeBSD 13.0 in a VirtualBox vm.

Comment 2 Nick Briggs 2021-11-07 17:56:37 UTC

I was running this on

CPU: Intel(R) Pentium(R) M processor 1.70GHz (966.40-MHz 686-class CPU)
  Origin="GenuineIntel"  Id=0x6d6  Family=0x6  Model=0xd  Stepping=6
  Features=0xafe9f9bf<FPU,VME,DE,PSE,TSC,MSR,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV,PAT,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,TM,PBE>
  Features2=0x180<EST,TM2>

I can try the VirtualBox route also and see if it's reproducible here, on different Intel hardware -- was your test case on AMD or Intel?

Comment 3 Nick Briggs 2021-11-07 18:27:29 UTC

It doesn't fail in a VirtualBox running FreeBSD 13.0 on hardware that reports out as:

CPU: Intel(R) Core(TM)2 Duo CPU     T9800  @ 2.93GHz (2918.76-MHz 686-class CPU)
  Origin="GenuineIntel"  Id=0x1067a  Family=0x6  Model=0x17  Stepping=10
  Features=0x1783fbbf<FPU,VME,DE,PSE,TSC,MSR,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x82209<SSE3,MON,SSSE3,CX16,SSE4.1>
  AMD Features2=0x1<LAHF>

Comment 4 Paul Floyd 2021-11-07 18:53:17 UTC

Looks like we are both using fairly old kit. My VM has

CPU: Intel(R) Xeon(R) CPU           W3520  @ 2.67GHz (2666.79-MHz 686-class CPU)
  Origin="GenuineIntel"  Id=0x106a5  Family=0x6  Model=0x1a  Stepping=5
  Features=0x1783fbbf<FPU,VME,DE,PSE,TSC,MSR,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x182209<SSE3,MON,SSSE3,CX16,SSE4.1,SSE4.2>
  AMD Features=0x8000000<RDTSCP>
  AMD Features2=0x1<LAHF>
  TSC: P-state invariant

Comment 5 Paul Floyd 2021-11-07 19:03:45 UTC

paulf@freebsd:~/scratch/sigreturn $ clang -o pthread_sigreturn_clang -pthread sigreturn.c
paulf@freebsd:~/scratch/sigreturn $ valgrind ./pthread_sigreturn_clang
==1910== Memcheck, a memory error detector
==1910== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==1910== Using Valgrind-3.17.0 and LibVEX; rerun with -h for copyright info
==1910== Command: ./pthread_sigreturn_clang
==1910== 
==1910== 
==1910== HEAP SUMMARY:
==1910==     in use at exit: 728 bytes in 2 blocks
==1910==   total heap usage: 2 allocs, 0 frees, 728 bytes allocated
==1910== 
==1910== LEAK SUMMARY:
==1910==    definitely lost: 0 bytes in 0 blocks
==1910==    indirectly lost: 0 bytes in 0 blocks
==1910==      possibly lost: 0 bytes in 0 blocks
==1910==    still reachable: 728 bytes in 2 blocks
==1910==         suppressed: 0 bytes in 0 blocks
==1910== Rerun with --leak-check=full to see details of leaked memory
==1910== 
==1910== For lists of detected and suppressed errors, rerun with: -s
==1910== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Time to build myself an i386 kernel

Comment 6 Paul Floyd 2021-11-08 07:27:30 UTC

paulf@freebsd:~/scratch/sigreturn $ valgrind ./pthread_sigreturn_clang
==866== Memcheck, a memory error detector
==866== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==866== Using Valgrind-3.17.0 and LibVEX; rerun with -h for copyright info
==866== Command: ./pthread_sigreturn_clang
==866== 
==866== Invalid read of size 4
==866==    at 0x720526B: ??? (in /lib/libthr.so.3)
==866==    by 0x72048BD: ??? (in /lib/libthr.so.3)
==866==    by 0x381A64F3: ??? (in /usr/local/libexec/valgrind/memcheck-x86-freebsd)
==866==    by 0x72B973E: sleep (in /lib/libc.so.7)
==866==    by 0x4018F2: main (in /usr/home/paulf/scratch/sigreturn/pthread_sigreturn_clang)
==866==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==866== 
==866== 
==866== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==866==  Access not within mapped region at address 0x0
==866==    at 0x720526B: ??? (in /lib/libthr.so.3)
==866==    by 0x72048BD: ??? (in /lib/libthr.so.3)
==866==    by 0x381A64F3: ??? (in /usr/local/libexec/valgrind/memcheck-x86-freebsd)
==866==    by 0x72B973E: sleep (in /lib/libc.so.7)
==866==    by 0x4018F2: main (in /usr/home/paulf/scratch/sigreturn/pthread_sigreturn_clang)
==866==  If you believe this happened as a result of a stack
==866==  overflow in your program's main thread (unlikely but
==866==  possible), you can try to increase the size of the
==866==  main thread stack using the --main-stacksize= flag.
==866==  The main thread stack size used in this run was 16777216.


To get this I changed the ASLR sysctls
paulf@freebsd:~/scratch/sigreturn $ sysctl -a | grep -i aslr
kern.elf32.aslr.stack_gap: 0 (default is 3)
kern.elf32.aslr.honor_sbrk: 1
kern.elf32.aslr.pie_enable: 1 (default is 0)
kern.elf32.aslr.enable: 1 (default is 0)
vm.aslr_restarts: 0

Are you using ASLR as above?

Comment 7 Nick Briggs 2021-11-08 15:46:14 UTC

No ASLR in use here -- I've got the default settings:
```
$ sysctl -a | grep aslr
kern.elf32.aslr.stack_gap: 3
kern.elf32.aslr.honor_sbrk: 1
kern.elf32.aslr.pie_enable: 0
kern.elf32.aslr.enable: 0
vm.aslr_restarts: 0
```

Comment 8 Nick Briggs 2021-11-08 15:55:40 UTC

I compiled a debug version of libthr.so.3 which produces this stack trace on failure:
```
$ valgrind ./vgtest-thr
==56575== Memcheck, a memory error detector
==56575== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==56575== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==56575== Command: ./vgtest-thr
==56575== 
==56575== Invalid read of size 4
==56575==    at 0x720526B: ??? (src/lib/libthr/thread/thr_sig.c:262)
==56575==    by 0x72048BD: ??? (src/lib/libthr/thread/thr_sig.c:246)
==56575==    by 0x3819FB73: ??? (in /usr/local/libexec/valgrind/memcheck-x86-freebsd)
==56575==    by 0x73B973E: sleep (in /lib/libc.so.7)
==56575==    by 0x4018F2: main (in /var/tmp/vgtest-thr)
==56575==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==56575== 
```
thr_sig.c:246 is within `thr_sighandler(int sig, siginfo_t *info, void *_ucp)` where it calls `handle_signal(&act, sig, info, ucp);` which immediately dies on
```
        /* add previous level mask */
        SIGSETOR(actp->sa_mask, ucp->uc_sigmask);
```
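
To make the failing statement concrete: SIGSETOR merges the signal mask saved in the interrupted context into the handler's mask, so it has to dereference `ucp`, and a null `ucp` gives exactly the read of address 0x0 reported above. The sketch below is a portable stand-in rather than the libthr code. `sigset_or` and `merge_context_mask` are hypothetical names, the loop covers only signals 1 to 31 for brevity, and the NULL check is there purely for demonstration.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <ucontext.h>

/* Stand-in for SIGSETOR: add every signal that is set in src to dst.
 * Only the classic signals 1..31 are considered in this sketch. */
static void sigset_or(sigset_t *dst, const sigset_t *src)
{
    for (int sig = 1; sig < 32; sig++)
        if (sigismember(src, sig) == 1)
            sigaddset(dst, sig);
}

/* Merge the interrupted context's mask into the handler's mask.
 * Returns -1 instead of faulting when no context was supplied. */
int merge_context_mask(sigset_t *handler_mask, const ucontext_t *ucp)
{
    assert(handler_mask != NULL);
    if (ucp == NULL)
        return -1;      /* the case that crashed under valgrind */
    sigset_or(handler_mask, &ucp->uc_sigmask);
    return 0;
}
```

The real handler performs no such check, presumably because a handler installed with SA_SIGINFO normally receives a valid context pointer.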

Comment 9 Paul Floyd 2021-11-09 21:20:43 UTC

I'm not sure why, but I only intermittently get VTALRM signals when running under gdb.

thr_sig.c:246 is in 
static void
thr_sighandler(int sig, siginfo_t *info, void *_ucp)

Looking at the Valgrind source, this is "called" here (in m_signals.c)


      /* Create a signal delivery frame, and set the client's %ESP and
	 %EIP so that when execution continues, we will enter the
	 signal handler with the frame on top of the client's stack,
	 as it expects.

	 Signal delivery can fail if the client stack is too small or
	 missing, and we can't push the frame.  If that happens,
	 push_signal_frame will cause the whole process to exit when
	 we next hit the scheduler.
      */
      vg_assert(VG_(is_valid_tid)(tid));

      push_signal_frame ( tid, info, uc );

So rather than calling the signal handler, Valgrind is directly manipulating the client stack and then just resuming execution.

In gdb, I can get some useful debugging done with the separate debuginfo:

(gdb) set directories /usr/src/lib/libthr/thread
and
(gdb) add-symbol-file /usr/lib/debug/lib/libthr.so.3.debug 

(they don't seem to be searched for by default on my amd64 system)

Here's the trampoline for the syscall to sigreturn

│  > 0xffffe194      lea    0x20(%esp),%eax
│    0xffffe198      push   %eax
│    0xffffe199      mov    $0x1a1,%eax
│    0xffffe19e      push   %eax
│    0xffffe19f      int    $0x80


I'm a bit confused as to why the above (calling syscall 417, sigreturn, which is the 0x1a1 loaded into %eax) then calls thr_sighandler. I would have thought that the handler would be called first and then sigreturn after the user code. Maybe it's used for both.

I see that the arguments are passed to thr_sighandler in ebx/edi/esi

Something similar can be seen in i386/i386/sigtramp.s

/*
 * Signal trampoline, copied to top of user stack
 */
NON_GPROF_ENTRY(sigcode)
        calll   *SIGF_HANDLER(%esp)
        leal    SIGF_UC(%esp),%eax      /* get ucontext */
        pushl   %eax 
        testl   $PSL_VM,UC_EFLAGS(%eax)
        jne     1f   
        mov     UC_GS(%eax),%gs         /* restore %gs */
1:
        movl    $SYS_sigreturn,%eax
        pushl   %eax                    /* junk to fake return addr. */
        int     $0x80                   /* enter kernel with args */
                                        /* on stack */

where

ASSYM(SIGF_UC, offsetof(struct sigframe, sf_uc));

SIGF_UC will be 20 (0x14)

For comparison in Valgrind

        lea     0x1c(%esp), %eax        /* args to sigreturn(ucontext_t *) */
        pushl   %eax
        pushl   %eax                    /* fake return addr */
        movl    $__NR_fake_sigreturn, %eax
        int     $0x80

The Valgrind version of 'struct sigframe' has an extra Address member at the beginning.

That gives us an offset of 24 (0x18), but there is also all of the fiddling with esp that I do in build_sigframe which is where another 4 gets added [28, 0x1c]. I think.
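
The offset arithmetic in this comment can be checked with a small model. This is a sketch under assumptions: every slot is taken to be 4 bytes wide (as on i386), the field names are made up for illustration, and any stricter alignment that the real ucontext_t may require is ignored. It reproduces the figures quoted above rather than measuring the actual kernel or Valgrind structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder for the saved context; only its position matters. */
struct model_uc { uint32_t words[4]; };

/* Assumed kernel-style frame: five 4-byte slots, then the context. */
struct model_kernel_frame {
    uint32_t signum;
    uint32_t siginfo_ptr;
    uint32_t ucontext_ptr;
    uint32_t addr;
    uint32_t handler;
    struct model_uc uc;
};

/* Assumed Valgrind-style frame: one extra address slot at the front. */
struct model_valgrind_frame {
    uint32_t ret_addr;
    uint32_t signum;
    uint32_t siginfo_ptr;
    uint32_t ucontext_ptr;
    uint32_t addr;
    uint32_t handler;
    struct model_uc uc;
};

/* Offset used by the trampoline: the context offset plus the extra
 * 4 bytes that the comment attributes to the esp adjustment. */
unsigned valgrind_sigreturn_arg_offset(void)
{
    assert(offsetof(struct model_valgrind_frame, uc) ==
           offsetof(struct model_kernel_frame, uc) + 4);
    return (unsigned)offsetof(struct model_valgrind_frame, uc) + 4u;
}
```

In this model the context sits at offset 20 in the kernel-style frame and 24 in the Valgrind-style frame, and `valgrind_sigreturn_arg_offset()` yields 28, which is the 0x1c used in the `lea` of the Valgrind trampoline.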

Comment 10 Paul Floyd 2021-11-09 21:21:38 UTC

And now I think that I have a fix, at least for the crash

diff --git a/coregrind/m_sigframe/sigframe-x86-freebsd.c b/coregrind/m_sigframe/sigframe-x86-freebsd.c
index a1d8638e5..12f51e385 100644
--- a/coregrind/m_sigframe/sigframe-x86-freebsd.c
+++ b/coregrind/m_sigframe/sigframe-x86-freebsd.c
@@ -304,6 +304,8 @@ static Addr build_sigframe(ThreadState *tst,
       err = 0;
    }
 
+   frame->puContext =  (Addr)&frame->uContext;
+
    synth_ucontext(tst->tid, siginfo, trapno, err, mask,
                   &frame->uContext, &frame->fpstate);
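
The effect of the added line can be illustrated with a toy model of a signal frame. This is not the Valgrind code: `model_frame`, `model_build_frame` and `model_handler_ucp` are hypothetical names, and the model assumes the frame memory starts out zeroed, which is consistent with the fault address of 0x0 but is not confirmed anywhere in this report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder for the saved user context. */
struct model_ucontext { uint32_t sigmask[4]; };

/* Toy frame: argument slots followed by the context they refer to. */
struct model_frame {
    uintptr_t ret_addr;
    int       sig_no;
    uintptr_t p_siginfo;
    uintptr_t p_ucontext;           /* third argument seen by the handler */
    struct model_ucontext ucontext;
};

/* Build a frame; apply_fix selects whether the context pointer slot
 * is filled in, mirroring the line added by the diff. */
void model_build_frame(struct model_frame *f, int sig, int apply_fix)
{
    assert(f != NULL);
    memset(f, 0, sizeof *f);        /* assumption: frame starts zeroed */
    f->sig_no = sig;
    if (apply_fix)
        f->p_ucontext = (uintptr_t)&f->ucontext;
}

/* What a handler would receive as its ucontext pointer. */
const struct model_ucontext *model_handler_ucp(const struct model_frame *f)
{
    return (const struct model_ucontext *)f->p_ucontext;
}
```

Without the fix the handler in this model receives a null context pointer; with it, the pointer refers to the context embedded in the same frame.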

Comment 11 Paul Floyd 2021-11-09 22:23:52 UTC

Fixed with

commit 4b8eddfde14291e288a7017edce5c7225e1533d6
Author: Paul Floyd <pjfloyd@wanadoo.fr>
Date:   Tue Nov 9 23:11:15 2021 +0100

However I do now get similar kernel messages to those mentioned in the linked github issue. That's another problem though.

Comment 12 Nick Briggs 2021-11-10 00:31:14 UTC

I've rebuilt based on the latest git version, and concur that the SIGSEGV is fixed.

Now it just offers up a continuous stream of "sigreturn eflags..."

% grep memcheck-x86 /tmp/out | sort | uniq -c
  93 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200000
  77 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200004
  46 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200010
   4 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200011
  31 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200014
  43 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200044
   4 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200055
 224 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200080
  12 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200081
 240 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200084
  52 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200085
  20 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200090
  12 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200091
  20 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200094
  82 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200095
  70 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200801
 128 pid 68162 (memcheck-x86-freebs): sigreturn eflags = 0x200805

Comment 13 Paul Floyd 2021-11-10 09:11:27 UTC

The kernel messages are because now it 'works as unintended'.

It's a bit annoying as this is the kernel writing to the console so you can't redirect it to /dev/null. It's also a bit delicate because if the eflags test gets fixed then the following CS test is do-or-die and will cause a SIGBUS if it fails rather than just failing with a kernel message.

Comment 14 Nick Briggs 2021-11-10 16:10:12 UTC

(In reply to Paul Floyd from comment #13)
> The kernel messages are because now it 'works as unintended'.
> 
> It's a bit annoying as this is the kernel writing to the console so you
> can't redirect it to /dev/null. It's also a bit delicate because if the
> eflags test gets fixed then the following CS test is do-or-die and will
> cause a SIGBUS if it fails rather than just failing with a kernel message.

Perhaps I should put in a request for a FreeBSD kernel option...
  sysctl kern.eflags.stfu=1
so one could acknowledge that one knows, but the kernel should stop whinging about it.  ;-)