Bug 82114 - valgrind locks up in regtest after assert: vgPlain_ksigismember(&uc->uc_sigmask, sigNo)
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: general
Version: unspecified
Platform: Compiled Sources Linux
Importance: NOR crash
Target Milestone: ---
Assignee: Tom Hughes
URL:
Keywords:
Duplicates: 78675 81988 91131 92559
Depends on:
Blocks:
 
Reported: 2004-05-24 19:33 UTC by Mike Simons
Modified: 2005-02-23 01:04 UTC

See Also:
Latest Commit:
Version Fixed In:


Attachments
Sample of fdleak_cmsg with -v. (5.34 KB, text/plain)
2004-05-24 19:36 UTC, Mike Simons
Patch to make VG_(gettid) work (806 bytes, patch)
2004-07-19 15:34 UTC, Tom Hughes
Sample Thread code to see getpid results on AS 2.1 (922 bytes, text/plain)
2004-07-21 19:57 UTC, Mike Simons
Patch to make VG_(gettid) work (694 bytes, patch)
2004-08-31 21:01 UTC, Tom Hughes
daemon.c (1.14 KB, text/plain)
2005-02-21 07:54 UTC, smile
valgrind_logfile (4.78 KB, text/plain)
2005-02-21 07:56 UTC, smile

Description Mike Simons 2004-05-24 19:33:26 UTC
Version:            (using KDE Devel)
Installed from:    Compiled sources
Compiler:          gcc-2.96-118.7.2.i386 
OS:                Linux

Using RHAS 2.1, on ia32, with 2004-05-24 snapshot of Valgrind.
                                                                                                                                                                        
Several of the regtest scripts cause valgrind to lock up
after reporting an assert similar to the following:
===
proxy 13480 for tid 1 exited status -1, res 0
got signal 17 in LWP 13473 (13473)
                                                                                                                                                                        
valgrind: vg_signals.c:2015 (vg_async_signalhandler): Assertion `vgPlain_ksigismember(&uc->uc_sigmask, sigNo)' failed.
==13473==    at 0xB802DE80: vgPlain_skin_assert_fail (vg_mylibc.c:1211)
                                                                                                                                                                        
sched status:
                                                                                                                                                                        
Thread 1: status = Runnable, associated_mx = 0x0, associated_cv = 0x0
==13473==    at 0x810D8959: wait4 (in /lib/libc-2.2.4.so)
==13473==    by 0x8048E82: main (fdleak_cmsg.c:176)
===
                                                                                                                                                                        
^C, ^Z, kill, kill -SEGV, on the valgrind processes seem to
have no effect... fortunately there is kill -9.  :)
                                                                                                                                                                        
                                                                                                                                                                        
The scripts that get stuck are:
===
fdleak_cmsg fdleak_ipv4 sigaltstack coolo_sigaction fork
susphello syscall-restart1 syscall-restart2 system
===
                                                                                                                                                                        
                                                                                                                                                                        
I have the following package/versions installed
===
msimons@furball:~/cvs> rpml glibc gcc binutils
gcc-2.96-118.7.2.i386
glibc-2.2.4-32.11.i686
binutils-2.11.90.0.8-12.i386
===


The regtest summary is below.  Tests that I had to manually
kill -9 to get the next one to run are flagged with a '*'.
===
== 158 tests, 23 stderr failures, 5 stdout failures =================
cachegrind/tests/chdir                   (stderr)
cachegrind/tests/dlclose                 (stderr)
cachegrind/tests/fpu-28-108              (stderr)
cachegrind/tests/insn_basic              (stderr)
cachegrind/tests/insn_cmov               (stderr)
cachegrind/tests/insn_fpu                (stderr)
cachegrind/tests/insn_mmx                (stderr)
cachegrind/tests/insn_sse                (stderr)
cachegrind/tests/insn_sse2               (stderr)
corecheck/tests/as_mmap                  (stderr)
corecheck/tests/fdleak_cmsg            * (stderr)
corecheck/tests/fdleak_ipv4            * (stderr)
corecheck/tests/fdleak_ipv4            * (stdout)
corecheck/tests/pth_atfork1              (stderr)
corecheck/tests/pth_atfork1              (stdout)
memcheck/tests/badfree-2trace            (stderr)
memcheck/tests/new_nothrow               (stderr)
memcheck/tests/sigaltstack             * (stderr)
memcheck/tests/zeropage                  (stderr)
none/tests/coolo_sigaction             * (stderr)
none/tests/coolo_sigaction             * (stdout)
none/tests/fork                        * (stderr)
none/tests/fork                        * (stdout)
none/tests/susphello                   * (stderr)
none/tests/susphello                   * (stdout)
none/tests/syscall-restart1            * (stderr)
none/tests/syscall-restart2            * (stderr)
none/tests/system                      * (stderr)
===
Comment 1 Mike Simons 2004-05-24 19:36:12 UTC
Created attachment 6102 [details]
Sample of fdleak_cmsg with -v.

msimons@furball:~/cvs/latest/valgrind/corecheck/tests> \
  /home/msimons/cvs/latest/valgrind/.in_place/valgrind -v --tool=corecheck \
    --track-fds=yes ./fdleak_cmsg
Comment 2 Mike Simons 2004-05-24 19:39:53 UTC
I also tried backing out to earlier CVS tags: VALGRIND_2_1_1, then VALGRIND_2_1_0,
then VALGRIND_2_0_0.  All of these reproduce the problem.

VALGRIND_1_9_5 does not reproduce the bug.  As far as
I can tell, none of the programs in the test suite there locked up while running
(but the regtest make target didn't exist back then and there appear to be
many differences in the test suite).

The kernel version involved is...

msimons@furball:~> uname -a
Linux furball 2.4.9-e.25enterprise #1 SMP Fri Jun 6 17:55:13 EDT 2003 i686 unknown
Comment 3 Nicholas Nethercote 2004-07-17 15:29:06 UTC
*** Bug 78675 has been marked as a duplicate of this bug. ***
Comment 4 Tom Hughes 2004-07-19 14:19:56 UTC
I've actually managed to reproduce this now as it turned up on one of our boxes...

The problem is that certain kernels (a 2.4.7-10 RedHat 7.2 kernel in my case) have a getpid() system call that returns the same value for all threads but also have no gettid() system call. As a result the assertions in vg_async_signalhandler are effectively bogus as VG_(gettid) will always be VG_(main_pid).

All I have to decide now is what to do about it - just commenting out the sanity checks in vg_async_signalhandler will make it work, but that isn't very nice.
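
For readers unfamiliar with the check that fails here: the assertion tests whether the delivered signal is a member of the signal mask saved in the handler's ucontext. The following is a minimal user-space sketch of reading that saved mask. It is not Valgrind's code; probe_saved_mask and record_mask are hypothetical names, and it assumes Linux with glibc.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <ucontext.h>

/* Set by the handler: 1 if the signal was in the interrupted context's
 * mask, 0 if it was not, -1 if the handler has not run yet. */
static volatile sig_atomic_t sig_was_blocked = -1;

/* Hypothetical handler: records whether signo is in the saved mask. */
static void record_mask(int signo, siginfo_t *info, void *ctx)
{
    ucontext_t *uc = (ucontext_t *)ctx;
    (void)info;
    /* uc_sigmask is the mask that will be restored when the handler returns,
     * i.e. the mask that was in effect when the signal arrived. */
    sig_was_blocked = sigismember(&uc->uc_sigmask, signo);
}

/* Deliver signo to the calling thread with the signal unblocked and report
 * what the handler saw. Returns -1 if the handler could not be installed. */
int probe_saved_mask(int signo)
{
    struct sigaction sa, old;
    sigset_t set, oldset;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = record_mask;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(signo, &sa, &old) != 0)
        return -1;

    /* Make sure the signal is deliverable, then send it to ourselves. */
    sigemptyset(&set);
    sigaddset(&set, signo);
    sigprocmask(SIG_UNBLOCK, &set, &oldset);
    sig_was_blocked = -1;
    raise(signo);

    /* Restore the previous mask and disposition. */
    sigprocmask(SIG_SETMASK, &oldset, NULL);
    sigaction(signo, &old, NULL);
    return sig_was_blocked;
}
```

In an ordinary single-threaded process that has the signal unblocked, probe_saved_mask(SIGUSR1) returns 0. The failing assertion expects the equivalent test to be true whenever VG_(gettid)() equals VG_(main_pid), which is why a VG_(gettid) that always returns the main pid trips it.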
Comment 5 Tom Hughes 2004-07-19 15:34:23 UTC
Created attachment 6733 [details]
Patch to make VG_(gettid) work

This (fairly horrible) patch hopefully makes VG_(gettid) work on systems where
the getpid() system call really does return the PID but there is no gettid()
system call.

The patch works by falling back to a readlink() on /proc/self if gettid()
returns ENOSYS instead of falling back to getpid().
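
The fallback described above can be sketched in ordinary user-space C. This is only an illustration under stated assumptions, not the attached patch: it uses the libc syscall() and readlink() wrappers rather than Valgrind's internal VG_(do_syscall), and the function names id_from_proc_self and tid_with_fallback are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Read the target of the /proc/self symlink and parse it as a decimal ID.
 * Returns -1 if the link cannot be read. */
long id_from_proc_self(void)
{
    char buf[16];
    /* Leave room for the terminator: readlink() does not add one. */
    ssize_t n = readlink("/proc/self", buf, sizeof(buf) - 1);

    if (n < 0)
        return -1;
    buf[n] = '\0';
    return strtol(buf, NULL, 10);
}

/* Try gettid first; only if the kernel reports ENOSYS (no such system
 * call) fall back to /proc/self instead of getpid. */
long tid_with_fallback(void)
{
#ifdef SYS_gettid
    long ret = syscall(SYS_gettid);

    if (ret >= 0 || errno != ENOSYS)
        return ret;
#endif
    return id_from_proc_self();
}
```

Calling tid_with_fallback() from the main thread should give the same value as getpid(). Note that the fallback branch only distinguishes threads on kernels of the kind discussed in this report, where each thread has its own /proc entry; on current kernels /proc/self is understood to resolve to the thread group ID.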
Comment 6 Tom Hughes 2004-07-20 16:23:49 UTC
*** Bug 81988 has been marked as a duplicate of this bug. ***
Comment 7 Mike Simons 2004-07-21 19:54:56 UTC
Script started on Wed Jul 21 10:38:28 2004
msimons@furball:~/sample/thread> ./test
child 7 running self     8194, pid    16022
child 6 running self    16387, pid    16023
child 5 running self    24580, pid    16024
child 4 running self    32773, pid    16026
child 3 running self    40966, pid    16027
child 2 running self    49159, pid    16028
parent running
child 1 running self    57352, pid    16029
child 0 running self    65545, pid    16030
msimons@furball:~/sample/thread> ~/local/valgrind-5/bin/valgrind --tool=none ./test
==16419== Nulgrind, a binary JIT-compiler for x86-linux.
==16419== Copyright (C) 2002-2004, and GNU GPL'd, by Nicholas Nethercote.
==16419== Using valgrind-2.1.2.CVS, a program supervision framework for x86-linux.
==16419== Copyright (C) 2000-2004, and GNU GPL'd, by Julian Seward.
==16419== For more details, rerun with: -v
==16419==
parent running
child 7 running self        2, pid    16419
child 6 running self        3, pid    16419
child 5 running self        4, pid    16419
child 4 running self        5, pid    16419
child 3 running self        6, pid    16419
child 2 running self        7, pid    16419
child 1 running self        8, pid    16419
child 0 running self        9, pid    16419
==16419==
                                                                                                                                                                        
Script done on Wed Jul 21 10:53:35 2004
Comment 8 Mike Simons 2004-07-21 19:57:53 UTC
Created attachment 6767 [details]
Sample Thread code to see getpid results on AS 2.1
Comment 9 Mike Simons 2004-07-21 20:00:08 UTC
Tom,
                                                                                                                                                                        
Could you provide me with a little sample program to see what you mean?
                                                                                                                                                                        
                                                                                                                                                                        
My initial reaction to the idea that getpid returns the same value is
"no way"...  I know that threads on this version of Linux all get their
own pid and getpid returns that different pid.  In AS 3 there is a new
thread model called NPTL which fixes many of the differences between
LinuxThreads and true POSIX threads ... (like all threads should get
the same pid when getpid is called).
                                                                                                                                                                        
I have a little test case which creates a bunch of threads, which I'll
attach to this ticket.  The test case calls getpid inside each thread and
runs fine inside valgrind, but interestingly the test shows that
each thread gets the same pid when run inside valgrind on AS 2.1.
                                                                                                                                                                        
I'll attach the code/output to the valgrind ticket.
                                                                                                                                                                        
                                                                                                                                                                        
I've not tried to apply your patch yet, to see if it fixes all the
test cases that were failing.
                                                                                                                                                                        
Thanks,
  Mike Simons
Comment 10 Mike Simons 2004-07-21 21:40:23 UTC
On AS 2.1, the VALGRIND_2_1_2 tag of valgrind locks up in the
following regtest tests:
===
fdleak_cmsg
fdleak_ipv4
pth_atfork1
sigaltstack
coolo_sigaction
fork
susphello
syscall-restart1
syscall-restart2
system
===
                                                                                                                                                         
On AS 3, the VALGRIND_2_1_2 tag of valgrind has no lockup
problems running regtest.

I'll be applying the "patch" and trying AS 2.1 again.
Comment 11 Mike Simons 2004-07-21 21:49:11 UTC
Results of the "patch" are ... no lockups in the entire test suite
on AS 2.1.

Here are the test results.

== 173 tests, 13 stderr failures, 0 stdout failures =================
cachegrind/tests/chdir                   (stderr)
cachegrind/tests/dlclose                 (stderr)
cachegrind/tests/fpu-28-108              (stderr)
cachegrind/tests/insn_basic              (stderr)
cachegrind/tests/insn_cmov               (stderr)
cachegrind/tests/insn_fpu                (stderr)
cachegrind/tests/insn_mmx                (stderr)
cachegrind/tests/insn_sse                (stderr)
cachegrind/tests/insn_sse2               (stderr)
corecheck/tests/as_mmap                  (stderr)
memcheck/tests/new_nothrow               (stderr)
memcheck/tests/zeropage                  (stderr)
none/tests/syscall-restart1              (stderr)
Comment 12 Mike Simons 2004-07-21 23:07:32 UTC
coregrind/vg_signals.c
   2016    if (VG_(gettid)() == VG_(main_pid)) {
   2017       VG_(printf)("got signal %d in LWP %d (%d)\n",
   2018       sigNo, VG_(gettid)(), VG_(gettid)(), VG_(main_pid));
   2019       vg_assert(VG_(ksigismember)(&uc->uc_sigmask, sigNo));
   2020    }

I think there is a bug on line 2018: "VG_(gettid)()"
should only be used once, since there are only 3 '%d'
formats but 4 arguments to printf.
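
A sketch of the message with one argument per '%d' is shown here, using standard snprintf in place of VG_(printf) so it can be compiled on its own. The helper name format_signal_msg is hypothetical and this is not the change that was committed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format the diagnostic with exactly three arguments for three '%d'
 * conversions: signal number, current thread ID, main pid.
 * Returns what snprintf returns (the length the full message needs). */
int format_signal_msg(char *buf, size_t size, int sig_no, int tid, int main_pid)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size, "got signal %d in LWP %d (%d)\n",
                    sig_no, tid, main_pid);
}
```

With the values from the original report (signal 17, thread ID 13473, main pid 13473) this yields "got signal 17 in LWP 13473 (13473)", matching the line seen in the log.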


I've also confirmed that AS 2.1 does not have a gettid
function... so I guess it's calling getpid under the covers.
Comment 13 Tom Hughes 2004-07-22 00:43:40 UTC
You're quite right that traditionally getpid() has returned a different value for each thread on Linux. That is not what POSIX says it should do, however, nor what it does on any other Unix system.

Hence recent kernels have changed that behaviour so that it returns the same value in each thread. At the same time (more or less) a new gettid() system call was introduced that does what getpid() used to do and returns the actual ID of the thread that calls it.

This isn't directly related to the LinuxThreads->NPTL transition, although it was probably done to allow NPTL to be more POSIX compliant, and it may be that LinuxThreads will override the getpid() function to give it the old behaviour - I'd have to try it to see.

The problem however is that it seems there are some kernels - though possibly only some RedHat ones with some patches - which have the getpid() change but don't yet have the gettid() call.

By the way, testing what the calls return under valgrind won't tell you much, as valgrind is likely to emulate them.

By the sounds of things my patch does fix your system though, which is the main point.
Comment 14 Tom Hughes 2004-08-31 21:01:50 UTC
Created attachment 7373 [details]
Patch to make VG_(gettid) work

I've just realised I posted the wrong patch before... This is what I had
intended to post as the patch for this problem.
Comment 15 Tom Hughes 2004-10-12 20:05:50 UTC
*** Bug 91131 has been marked as a duplicate of this bug. ***
Comment 16 Tom Hughes 2004-10-16 12:46:07 UTC
CVS commit by thughes: 

It seems there are some kernels around where the getpid system call has
been changed to return the ID of the thread group leader but which do not
have a gettid system call.

This breaks VG_(gettid) which assumes that the getpid system call will
give the thread ID if the gettid system call does not exist.

The (horrible) solution is to use readlink to see where /proc/self points
when the gettid system call fails.

BUG: 82114


  M +21 -2     vg_mylibc.c   1.93


--- valgrind/coregrind/vg_mylibc.c  #1.92:1.93
@@ -233,6 +233,25 @@ Int VG_(gettid)(void)
    ret = VG_(do_syscall)(__NR_gettid);
 
-   if (ret == -VKI_ENOSYS)
-      ret = VG_(do_syscall)(__NR_getpid);
+   if (ret == -VKI_ENOSYS) {
+      Char pid[16];
+      
+      /*
+       * The gettid system call does not exist. The obvious assumption
+       * to make at this point would be that we are running on an older
+       * system where the getpid system call actually returns the ID of
+       * the current thread.
+       *
+       * Unfortunately it seems that there are some systems with a kernel
+       * where getpid has been changed to return the ID of the thread group
+       * leader but where the gettid system call has not yet been added.
+       *
+       * So instead of calling getpid here we use readlink to see where
+       * the /proc/self link is pointing...
+       */
+      if ((ret = VG_(do_syscall)(__NR_readlink, "/proc/self", pid, sizeof(pid))) >= 0) {
+         pid[ret] = '\0';
+         ret = VG_(atoll)(pid);
+      }
+   }
 
    return ret;


Comment 17 Tom Hughes 2004-11-02 08:14:37 UTC
*** Bug 92559 has been marked as a duplicate of this bug. ***
Comment 18 smile 2005-02-20 13:20:45 UTC
       *                                        The obvious assumption
       * to make at this point would be that we are running on an older
       * system where the getpid system call actually returns the ID of
       * the current thread. 

       * Unfortunately it seems that there are some systems with a kernel
       * where getpid has been changed to return the ID of the thread group
       * leader but where the gettid system call has not yet been added.


The fix caters only to "some" machines where getpid returns the same value in every thread and gettid is not available. What happens on the old ones where getpid returns different numbers?

Even without the patch, using Mike Simons's sample code, I find:

without valgrind

child 7 running self     8194, pid    20082
child 6 running self    16387, pid    20083
child 5 running self    24580, pid    20084
child 4 running self    32773, pid    20085
child 3 running self    40966, pid    20086
child 2 running self    49159, pid    20087
child 1 running self    57352, pid    20088
parent running
child 0 running self    65545, pid    20089



with valgrind

parent running
child 7 running self        2, pid    13300
child 6 running self        3, pid    13300
child 5 running self        4, pid    13300
child 4 running self        5, pid    13300
child 3 running self        6, pid    13300
child 2 running self        7, pid    13300
child 1 running self        8, pid    13300
child 0 running self        9, pid    13300
got signal 2 in LWP 13300 (13300)

valgrind: vg_signals.c:2024 (vg_async_signalhandler): Assertion `vgPlain_ksigismember(&uc->uc_sigmask, sigNo)' failed.


I am on:

Linux version 2.4.9-e.57 (bhcompile@tweety.build.redhat.com) (gcc version 2.96 20000731 (Red Hat Linux 7.2 2.96-129.7.2)) #1 Thu Dec 2 20:56:19 EST 2004
Comment 19 smile 2005-02-21 07:54:29 UTC
Created attachment 9749 [details]
daemon.c
Comment 20 smile 2005-02-21 07:56:21 UTC
Created attachment 9750 [details]
valgrind_logfile
Comment 21 smile 2005-02-21 08:09:12 UTC
Here I have tried to explain the symptoms I noticed on a machine that does not have gettid and has only getpid to return unique IDs for each thread.

The attached daemon.c creates a daemon process. I ran it on this machine:
Linux version 2.4.9-e.57 (bhcompile@tweety.build.redhat.com) (gcc version 2.96 20000731 (Red Hat Linux 7.2 2.96-129.7.2)) #1 Thu Dec 2 20:56:19 EST 2004


without valgrind it does create one:
------------------------------------

./itsadaemon 

This is a Father 1 10168

This is a father 2 10169
[aime2@stamr02 temp]$ Daemon :
 PID 10170      PPID 1

with valgrind 
-------------

$ ../bin/valgrind --tool=addrcheck ./itsadaemon 

This is a Father 1 9869

This is a father 2 9871
[aime2@stamr02 temp]$ Daemon :
 PID 9873       PPID 0

Attachment 9750 is the valgrind logfile from this run.
The following two lines are found in the logfile:
proxy 9872 for tid 1 exited status -1, res 0
proxy 9870 for tid 1 exited status -1, res 0

some other time
---------------
$ ../bin/valgrind --tool=addrcheck ./itsadaemon            
This is a Father 1 9808

This is a father 2 9810
[aime2@stamr02 temp]$ Daemon :
 PID 9812       PPID 6450286


This was done on the Valgrind 2.1.2 release with the patch suggested earlier
(attachment 7373) and with --trace-children=yes.

The machine crashes most of the time when I try to kill the valgrind process.


Comment 22 Tom Hughes 2005-02-23 01:04:25 UTC
The patch is still valid for systems where getpid returns different values, as reading the /proc/self link also gives different values on such systems.

What you are looking at is internal valgrind code - it has no effect on what the getpid and/or gettid system calls return in the program being run under valgrind. You are correct that in current releases all threads in the client program will appear to have the same pid because they all run in one kernel thread.

This is fixed in the current CVS code where each thread runs in a separate kernel thread.