Version: (using KDE Devel)
Installed from: Compiled sources
Compiler: gcc-2.96-118.7.2.i386
OS: Linux

Using RHAS 2.1, on ia32, with the 2004-05-24 snapshot of Valgrind.

Several of the regtest scripts cause valgrind to lock up after reporting an assert similar to the following:
===
proxy 13480 for tid 1 exited status -1, res 0
got signal 17 in LWP 13473 (13473)

valgrind: vg_signals.c:2015 (vg_async_signalhandler): Assertion `vgPlain_ksigismember(&uc->uc_sigmask, sigNo)' failed.
==13473==    at 0xB802DE80: vgPlain_skin_assert_fail (vg_mylibc.c:1211)

sched status:

Thread 1: status = Runnable, associated_mx = 0x0, associated_cv = 0x0
==13473==    at 0x810D8959: wait4 (in /lib/libc-2.2.4.so)
==13473==    by 0x8048E82: main (fdleak_cmsg.c:176)
===

^C, ^Z, kill, and kill -SEGV on the valgrind processes seem to have no effect... fortunately there is kill -9. :)

The scripts that get stuck are:
===
fdleak_cmsg
fdleak_ipv4
sigaltstack
coolo_sigaction
fork
susphello
syscall-restart1
syscall-restart2
system
===

I have the following packages/versions installed:
===
msimons@furball:~/cvs> rpml glibc gcc binutils
gcc-2.96-118.7.2.i386
glibc-2.2.4-32.11.i686
binutils-2.11.90.0.8-12.i386
===

The regtest summary is below. Tests that I had to manually kill -9 to get the next one to run are flagged with a '*'.
===
== 158 tests, 23 stderr failures, 5 stdout failures =================
cachegrind/tests/chdir (stderr)
cachegrind/tests/dlclose (stderr)
cachegrind/tests/fpu-28-108 (stderr)
cachegrind/tests/insn_basic (stderr)
cachegrind/tests/insn_cmov (stderr)
cachegrind/tests/insn_fpu (stderr)
cachegrind/tests/insn_mmx (stderr)
cachegrind/tests/insn_sse (stderr)
cachegrind/tests/insn_sse2 (stderr)
corecheck/tests/as_mmap (stderr)
corecheck/tests/fdleak_cmsg * (stderr)
corecheck/tests/fdleak_ipv4 * (stderr)
corecheck/tests/fdleak_ipv4 * (stdout)
corecheck/tests/pth_atfork1 (stderr)
corecheck/tests/pth_atfork1 (stdout)
memcheck/tests/badfree-2trace (stderr)
memcheck/tests/new_nothrow (stderr)
memcheck/tests/sigaltstack * (stderr)
memcheck/tests/zeropage (stderr)
none/tests/coolo_sigaction * (stderr)
none/tests/coolo_sigaction * (stdout)
none/tests/fork * (stderr)
none/tests/fork * (stdout)
none/tests/susphello * (stderr)
none/tests/susphello * (stdout)
none/tests/syscall-restart1 * (stderr)
none/tests/syscall-restart2 * (stderr)
none/tests/system * (stderr)
===
Created attachment 6102 [details]
Sample of fdleak_cmsg with -v.

msimons@furball:~/cvs/latest/valgrind/corecheck/tests> \
/home/msimons/cvs/latest/valgrind/.in_place/valgrind -v --tool=corecheck \
--track-fds=yes ./fdleak_cmsg
I also tried backing out CVS versions, to VALGRIND_2_1_1, then VALGRIND_2_1_0, then VALGRIND_2_0_0. These all replicate the problem.

VALGRIND_1_9_5 does not replicate the bug. As far as I can tell, none of the programs in the test suite there locked up while running (but the regtest make target didn't exist back then, and there appear to be many differences in the test suite).

The kernel version involved is...
msimons@furball:~> uname -a
Linux furball 2.4.9-e.25enterprise #1 SMP Fri Jun 6 17:55:13 EDT 2003 i686 unknown
*** Bug 78675 has been marked as a duplicate of this bug. ***
I've actually managed to reproduce this now as it turned up on one of our boxes... The problem is that certain kernels (a 2.4.7-10 RedHat 7.2 kernel in my case) have a getpid() system call that returns the same value for all threads but also have no gettid() system call. As a result the assertions in vg_async_signalhandler are effectively bogus as VG_(gettid) will always be VG_(main_pid). All I have to decide now is what to do about it - just commenting out the sanity checks in vg_async_signalhandler will make it work, but that isn't very nice.
Created attachment 6733 [details]
Patch to make VG_(gettid) work

This (fairly horrible) patch hopefully makes VG_(gettid) work on systems where the getpid() system call really does return the PID but there is no gettid() system call. The patch works by falling back to a readlink() on /proc/self if gettid() returns ENOSYS, instead of falling back to getpid().
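For readers who want to try the idea outside of valgrind, here is a minimal user-space sketch of the same fallback. It is not the attached patch: it uses the ordinary libc syscall() and readlink() wrappers rather than valgrind's internal VG_(do_syscall), and the function names tid_from_proc_self and current_tid are made up for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: parse the numeric target of the /proc/self symlink.
 * Returns -1 if the link cannot be read. */
long tid_from_proc_self(void)
{
    char buf[32];
    /* readlink() does not NUL-terminate, so keep one byte for the terminator. */
    ssize_t n = readlink("/proc/self", buf, sizeof(buf) - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return strtol(buf, NULL, 10);
}

/* Hypothetical helper: prefer the gettid system call and fall back to
 * /proc/self only when the kernel reports that gettid does not exist. */
long current_tid(void)
{
#ifdef SYS_gettid
    long ret = syscall(SYS_gettid);
    if (ret >= 0)
        return ret;
    if (errno != ENOSYS)
        return -1;
#endif
    return tid_from_proc_self();
}
```

Calling current_tid() from the main thread of a process should give the same number as getpid(). What /proc/self resolves to in other threads depends on the kernel, which is the point of the fallback on the affected systems.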
*** Bug 81988 has been marked as a duplicate of this bug. ***
Script started on Wed Jul 21 10:38:28 2004
msimons@furball:~/sample/thread> ./test
child 7 running self 8194, pid 16022
child 6 running self 16387, pid 16023
child 5 running self 24580, pid 16024
child 4 running self 32773, pid 16026
child 3 running self 40966, pid 16027
child 2 running self 49159, pid 16028
parent running
child 1 running self 57352, pid 16029
child 0 running self 65545, pid 16030
msimons@furball:~/sample/thread> ~/local/valgrind-5/bin/valgrind --tool=none ./test
==16419== Nulgrind, a binary JIT-compiler for x86-linux.
==16419== Copyright (C) 2002-2004, and GNU GPL'd, by Nicholas Nethercote.
==16419== Using valgrind-2.1.2.CVS, a program supervision framework for x86-linux.
==16419== Copyright (C) 2000-2004, and GNU GPL'd, by Julian Seward.
==16419== For more details, rerun with: -v
==16419==
parent running
child 7 running self 2, pid 16419
child 6 running self 3, pid 16419
child 5 running self 4, pid 16419
child 4 running self 5, pid 16419
child 3 running self 6, pid 16419
child 2 running self 7, pid 16419
child 1 running self 8, pid 16419
child 0 running self 9, pid 16419
==16419==
Script done on Wed Jul 21 10:53:35 2004
Created attachment 6767 [details] Sample Thread code to see getpid results on AS 2.1
Tom,

Could you provide me with a little sample program to see what you mean?

My initial reaction to the idea that getpid returns the same value is "no way"... I know that threads on this version of Linux all get their own pid and that getpid returns that different pid. In AS 3 there is a new thread model called NPTL which fixes many of the differences between LinuxThreads and true POSIX threads (for example, all threads should get the same pid when getpid is called).

I have a little test case which creates a bunch of threads, which I'll attach to this ticket. The test case calls getpid inside each thread and runs fine inside valgrind, but interestingly the test shows that each thread gets the same pid when run inside valgrind on AS 2.1. I'll attach the code/output to the valgrind ticket.

I've not tried to apply your patch yet to see if it fixes all the test cases that were failing.

Thanks,
Mike Simons
On AS 2.1, the VALGRIND_2_1_2 tag of valgrind locks up in the following regtest tests:
===
fdleak_cmsg
fdleak_ipv4
pth_atfork1
sigaltstack
coolo_sigaction
fork
susphello
syscall-restart1
syscall-restart2
system
===

On AS 3, the VALGRIND_2_1_2 tag of valgrind has no lockup problems running regtest.

I'll be applying the "patch" and trying AS 2.1 again.
Results of the "patch" are ... no lockups in the entire test suite on AS 2.1.

Here are the test results.
== 173 tests, 13 stderr failures, 0 stdout failures =================
cachegrind/tests/chdir (stderr)
cachegrind/tests/dlclose (stderr)
cachegrind/tests/fpu-28-108 (stderr)
cachegrind/tests/insn_basic (stderr)
cachegrind/tests/insn_cmov (stderr)
cachegrind/tests/insn_fpu (stderr)
cachegrind/tests/insn_mmx (stderr)
cachegrind/tests/insn_sse (stderr)
cachegrind/tests/insn_sse2 (stderr)
corecheck/tests/as_mmap (stderr)
memcheck/tests/new_nothrow (stderr)
memcheck/tests/zeropage (stderr)
none/tests/syscall-restart1 (stderr)
coregrind/vg_signals.c

2016    if (VG_(gettid)() == VG_(main_pid)) {
2017       VG_(printf)("got signal %d in LWP %d (%d)\n",
2018                   sigNo, VG_(gettid)(), VG_(gettid)(), VG_(main_pid));
2019       vg_assert(VG_(ksigismember)(&uc->uc_sigmask, sigNo));
2020    }

I think there is a bug on line 2018: "VG_(gettid)()" should only be used once, since there are only 3 '%d' formats but 4 arguments to printf.

I've also confirmed that AS 2.1 does not have a gettid function... so I guess it's calling getpid under the covers.
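A sketch of what the corrected call might look like, with one argument per '%d'. This is an illustration only: it uses standard snprintf in place of VG_(printf), and the function name format_signal_msg and its parameters are hypothetical stand-ins for sigNo, VG_(gettid)() and VG_(main_pid).

```c
#include <stdio.h>

/* Hypothetical stand-in for the VG_(printf) call on line 2018:
 * three %d conversions, three integer arguments.
 * Returns the number of characters snprintf would have written. */
int format_signal_msg(char *buf, size_t len, int sig_no, int lwp, int main_pid)
{
    return snprintf(buf, len, "got signal %d in LWP %d (%d)\n",
                    sig_no, lwp, main_pid);
}
```

With the values from the original report (signal 17, LWP 13473, main pid 13473) this produces the same "got signal 17 in LWP 13473 (13473)" line that appears in the log.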
You're quite right that traditionally getpid() has returned a different value for each thread on Linux. That is not what POSIX says it should do, however, or what it does on any other Unix system. Hence recent kernels have changed that behaviour so that it returns the same value in each thread.

At the same time (more or less) a new gettid() system call was introduced that does what getpid() used to do and returns the actual ID of the thread that calls it.

This isn't directly related to the linuxthreads->NPTL transition, although it was probably done to allow NPTL to be more POSIX compliant, and it may be that linuxthreads will override the getpid() function to give it the old behaviour - I'd have to try it to see.

The problem, however, is that it seems there are some kernels - though possibly only some RedHat ones with some patches - which have the getpid() change but don't yet have the gettid() call.

By the way, testing what the calls return under valgrind won't tell you much, as it is likely to emulate them. By the sounds of things my patch does fix your system though, which is the main point.
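To see the difference between the two calls on a given kernel (outside valgrind), something along these lines can be used. It is a rough sketch, not one of the attachments: struct thread_ids and ids_in_new_thread are invented names, and it assumes a system where SYS_gettid is defined in <sys/syscall.h>. Depending on the libc it may need to be linked with -pthread.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* IDs observed from inside one thread (hypothetical type). */
struct thread_ids {
    pid_t pid;  /* result of getpid() in that thread */
    long  tid;  /* result of the gettid system call, or -1 on failure */
};

/* Thread body: record what getpid and gettid report for this thread. */
static void *record_ids(void *arg)
{
    struct thread_ids *ids = arg;
    ids->pid = getpid();
    ids->tid = syscall(SYS_gettid);
    return NULL;
}

/* Start one thread, let it record its IDs into *out, and wait for it.
 * Returns 0 on success, -1 if the thread could not be created or joined. */
int ids_in_new_thread(struct thread_ids *out)
{
    pthread_t t;
    if (pthread_create(&t, NULL, record_ids, out) != 0)
        return -1;
    return pthread_join(t, NULL) == 0 ? 0 : -1;
}
```

On a kernel with the newer behaviour, the pid recorded in the new thread matches getpid() in the main thread while the tid differs. On a kernel without gettid, the tid field would come back as -1 with errno set to ENOSYS, which is the case the patch has to handle.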
Created attachment 7373 [details] Patch to make VG_(gettid) work I've just realised I posted the wrong patch before... This is what I had intended to post as the patch for this problem.
*** Bug 91131 has been marked as a duplicate of this bug. ***
CVS commit by thughes:

It seems there are some kernels around where the getpid system call has been changed to return the ID of the thread group leader but which do not have a gettid system call.

This breaks VG_(gettid) which assumes that the getpid system call will give the thread ID if the gettid system call does not exist.

The (horrible) solution is to use readlink to see where /proc/self points when the gettid system call fails.

BUG: 82114

  M +21 -2     vg_mylibc.c   1.93

--- valgrind/coregrind/vg_mylibc.c  #1.92:1.93
@@ -233,6 +233,25 @@ Int VG_(gettid)(void)
    ret = VG_(do_syscall)(__NR_gettid);
-   if (ret == -VKI_ENOSYS)
-      ret = VG_(do_syscall)(__NR_getpid);
+   if (ret == -VKI_ENOSYS) {
+      Char pid[16];
+
+      /*
+       * The gettid system call does not exist. The obvious assumption
+       * to make at this point would be that we are running on an older
+       * system where the getpid system call actually returns the ID of
+       * the current thread.
+       *
+       * Unfortunately it seems that there are some systems with a kernel
+       * where getpid has been changed to return the ID of the thread group
+       * leader but where the gettid system call has not yet been added.
+       *
+       * So instead of calling getpid here we use readlink to see where
+       * the /proc/self link is pointing...
+       */
+      if ((ret = VG_(do_syscall)(__NR_readlink, "/proc/self", pid, sizeof(pid))) >= 0) {
+         pid[ret] = '\0';
+         ret = VG_(atoll)(pid);
+      }
+   }
    return ret;
*** Bug 92559 has been marked as a duplicate of this bug. ***
 * The obvious assumption
 * to make at this point would be that we are running on an older
 * system where the getpid system call actually returns the ID of
 * the current thread.

 * Unfortunately it seems that there are some systems with a kernel
 * where getpid has been changed to return the ID of the thread group
 * leader but where the gettid system call has not yet been added.

The fix caters only to "some" machines where getpid returns the same value in every thread and gettid is not there. What happens to the old ones where getpid returns different numbers?

Even without the patch, using Mike Simons' sample code, I find:

without valgrind
child 7 running self 8194, pid 20082
child 6 running self 16387, pid 20083
child 5 running self 24580, pid 20084
child 4 running self 32773, pid 20085
child 3 running self 40966, pid 20086
child 2 running self 49159, pid 20087
child 1 running self 57352, pid 20088
parent running
child 0 running self 65545, pid 20089

with valgrind
parent running
child 7 running self 2, pid 13300
child 6 running self 3, pid 13300
child 5 running self 4, pid 13300
child 4 running self 5, pid 13300
child 3 running self 6, pid 13300
child 2 running self 7, pid 13300
child 1 running self 8, pid 13300
child 0 running self 9, pid 13300
got signal 2 in LWP 13300 (13300)

valgrind: vg_signals.c:2024 (vg_async_signalhandler): Assertion `vgPlain_ksigismember(&uc->uc_sigmask, sigNo)' failed.

I am on:
Linux version 2.4.9-e.57 (bhcompile@tweety.build.redhat.com) (gcc version 2.96 20000731 (Red Hat Linux 7.2 2.96-129.7.2)) #1 Thu Dec 2 20:56:19 EST 2004
Created attachment 9749 [details] daemon.c
Created attachment 9750 [details] valgrind_logfile
Here I have tried to explain the symptoms I noticed on a machine that does not have gettid and has only getpid to return unique ids for each thread.

The attachment daemon.c above creates a daemon process. The machine is:
Linux version 2.4.9-e.57 (bhcompile@tweety.build.redhat.com) (gcc version 2.96 20000731 (Red Hat Linux 7.2 2.96-129.7.2)) #1 Thu Dec 2 20:56:19 EST 2004

without valgrind it does create one:
------------------------------------
./itsadaemon
This is a Father 1 10168
This is a father 2 10169
[aime2@stamr02 temp]$ Daemon : PID 10170 PPID 1

with valgrind
-------------
$ ../bin/valgrind --tool=addrcheck ./itsadaemon
This is a Father 1 9869
This is a father 2 9871
[aime2@stamr02 temp]$ Daemon : PID 9873 PPID 0

The attachment id=9750 is the valgrind logfile from this run. The following two lines are found in the logfile.

proxy 9872 for tid 1 exited status -1, res 0
proxy 9870 for tid 1 exited status -1, res 0

some other time
---------------
$ ../bin/valgrind --tool=addrcheck ./itsadaemon
This is a Father 1 9808
This is a father 2 9810
[aime2@stamr02 temp]$ Daemon : PID 9812 PPID 6450286

This was done on the Valgrind 2.1.2 release with the earlier suggested patch (attachment id=7373) and with --trace-children=yes.

The machine crashes most of the time when I try to kill the valgrind process.
The patch is still valid for systems where getpid returns different values as reading the /proc/self link also gives different values on such systems. What you are looking at is internal valgrind code - it has no effect on what the getpid and/or gettid system calls return in the program being run under valgrind. You are correct that in current releases all threads in the client program will appear to have the same pid because they all run in one kernel thread. This is fixed in the current CVS code where each thread runs in a separate kernel thread.