Valgrind hangs after SIGKILLed STEPS TO REPRODUCE 1. Create a multi-threaded application. 2. Raise the SIGKILL in the code of the application. 3. Run the application under memcheck. OBSERVED RESULT Valgrind hangs EXPECTED RESULT Valgrind terminates SOFTWARE/OS VERSIONS RedHat/Linux ADDITIONAL INFORMATION # strace -ff valgind <the app> -c 'sigraise 9' shows SIGKILL neither sent nor received by any thread; it just shows the main thread exits and the second thread getting stuck waiting on a futex.
uname -a Linux localhost.localdomain 3.10.0-862.20.2.vz7.73.29 #1 SMP Thu Feb 21 17:35:43 MSK 2019 x86_64 x86_64 x86_64 GNU/Linux
Ran in bash script as valgrind --log-file="${VALGRIND_LOGFILE}" --error-exitcode=99 "$PROG_NAME"
I suspect this is hard to avoid given the way threading works in valgrind - my guess is that the signal is only killing the active thread, and because it's SIGKILL it's not detectable by valgrind. That leaves the other threads waiting to be given permission to run because valgrind uses a pipe to ensure that only one thread can run at a time. I'm slightly surprised that SIGKILL doesn't act on the whole process but that's just off the top of my head. You might find --fair-sched=yes works better because it uses a different system for doing the inter-thread locking that might manage to detect the failing thread. No guarantees though. Killing yourself with SIGKILL is a pretty evil thing to do in any case.
Thank you Tom for your quick response. I have run the Valgrind with the key '--fair-sched=yes' and got the same hang issue. I agree with you about the 'pretty evil thing' but that is a unit test case.
IMO, this is supposed to work: If the application is sending SIGKILL to itself, the syscall is intercepted, and some special handling is suppose to happen to ensure the process dies. See e.g. m_signals.c async_signalhandler and/or syswrap-generic.c ML_(do_sigkill). So, if it does not work, this looks to be a real bug. do you have a small compilable reproducer ?
Sorry about using the application specific command 'sigraise 9' in the description. It stands for calling 'kill -9' from within the application that calls raise(9) in the main thread. The reproducer is comming soon...
Repro: $ cat x.c #include <signal.h> #include <unistd.h> #include <pthread.h> #include <stdlib.h> void *t(void *p) { sleep(1000); return NULL; } int main(int argc, char **argv) { pthread_t thr; int s = atoi(argv[1]); pthread_create(&thr, NULL, t, NULL); raise(s); } $ gcc -pthread -g x.c -o x $ valgrind ./x 15 ==18380== Memcheck, a memory error detector ==18380== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al. ==18380== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info ==18380== Command: ./x 15 ==18380== ==18380== ==18380== Process terminating with default action of signal 15 (SIGTERM) ==18380== at 0x4881D15: raise (in /usr/lib64/libpthread-2.29.so) ==18380== by 0x4011BC: main (x.c:20) ==18380== ==18380== HEAP SUMMARY: ==18380== in use at exit: 272 bytes in 1 blocks ==18380== total heap usage: 1 allocs, 0 frees, 272 bytes allocated ==18380== ==18380== LEAK SUMMARY: ==18380== definitely lost: 0 bytes in 0 blocks ==18380== indirectly lost: 0 bytes in 0 blocks ==18380== possibly lost: 272 bytes in 1 blocks ==18380== still reachable: 0 bytes in 0 blocks ==18380== suppressed: 0 bytes in 0 blocks ==18380== Rerun with --leak-check=full to see details of leaked memory ==18380== ==18380== For lists of detected and suppressed errors, rerun with: -s ==18380== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0) Terminated $ valgrind ./x 9 ==18382== Memcheck, a memory error detector ==18382== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al. ==18382== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info ==18382== Command: ./x 9 ==18382== ... Here it the main thread becomes zombie, the child thread sleeps until the expiration of 1000s. In the real-world case the child thread did a wait on a mutex which never got released.
Thanks for the small reproducer. This small test case is revealing a bunch of problems related to termination of a process when it kills itself, and some problems in the gdbserver debug tracing. This last thing was easily fixable (commit 90d831171). Otherwise, it looks like there are at least 2 problems: * the hang reported here * but a similar hang in case the main thread is sending the signal to the other thread. We then seem to have a race condition between the main thread that exits, and the other thread that tries to kill the process.
See also bug https://bugs.kde.org/show_bug.cgi?id=372600 This bug seems somewhat related/similar to the above.
Thank you for taking the issue into account. Hopefully, it will be resolved.
A patch fixing this problem (and also bug 409367) has been attached to bug 409367. If no remarks on the approach, I will push in a few days.
Philippe, I looked at this just now. The analysis and patch seem reasonable to me, so +1 for landing it.
Thanks for the review. Pushed as 63a9f0793