Bug 409141 - Valgrind hangs when SIGKILLed
Summary: Valgrind hangs when SIGKILLed
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: memcheck (show other bugs)
Version: 3.13.0
Platform: RedHat Enterprise Linux Linux
: NOR normal
Target Milestone: ---
Assignee: Philippe Waroquiers
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-06-24 15:10 UTC by Andrey Shinkevich
Modified: 2019-07-10 22:37 UTC (History)
4 users (show)

See Also:
Latest Commit:
Version Fixed In:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Andrey Shinkevich 2019-06-24 15:10:05 UTC
Valgrind hangs after SIGKILLed


STEPS TO REPRODUCE
1. Create a multi-threaded application.
2. Raise the SIGKILL in the code of the application.
3. Run the application under memcheck.

OBSERVED RESULT
Valgrind hangs

EXPECTED RESULT
Valgrind terminates

SOFTWARE/OS VERSIONS
RedHat/Linux

ADDITIONAL INFORMATION
# strace -ff valgind <the app> -c 'sigraise 9'
shows SIGKILL neither sent nor received by any thread; it just shows the
main thread exits and the second thread getting stuck waiting on a futex.
Comment 1 Andrey Shinkevich 2019-06-24 15:17:11 UTC
uname -a
Linux localhost.localdomain 3.10.0-862.20.2.vz7.73.29 #1 SMP Thu Feb 21 17:35:43 MSK 2019 x86_64 x86_64 x86_64 GNU/Linux
Comment 2 Andrey Shinkevich 2019-06-24 15:24:33 UTC
Ran in bash script as
valgrind --log-file="${VALGRIND_LOGFILE}" --error-exitcode=99 "$PROG_NAME"
Comment 3 Tom Hughes 2019-06-24 16:07:09 UTC
I suspect this is hard to avoid given the way threading works in valgrind - my guess is that the signal is only killing the active thread, and because it's SIGKILL it's not detectable by valgrind. That leaves the other threads waiting to be given permission to run because valgrind uses a pipe to ensure that only one thread can run at a time.

I'm slightly surprised that SIGKILL doesn't act on the whole process but that's just off the top of my head.

You might find --fair-sched=yes works better because it uses a different system for doing the inter-thread locking that might manage to detect the failing thread. No guarantees though.

Killing yourself with SIGKILL is a pretty evil thing to do in any case.
Comment 4 Andrey Shinkevich 2019-06-24 16:41:00 UTC
Thank you Tom for your quick response. I have run the Valgrind with the key '--fair-sched=yes' and got the same hang issue. I agree with you about the 'pretty evil thing' but that is a unit test case.
Comment 5 Philippe Waroquiers 2019-06-24 19:48:52 UTC
IMO, this is supposed to work:
If the application is sending SIGKILL to itself,
the syscall is intercepted, and some special handling is suppose
to happen to ensure the process dies.

See e.g. m_signals.c async_signalhandler
and/or syswrap-generic.c ML_(do_sigkill).

So, if it does not work, this looks to be a real bug.

do you have a small compilable reproducer ?
Comment 6 Andrey Shinkevich 2019-06-25 13:35:55 UTC
Sorry about using the application specific command 'sigraise 9' in the description. It stands for calling 'kill -9' from within the application that calls raise(9) in the main thread.
The reproducer is comming soon...
Comment 7 Roman Kagan 2019-06-25 15:17:37 UTC
Repro:

$ cat x.c
#include <signal.h>
#include <unistd.h>
#include <pthread.h>
#include <stdlib.h>

void *t(void *p)
{
        sleep(1000);
        return NULL;
}

int main(int argc, char **argv)
{
        pthread_t thr;

        int s = atoi(argv[1]);

        pthread_create(&thr, NULL, t, NULL);

        raise(s);
}
$ gcc -pthread -g x.c -o x
$ valgrind ./x 15
==18380== Memcheck, a memory error detector
==18380== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==18380== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==18380== Command: ./x 15
==18380==
==18380==
==18380== Process terminating with default action of signal 15 (SIGTERM)
==18380==    at 0x4881D15: raise (in /usr/lib64/libpthread-2.29.so)
==18380==    by 0x4011BC: main (x.c:20)
==18380==
==18380== HEAP SUMMARY:
==18380==     in use at exit: 272 bytes in 1 blocks
==18380==   total heap usage: 1 allocs, 0 frees, 272 bytes allocated
==18380==
==18380== LEAK SUMMARY:
==18380==    definitely lost: 0 bytes in 0 blocks
==18380==    indirectly lost: 0 bytes in 0 blocks
==18380==      possibly lost: 272 bytes in 1 blocks
==18380==    still reachable: 0 bytes in 0 blocks
==18380==         suppressed: 0 bytes in 0 blocks
==18380== Rerun with --leak-check=full to see details of leaked memory
==18380==
==18380== For lists of detected and suppressed errors, rerun with: -s
==18380== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Terminated
$ valgrind ./x 9
==18382== Memcheck, a memory error detector
==18382== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==18382== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==18382== Command: ./x 9
==18382==
...
Here it the main thread becomes zombie, the child thread sleeps until the
expiration of 1000s.  In the real-world case the child thread did a wait on a
mutex which never got released.
Comment 8 Philippe Waroquiers 2019-06-25 20:56:40 UTC
Thanks for the small reproducer.

This small test case is revealing a bunch of problems related
to termination of a process when it kills itself,
and some problems in the gdbserver debug tracing.
This last thing was easily fixable (commit 90d831171).

Otherwise, it looks like there are at least 2 problems:
  * the hang reported here
  * but a similar hang in case the main thread is sending the signal
    to the other thread.  We then seem to have a race condition between
    the main thread that exits, and the other thread that tries to kill
    the process.
Comment 9 Philippe Waroquiers 2019-06-25 21:09:31 UTC
See also bug https://bugs.kde.org/show_bug.cgi?id=372600

This bug seems somewhat related/similar to the above.
Comment 10 Andrey Shinkevich 2019-06-26 09:30:15 UTC
Thank you for taking the issue into account. Hopefully, it will be resolved.
Comment 11 Philippe Waroquiers 2019-07-02 20:31:00 UTC
A patch fixing this problem (and also bug 409367) has been
attached to bug 409367.

If no remarks on the approach, I will push in a few days.
Comment 12 Julian Seward 2019-07-10 15:45:37 UTC
Philippe, I looked at this just now.  The analysis and patch seem
reasonable to me, so +1 for landing it.
Comment 13 Philippe Waroquiers 2019-07-10 22:37:24 UTC
Thanks for the review.
Pushed as 63a9f0793