Bug 361615 - Inconsistent termination for multithreaded process terminated by signal
Summary: Inconsistent termination for multithreaded process terminated by signal
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: memcheck (show other bugs)
Version: 3.11 SVN
Platform: Ubuntu Linux
: NOR wishlist
Target Milestone: ---
Assignee: Julian Seward
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-04-10 22:54 UTC by earl_chew
Modified: 2016-09-24 12:08 UTC (History)
2 users (show)

See Also:
Latest Commit:
Version Fixed In:


Attachments
Candidate patch to propagate termination signal (3.61 KB, patch)
2016-04-10 23:11 UTC, earl_chew
Details
Self contained test case (deleted)
2016-07-18 06:47 UTC, earl_chew
Details
Standalone test case (1.64 KB, text/plain)
2016-09-17 17:27 UTC, earl_chew
Details

Note You need to log in before you can comment on or make changes to this bug.
Description earl_chew 2016-04-10 22:54:18 UTC
The valgrind code attempts to terminate the program in a way that is consistent with the disposition of a signal that is not handled by the instrumented program. Unfortunately, the results depend upon whether the unhandled signal is received by the main thread, or another thread.

Reproducible: Always

Steps to Reproduce:
Running the following test script (which compiles and runs the test program) illustrates the problem:

-------------------------------------------------------
#!/bin/sh

test()
{
    rm -f foo.pid
    "$@" > foo.pid 2>/dev/null &
    PID=$!
    while [ "$(wc -l < foo.pid)" -lt 2 ] ; do
        sleep 1
    done
    head -1 foo.pid
    kill $(tail -1 foo.pid)
    wait $PID
    echo $?
}

gcc -o foo foo.c -lpthread

echo Plain
test ./foo
test ./foo x
echo ''

valgrind --version
test valgrind ./foo
test valgrind ./foo x
echo ''

valgrind/bin/valgrind --version
test valgrind/bin/valgrind ./foo
test valgrind/bin/valgrind ./foo x
echo ''
-------------------------------------------------------
#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <pthread.h>

void *
slavethread(void *arg)
{
    while (1)
        sleep(1);
}

int main(int argc, char **argv)
{
    pthread_t slave;

    if (argc > 1)
        printf("unblocked\n");
    else
    {
        if (pthread_create(&slave, 0, &slavethread, 0))
            return 255;

        sigset_t sigmask;

        if (sigfillset(&sigmask))
            return 255;

        if (sigprocmask(SIG_BLOCK, &sigmask, 0))
            return 255;

        printf("blocked\n");
    }

    printf("%d\n", getpid());
    fflush(stdout);

    while (1)
        sleep(1);

    return 0;
}

Actual Results:  
$ sh foo.sh
Plain
blocked
143
unblocked
143

valgrind-3.10.0.SVN
blocked
137
unblocked
143

valgrind-3.11.0
blocked
137
unblocked
143


Expected Results:  
Ideally, all the test cases should show:
--------------------
blocked
143
unblocked
143
--------------------

In the absence of that, the exit codes between the blocked and unblocked cases should be the same (and not show 137 in the blocked case, and 143 in the unblocked case).
Comment 1 earl_chew 2016-04-10 23:11:28 UTC
Created attachment 98326 [details]
Candidate patch to propagate termination signal

Here is an attempt to propagate the signal that terminates a thread into the entire program.
Comment 2 Julian Seward 2016-07-05 07:07:57 UTC
Do you have a small test case which demonstrates that Valgrind's behaviour at
present, differs from when the program is run "natively" ?  I'd be happier about
this if you have something directly that demonstrates that V's behaviour differs
from that of the kernel, in this respect.
Comment 3 earl_chew 2016-07-07 07:19:31 UTC
(In reply to Julian Seward from comment #2)
> Do you have a small test case which demonstrates that Valgrind's behaviour at
> present, differs from when the program is run "natively" ?

In the bug description, I pasted a small C program and accompanying shell script that demonstrates the difference between a natively run program, and the same running under valgrind.

Are you looking for something different ?
Comment 4 earl_chew 2016-07-18 06:47:58 UTC
Created attachment 100145 [details]
Self contained test case
Comment 5 earl_chew 2016-07-18 06:50:11 UTC
I compiled and ran the submitted test case both with and without valgrind. This shows that under valgrind, the child process terminates with SIGKILL rather than the expected SIGTERM.

$ gcc -o test test.c -lpthread && ./test
Signalled with 15

$ gcc -o test test.c -lpthread && valgrind ./test
==14625== Memcheck, a memory error detector
==14625== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==14625== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==14625== Command: ./test
==14625== 
==14626== 
==14626== Process terminating with default action of signal 15 (SIGTERM)
==14626==    at 0x4124B96: ??? (syscall-template.S:81)
==14626==    by 0x412493C: sleep (sleep.c:137)
==14626==    by 0x804868E: slavethread (in /tmp/test)
==14626==    by 0x4058F71: start_thread (pthread_create.c:312)
==14626==    by 0x415AF8D: clone (clone.S:129)
==14626== 
==14626== HEAP SUMMARY:
==14626==     in use at exit: 136 bytes in 1 blocks
==14626==   total heap usage: 1 allocs, 0 frees, 136 bytes allocated
==14626== 
==14626== LEAK SUMMARY:
==14626==    definitely lost: 0 bytes in 0 blocks
==14626==    indirectly lost: 0 bytes in 0 blocks
==14626==      possibly lost: 136 bytes in 1 blocks
==14626==    still reachable: 0 bytes in 0 blocks
==14626==         suppressed: 0 bytes in 0 blocks
==14626== Rerun with --leak-check=full to see details of leaked memory
==14626== 
==14626== For counts of detected and suppressed errors, rerun with: -v
==14626== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Signalled with 9
test: test.c:52: main: Assertion `15 == (((__extension__ (((union { __typeof(status) __in; int __i; }) { .__in = (status) }).__i))) & 0x7f)' failed.
==14625== 
==14625== Process terminating with default action of signal 6 (SIGABRT)
==14625==    at 0x409D687: raise (raise.c:56)
==14625==    by 0x40A0AB2: abort (abort.c:89)
==14625==    by 0x40967C6: __assert_fail_base (assert.c:92)
==14625==    by 0x4096876: __assert_fail (assert.c:101)
==14625==    by 0x8048816: main (in /tmp/test)
==14625== 
==14625== HEAP SUMMARY:
==14625==     in use at exit: 0 bytes in 0 blocks
==14625==   total heap usage: 3 allocs, 3 frees, 550 bytes allocated
==14625== 
==14625== All heap blocks were freed -- no leaks are possible
==14625== 
==14625== For counts of detected and suppressed errors, rerun with: -v
==14625== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Aborted
Comment 6 Julian Seward 2016-09-13 17:07:06 UTC
Philippe, didn't you fix something like this recently?
Comment 7 Philippe Waroquiers 2016-09-13 20:05:14 UTC
(In reply to Julian Seward from comment #6)
> Philippe, didn't you fix something like this recently?
Not recently.
Related to thread termination, I did something some years ago, to fix bug 324227.
I can take a look at this bug and see if I find something.
Comment 8 Philippe Waroquiers 2016-09-13 20:05:53 UTC
(In reply to earl_chew from comment #4)
> Created attachment 100145 [details]
> Self contained test case

It seems there was an attachment problem.
Could you re-attach the self contained test case ?
Thanks
Comment 9 earl_chew 2016-09-14 00:26:41 UTC
Comment on attachment 100145 [details]
Self contained test case

.
Comment 10 Ivo Raisr 2016-09-14 04:31:16 UTC
As Philippe pointed out, there is a problem with the existing attachment.
Please have a look what is there inside and re-attach.
Comment 11 earl_chew 2016-09-14 05:15:43 UTC
(In reply to Ivo Raisr from comment #10)
> As Philippe pointed out, there is a problem with the existing attachment.
> Please have a look what is there inside and re-attach.

Yes, that's obviously wrong. Let me resurrect the intended content.
Comment 12 Ben Cooksley 2016-09-14 07:53:57 UTC
The content of attachment 100145 [details] has been deleted for the following reason:

Removing non-relevant content
Comment 13 earl_chew 2016-09-17 17:27:54 UTC
Created attachment 101149 [details]
Standalone test case

I could not find the previously mentioned standalone test case, so I've uploaded a reconstructed version that illustrates the same behaviour.

Note also that I have included a candidate patch with this ticket: https://bugs.kde.org/attachment.cgi?id=98326

When run outside valgrind, the test program prints:
---------------------------------------------------
$ gcc -o test test.c -lpthread && ./test ; echo $?
Signal 15
0
---------------------------------------------------

When run under valgrind, the result is different:
---------------------------------------------------
$ gcc -o test test.c -lpthread && valgrind ./test ; echo $?
==3351== Memcheck, a memory error detector
==3351== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==3351== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==3351== Command: ./test
==3351== 
==3352== 
==3352== Process terminating with default action of signal 15 (SIGTERM)
==3352==    at 0x405DF3F: pthread_sigmask (pthread_sigmask.c:48)
==3352==    by 0x804875E: slavethread (in /home/earl/test)
==3352==    by 0x4058F71: start_thread (pthread_create.c:312)
==3352==    by 0x415AF8D: clone (clone.S:129)
==3352== 
==3352== HEAP SUMMARY:
==3352==     in use at exit: 136 bytes in 1 blocks
==3352==   total heap usage: 1 allocs, 0 frees, 136 bytes allocated
==3352== 
==3352== LEAK SUMMARY:
==3352==    definitely lost: 0 bytes in 0 blocks
==3352==    indirectly lost: 0 bytes in 0 blocks
==3352==      possibly lost: 136 bytes in 1 blocks
==3352==    still reachable: 0 bytes in 0 blocks
==3352==         suppressed: 0 bytes in 0 blocks
==3352== Rerun with --leak-check=full to see details of leaked memory
==3352== 
==3352== For counts of detected and suppressed errors, rerun with: -v
==3352== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Signal 9
test: test.c:88: main: Assertion `(((__extension__ (((union { __typeof(status) __in; int __i; }) { .__in = (status) }).__i))) & 0x7f) == 15' failed.
==3351== 
==3351== Process terminating with default action of signal 6 (SIGABRT)
==3351==    at 0x409D687: raise (raise.c:56)
==3351==    by 0x40A0AB2: abort (abort.c:89)
==3351==    by 0x40967C6: __assert_fail_base (assert.c:92)
==3351==    by 0x4096876: __assert_fail (assert.c:101)
==3351==    by 0x80489E5: main (in /home/earl/test)
==3351== 
==3351== HEAP SUMMARY:
==3351==     in use at exit: 0 bytes in 0 blocks
==3351==   total heap usage: 3 allocs, 3 frees, 550 bytes allocated
==3351== 
==3351== All heap blocks were freed -- no leaks are possible
==3351== 
==3351== For counts of detected and suppressed errors, rerun with: -v
==3351== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Aborted
134
---------------------------------------------------
Comment 14 Philippe Waroquiers 2016-09-24 12:08:20 UTC
Fixed in revision 15982.

Note that the fix committed consists in calling
 VG_(reap_threads)(tid);
after nuke all threads.

Thanks for the bug report and the test case.