Bug 410599 - Non-deterministic behaviour of pth_self_kill_15_other test
Summary: Non-deterministic behaviour of pth_self_kill_15_other test
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: general (show other bugs)
Version: 3.15 SVN
Platform: Other Linux
: NOR normal
Target Milestone: ---
Assignee: Julian Seward
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-08-05 06:27 UTC by Stefan Maksimovic
Modified: 2019-08-13 14:56 UTC (History)
2 users (show)

See Also:
Latest Commit:
Version Fixed In:


Attachments
pth_self_kill.patch (850 bytes, text/plain)
2019-08-09 08:44 UTC, Stefan Maksimovic
Details
pth_self_kill.patch v2 (954 bytes, text/plain)
2019-08-12 10:31 UTC, Stefan Maksimovic
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Stefan Maksimovic 2019-08-05 06:27:22 UTC
A recent commit https://sourceware.org/git/?p=valgrind.git;a=commit;h=63a9f0793113fd5d828ea7b6183812ad71f924f1
has introduced a test which exhibits different behaviour on some platforms.

Namely, running the pth_self_kill_15_other test on these can end in either of the following:
1) the spawned thread finishes first
2) the main thread finishes first

Running the test multiple times in succession we observed that on x86 the test finishes as described in the 2) case
whereas on others either of the two cases can be present.
We have seen this behaviour on different arm and mips platforms.

In the 2) case the output we get corresponds with the .exp file while in the 1) case we get an extra 'Terminated' string from the kernel on stderr.

Moreover, we ran the test on arm/mips without the functionality the rest of that patch provides, to test whether it really hangs/loops on arm/mips or not.
Interestingly the pth_self_kill_9 test behaves the same on arm/mips and x86 whereas the pth_self_kill_15_other does finish on arm/mips
(it prints the 'Terminated' message - the spawned thread finishes first).

A possible solution would be to make the test deterministic; one way would consist of inserting a pthread_join call.
That would alter the test in terms of the output produced but we believe that the nature of the test itself would remain intact.
Reading the commit message which introduced the tests, we gather that the purpose was to test two scenarios(loop/hang) which the
commit was created to solve.
In case the above suggested change would not disrupt the intended functionality of the test, would it be applicable?

What course of action would you recommend?
Comment 1 Philippe Waroquiers 2019-08-07 19:11:28 UTC
(In reply to Stefan Maksimovic from comment #0)
> A recent commit
> https://sourceware.org/git/?p=valgrind.git;a=commit;
> h=63a9f0793113fd5d828ea7b6183812ad71f924f1
> has introduced a test which exhibits different behaviour on some platforms.
> 
> Namely, running the pth_self_kill_15_other test on these can end in either
> of the following:
> 1) the spawned thread finishes first
> 2) the main thread finishes first
> 
> Running the test multiple times in succession we observed that on x86 the
> test finishes as described in the 2) case
> whereas on others either of the two cases can be present.
> We have seen this behaviour on different arm and mips platforms.
> 
> In the 2) case the output we get corresponds with the .exp file while in the
> 1) case we get an extra 'Terminated' string from the kernel on stderr.
> 
> Moreover, we ran the test on arm/mips without the functionality the rest of
> that patch provides, to test whether it really hangs/loops on arm/mips or
> not.
> Interestingly the pth_self_kill_9 test behaves the same on arm/mips and x86
> whereas the pth_self_kill_15_other does finish on arm/mips
> (it prints the 'Terminated' message - the spawned thread finishes first).
> 
> A possible solution would be to make the test deterministic; one way would
> consist of inserting a pthread_join call.
> That would alter the test in terms of the output produced but we believe
> that the nature of the test itself would remain intact.
> Reading the commit message which introduced the tests, we gather that the
> purpose was to test two scenarios(loop/hang) which the
> commit was created to solve.
> In case the above suggested change would not disrupt the intended
> functionality of the test, would it be applicable?
> 
> What course of action would you recommend?

Thanks for looking at this (this part of the code and the related tests
 are very tricky).

I suggest you reproduce the bug by using the test program
and the previous version of Valgrind.

Then modify the test as you want to make it deterministic, but
verify that the test still triggers the bug with the old version
of Valgrind.

Thanks
Comment 2 Stefan Maksimovic 2019-08-09 08:44:19 UTC
Created attachment 122034 [details]
pth_self_kill.patch

Here's our attempt at making the test deterministic:
We decided on using a pthread_join call where the main thread would wait for the spawned one.
This had it's shortcomings as it did not work for the pth_self_kill_15_other case.

To work around this we made the spawned thread to wait for the parent by modifying the signal handler for SIGTERM,
since that was the case which did not work for our previous attempt.

With this we managed to both reproduce the loop/hang case without the commit mentioned in the bug description and
to run the test without those problems with the commit on x86.

If it's not too much trouble, I suggest you test it yourself just to make sure.
Comment 3 Philippe Waroquiers 2019-08-10 09:36:53 UTC
(In reply to Stefan Maksimovic from comment #2)
> If it's not too much trouble, I suggest you test it yourself just to make
> sure.
I tested, and the modified test still reproduces the bug with the old
release (and is working ok with the new release).

However, the test itself is now using uninitialised variables:

==32132== Syscall param rt_sigaction(act->sa_mask) points to uninitialised byte(s)
==32132==    at 0x487D800: __libc_sigaction (sigaction.c:58)
==32132==    by 0x109266: main (pth_self_kill.c:39)
==32132==  Address 0x1ffefffd08 is on thread 1's stack
==32132==  in frame #0, created by __libc_sigaction (sigaction.c:43)
==32132== 
==32132== Syscall param rt_sigaction(act->sa_flags) points to uninitialised byte(s)
==32132==    at 0x487D800: __libc_sigaction (sigaction.c:58)
==32132==    by 0x109266: main (pth_self_kill.c:39)
==32132==  Address 0x1ffefffcf8 is on thread 1's stack
==32132==  in frame #0, created by __libc_sigaction (sigaction.c:43)


That should better be fixed to ensure we have a deterministic test :).
Comment 4 Stefan Maksimovic 2019-08-12 10:31:39 UTC
Created attachment 122077 [details]
pth_self_kill.patch v2

Thanks Philippe, validating the test through memcheck slipped my mind.

I've updated the patch by initializing the variables reported by memcheck, it should be fine now.
Comment 5 Philippe Waroquiers 2019-08-12 20:23:26 UTC
(In reply to Stefan Maksimovic from comment #4)
> Created attachment 122077 [details]
> pth_self_kill.patch v2
> 
> Thanks Philippe, validating the test through memcheck slipped my mind.
> 
> I've updated the patch by initializing the variables reported by memcheck,
> it should be fine now.

The patch looks ok to me and can be pushed.
Thanks for the work