Bug 358213

Summary: helgrind/drd bar_bad testcase hangs or crashes with new glibc pthread barrier implementation
Product: [Developer tools] valgrind Reporter: Mark Wielaard <mark>
Component: helgrindAssignee: Julian Seward <jseward>
Status: RESOLVED FIXED    
Severity: normal CC: alex.kanavin, ivosh, mips32r2, philippe.waroquiers
Priority: NOR    
Version: 3.11 SVN   
Target Milestone: ---   
Platform: Other   
OS: Linux   
Latest Commit: Version Fixed In:
Attachments: pthread_barrier-vs-newer-glibc-implementation
pthread_barrier-vs-newer-glibc-implementation-drd-helgrind
Make bar_bad test more deterministic.
make bar_bad test more deterministic, attempt two

Description Mark Wielaard 2016-01-19 15:13:20 UTC
glibc 2.23-pre got a new pthread barrier implementation:
https://sourceware.org/ml/libc-alpha/2016-01/msg00338.html
It reacts differently on bad usage of a barrier like we do in bar_bad.vgtest to see whether helgrind detects the bad usage. In particular it will hang the bar_bad testcase.

Reproducible: Always
Comment 1 Mark Wielaard 2016-01-19 15:15:15 UTC
Created attachment 96738 [details]
pthread_barrier-vs-newer-glibc-implementation

These changes make the bar_bad testcase PASS against both the old and new implementation.
Comment 2 Mark Wielaard 2016-01-20 13:01:00 UTC
The patch to the testscase only works for helgrind, but the test program is also used under drd. It needs some different changes or you will get two failures:

drd/tests/bar_bad                        (stderr)
drd/tests/bar_bad_xml                    (stderr)
Comment 3 Mark Wielaard 2016-01-21 12:10:02 UTC
The problem fixing this for drd is that DRD_(barrier_destroy) is called after pthread_barrier_destroy. But that call might now hang and then bar_bad invokes a watchdog to exit the program, so the post handler is never called.
Comment 4 Mark Wielaard 2016-01-21 12:51:28 UTC
Created attachment 96765 [details]
pthread_barrier-vs-newer-glibc-implementation-drd-helgrind

Simplest option to solve the drd issue is to do like in the helgrind case and have two variants of the exp files. One of which will have pthread_barrier_destroy hang, and so then drd will just not see that it wasn't a barrier. Which isn't ideal, but since it is the last test in the batch it seems it doesn't impact things too much.
Comment 5 Mark Wielaard 2016-01-21 14:55:20 UTC
Note that patch is missing the drd/test/Makefile.am addition of the two extra exp file:

diff --git a/drd/tests/Makefile.am b/drd/tests/Makefile.am
index 2885391..cfd74d0 100644
--- a/drd/tests/Makefile.am
+++ b/drd/tests/Makefile.am
@@ -81,8 +81,10 @@ EXTRA_DIST =                                        \
        atomic_var.stderr.exp                       \
        atomic_var.vgtest                           \
        bar_bad.stderr.exp                          \
+       bar_bad.stderr.exp-nohang                   \
        bar_bad.vgtest                              \
        bar_bad_xml.stderr.exp                      \
+       bar_bad_xml.stderr.exp-nohang               \
        bar_bad_xml.vgtest                          \
        bar_trivial.stderr.exp                      \
        bar_trivial.stdout.exp                      \
Comment 6 Mark Wielaard 2016-09-19 14:20:55 UTC
I checked in a workaround for the hang based on the attachement as valgrind svn r15962.

This does make sure that the tests don't hang indefenitely. But they do introduce (more) non-determinism that occassionally causes these tests to fail or even trigger an internal drd assert (drd_barrier.c:352 (vgDrd_barrier_pre_wait): Assertion 'p' failed.)
Comment 7 Julian Seward 2016-10-19 12:02:40 UTC
Should we close this now?
Comment 8 Mark Wielaard 2016-10-19 12:09:38 UTC
(In reply to Julian Seward from comment #7)
> Should we close this now?

No, I don't think it should. There is now a workaround in place that makes sure the test doesn't hang. But now that test (non-deterministically) fails or even crashes valgrind itself.
Comment 9 Petar Jovanovic 2016-11-09 16:24:43 UTC
Created attachment 102144 [details]
Make bar_bad test more deterministic.

At some MIPS boards, thread slp2 may end before ext1 ends and that makes the test fail again. One of my colleagues suggested a patch in which the main thread waits for slp2 termination. This makes the test more deterministic.
Comment 10 Petar Jovanovic 2016-11-15 18:20:28 UTC
Created attachment 102247 [details]
make bar_bad test more deterministic, attempt two

The previous patch will break DRD. Instead of pthread_join(), we can experiment with pthread_cancel().
Comment 11 Petar Jovanovic 2016-11-23 17:45:14 UTC
(In reply to Petar Jovanovic from comment #10)
> Created attachment 102247 [details]
> make bar_bad test more deterministic, attempt two
> 
> The previous patch will break DRD. Instead of pthread_join(), we can
> experiment with pthread_cancel().

This patch, slightly modified, was committed in r16154.
Comment 12 Philippe Waroquiers 2016-11-29 22:54:54 UTC
Was fixed in r16154, fix announced in NEWS revision 16165
Comment 13 Alexander Kanavin 2020-12-12 21:42:45 UTC
Sadly I am still seeing the sporadic failures in bar_bad/bar_bad_xml. Reported here:
https://bugs.kde.org/show_bug.cgi?id=430321