Bug 358213

Summary:	helgrind/drd bar_bad testcase hangs or crashes with new glibc pthread barrier implementation
Product:	[Developer tools] valgrind	Reporter:	Mark Wielaard <mark>
Component:	helgrind	Assignee:	Julian Seward <jseward>
Status:	RESOLVED FIXED
Severity:	normal	CC:	alex.kanavin, ivosh, mips32r2, philippe.waroquiers
Priority:	NOR
Version:	3.11 SVN
Target Milestone:	---
Platform:	Other
OS:	Linux
Latest Commit:		Version Fixed In:
Attachments:	pthread_barrier-vs-newer-glibc-implementation pthread_barrier-vs-newer-glibc-implementation-drd-helgrind Make bar_bad test more deterministic. make bar_bad test more deterministic, attempt two

Description Mark Wielaard 2016-01-19 15:13:20 UTC

glibc 2.23-pre got a new pthread barrier implementation:
https://sourceware.org/ml/libc-alpha/2016-01/msg00338.html
It reacts differently on bad usage of a barrier like we do in bar_bad.vgtest to see whether helgrind detects the bad usage. In particular it will hang the bar_bad testcase.

Reproducible: Always

Comment 1 Mark Wielaard 2016-01-19 15:15:15 UTC

Created attachment 96738 [details]
pthread_barrier-vs-newer-glibc-implementation

These changes make the bar_bad testcase PASS against both the old and new implementation.

Comment 2 Mark Wielaard 2016-01-20 13:01:00 UTC

The patch to the testscase only works for helgrind, but the test program is also used under drd. It needs some different changes or you will get two failures:

drd/tests/bar_bad                        (stderr)
drd/tests/bar_bad_xml                    (stderr)

Comment 3 Mark Wielaard 2016-01-21 12:10:02 UTC

The problem fixing this for drd is that DRD_(barrier_destroy) is called after pthread_barrier_destroy. But that call might now hang and then bar_bad invokes a watchdog to exit the program, so the post handler is never called.

Comment 4 Mark Wielaard 2016-01-21 12:51:28 UTC

Created attachment 96765 [details]
pthread_barrier-vs-newer-glibc-implementation-drd-helgrind

Simplest option to solve the drd issue is to do like in the helgrind case and have two variants of the exp files. One of which will have pthread_barrier_destroy hang, and so then drd will just not see that it wasn't a barrier. Which isn't ideal, but since it is the last test in the batch it seems it doesn't impact things too much.

Comment 5 Mark Wielaard 2016-01-21 14:55:20 UTC

Note that patch is missing the drd/test/Makefile.am addition of the two extra exp file:

diff --git a/drd/tests/Makefile.am b/drd/tests/Makefile.am
index 2885391..cfd74d0 100644
--- a/drd/tests/Makefile.am
+++ b/drd/tests/Makefile.am
@@ -81,8 +81,10 @@ EXTRA_DIST =                                        \
        atomic_var.stderr.exp                       \
        atomic_var.vgtest                           \
        bar_bad.stderr.exp                          \
+       bar_bad.stderr.exp-nohang                   \
        bar_bad.vgtest                              \
        bar_bad_xml.stderr.exp                      \
+       bar_bad_xml.stderr.exp-nohang               \
        bar_bad_xml.vgtest                          \
        bar_trivial.stderr.exp                      \
        bar_trivial.stdout.exp                      \

Comment 6 Mark Wielaard 2016-09-19 14:20:55 UTC

I checked in a workaround for the hang based on the attachement as valgrind svn r15962.

This does make sure that the tests don't hang indefenitely. But they do introduce (more) non-determinism that occassionally causes these tests to fail or even trigger an internal drd assert (drd_barrier.c:352 (vgDrd_barrier_pre_wait): Assertion 'p' failed.)

Comment 7 Julian Seward 2016-10-19 12:02:40 UTC

Should we close this now?

Comment 8 Mark Wielaard 2016-10-19 12:09:38 UTC

(In reply to Julian Seward from comment #7)
> Should we close this now?

No, I don't think it should. There is now a workaround in place that makes sure the test doesn't hang. But now that test (non-deterministically) fails or even crashes valgrind itself.

Comment 9 Petar Jovanovic 2016-11-09 16:24:43 UTC

Created attachment 102144 [details]
Make bar_bad test more deterministic.

At some MIPS boards, thread slp2 may end before ext1 ends and that makes the test fail again. One of my colleagues suggested a patch in which the main thread waits for slp2 termination. This makes the test more deterministic.

Comment 10 Petar Jovanovic 2016-11-15 18:20:28 UTC

Created attachment 102247 [details]
make bar_bad test more deterministic, attempt two

The previous patch will break DRD. Instead of pthread_join(), we can experiment with pthread_cancel().

Comment 11 Petar Jovanovic 2016-11-23 17:45:14 UTC

(In reply to Petar Jovanovic from comment #10)
> Created attachment 102247 [details]
> make bar_bad test more deterministic, attempt two
> 
> The previous patch will break DRD. Instead of pthread_join(), we can
> experiment with pthread_cancel().

This patch, slightly modified, was committed in r16154.

Comment 12 Philippe Waroquiers 2016-11-29 22:54:54 UTC

Was fixed in r16154, fix announced in NEWS revision 16165

Comment 13 Alexander Kanavin 2020-12-12 21:42:45 UTC

Sadly I am still seeing the sporadic failures in bar_bad/bar_bad_xml. Reported here:
https://bugs.kde.org/show_bug.cgi?id=430321