On both ppc64le and arm64 I am seeing drd/tests/pth_barrier_thr_cr fail:

--- pth_barrier_thr_cr.stderr.exp 2021-10-28 17:52:34.497619030 -0400
+++ pth_barrier_thr_cr.stderr.out 2021-10-28 18:04:55.051452953 -0400
@@ -1,4 +1,15 @@
+Thread 15:
+Number of concurrent pthread_barrier_wait() calls exceeds the barrier count: barrier 0x........
+   at 0x........: pthread_barrier_wait (drd_pthread_intercepts.c:?)
+   by 0x........: thread (pth_barrier_thr_cr.c:?)
+   by 0x........: vgDrd_thread_wrapper (drd_pthread_intercepts.c:?)
+   by 0x........: start_thread
+   by 0x........: clone (in /...libc...)
+barrier 0x........ was first observed at:
+   at 0x........: pthread_barrier_init (drd_pthread_intercepts.c:?)
+   by 0x........: main (pth_barrier_thr_cr.c:?)
+
 Done.
-ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
+ERROR SUMMARY: 39 errors from 1 contexts (suppressed: 0 from 0)

I have seen this fail in the past on x86_64 too, but cannot replicate it there now.
Can you reproduce this with --keep-unfiltered, and does the unfiltered version help at all?
I also see intermittent failures on amd64 Fedora 36.

==158787== Thread 55:
==158787== Number of concurrent pthread_barrier_wait() calls exceeds the barrier count: barrier 0x4aa8030
==158787==    at 0x4865883: pthread_barrier_wait_intercept (drd_pthread_intercepts.c:1414)
==158787==    by 0x4865883: pthread_barrier_wait@* (drd_pthread_intercepts.c:1421)
==158787==    by 0x4011F4: thread (pth_barrier_thr_cr.c:21)
==158787==    by 0x48518B4: vgDrd_thread_wrapper (drd_pthread_intercepts.c:491)
==158787==    by 0x492FDEC: start_thread (in /usr/lib64/libc.so.6)
==158787==    by 0x49B4523: clone (in /usr/lib64/libc.so.6)

I don't really see why this has changed. There were changes to libc/libpthread back in May/July 2021, when there was a big move from libpthread to libc. We're still intercepting OK. This wasn't long after landing the FreeBSD port. I can't see anything that could be wrong.

With --trace-barrier=yes I see

.==170352== [52] barrier_pre_wait pthread barrier 0x4aa8030 iteration 25
==170352== [50] barrier_post_wait pthread barrier 0x4aa8030 iteration 24
.==170352== [53] barrier_pre_wait pthread barrier 0x4aa8030 iteration 25
==170352== [53] barrier_post_wait pthread barrier 0x4aa8030 iteration 25 (serializing)
.==170352== [54] barrier_pre_wait pthread barrier 0x4aa8030 iteration 26
.==170352== [55] barrier_pre_wait pthread barrier 0x4aa8030 iteration 26
==170352== [55] barrier_post_wait pthread barrier 0x4aa8030 iteration 25 (serializing)
==170352== Thread 55:
==170352== Number of concurrent pthread_barrier_wait() calls exceeds the barrier count: barrier 0x4aa8030
==170352==    at 0x4865883: pthread_barrier_wait_intercept (drd_pthread_intercepts.c:1414)
==170352==    by 0x4865883: pthread_barrier_wait@* (drd_pthread_intercepts.c:1421)
==170352==    by 0x4011F4: thread (pth_barrier_thr_cr.c:21)
==170352==    by 0x48518B4: vgDrd_thread_wrapper (drd_pthread_intercepts.c:491)
==170352==    by 0x492FDEC: start_thread (pthread_create.c:442)
==170352==    by 0x49B4523: clone (clone.S:100)
==170352== barrier 0x4aa8030 was first observed at:
==170352==    at 0x4864615: pthread_barrier_init_intercept (drd_pthread_intercepts.c:1376)
==170352==    by 0x4864615: pthread_barrier_init@* (drd_pthread_intercepts.c:1384)
==170352==    by 0x401266: main (pth_barrier_thr_cr.c:34)
==170352==
==170352== [54] barrier_post_wait pthread barrier 0x4aa8030 iteration 26
==170352== [52] barrier_post_wait pthread barrier 0x4aa8030 iteration 26
.==170352== [56] barrier_pre_wait pthread barrier 0x4aa8030 iteration 27
.==170352== [57] barrier_pre_wait pthread barrier 0x4aa8030 iteration 27
==170352== [56] barrier_post_wait pthread barrier 0x4aa8030 iteration 27
==170352== [57] barrier_post_wait pthread barrier 0x4aa8030 iteration 27 (serializing)

The testcase creates 100 threads and a barrier with a count of 2. Every time there are 2 waiting threads, the barrier lets 2 through, one at a time. These are the 50 iterations in the traces above. There are two iteration 25s, then an error, and then an iteration 27. No iteration 26.
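For anyone who wants to check a longer barrier trace for the same pattern, here is a minimal sketch in Python. It is only a reading aid for the trace lines quoted above, not anything DRD does internally; the function name mismatched_waits and the regular expression are my own assumptions about the line format, based solely on the excerpt. It pairs each thread's barrier_pre_wait with its next barrier_post_wait and reports threads whose iteration numbers differ.

```python
import re

# Assumed line shape, taken from the excerpt above, e.g.:
#   ==170352== [52] barrier_pre_wait pthread barrier 0x4aa8030 iteration 25
TRACE_RE = re.compile(
    r"\[(?P<tid>\d+)\] barrier_(?P<kind>pre|post)_wait .*? iteration (?P<it>\d+)"
)

def mismatched_waits(trace_text):
    """Return {tid: (pre_iteration, post_iteration)} for every thread whose
    post_wait iteration differs from its most recent pre_wait iteration."""
    pre = {}
    bad = {}
    for m in TRACE_RE.finditer(trace_text):
        tid, it = int(m["tid"]), int(m["it"])
        if m["kind"] == "pre":
            pre[tid] = it
        elif tid in pre and pre[tid] != it:
            # post_wait with no recorded pre_wait (e.g. thread 50) is skipped
            bad[tid] = (pre[tid], it)
    return bad

# The pre_wait/post_wait lines from the excerpt, error report omitted.
SAMPLE = """\
.==170352== [52] barrier_pre_wait pthread barrier 0x4aa8030 iteration 25
==170352== [50] barrier_post_wait pthread barrier 0x4aa8030 iteration 24
.==170352== [53] barrier_pre_wait pthread barrier 0x4aa8030 iteration 25
==170352== [53] barrier_post_wait pthread barrier 0x4aa8030 iteration 25 (serializing)
.==170352== [54] barrier_pre_wait pthread barrier 0x4aa8030 iteration 26
.==170352== [55] barrier_pre_wait pthread barrier 0x4aa8030 iteration 26
==170352== [55] barrier_post_wait pthread barrier 0x4aa8030 iteration 25 (serializing)
==170352== [54] barrier_post_wait pthread barrier 0x4aa8030 iteration 26
==170352== [52] barrier_post_wait pthread barrier 0x4aa8030 iteration 26
.==170352== [56] barrier_pre_wait pthread barrier 0x4aa8030 iteration 27
.==170352== [57] barrier_pre_wait pthread barrier 0x4aa8030 iteration 27
==170352== [56] barrier_post_wait pthread barrier 0x4aa8030 iteration 27
==170352== [57] barrier_post_wait pthread barrier 0x4aa8030 iteration 27 (serializing)
"""

if __name__ == "__main__":
    print(mismatched_waits(SAMPLE))
    # prints {55: (26, 25), 52: (25, 26)}
```

On this excerpt the sketch flags threads 52 and 55: 52 entered at iteration 25 and left at 26, while 55 entered at 26 and left at 25, which is the thread the error is reported against.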
Commit 72b556ab15f1 ("drd: Improve barrier support") should fix this bug.