Created attachment 143555 [details]
Testcase

$ g++ -pthread -Wall -o valgrind_test valgrind_test.cpp && valgrind --tool=drd ./valgrind_test
==867712== drd, a thread error detector
==867712== Copyright (C) 2006-2017, and GNU GPL'd, by Bart Van Assche.
==867712== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==867712== Command: ./valgrind_test
==867712==
Other thread: waiting for notify
==867712== The impossible happened: mutex is locked simultaneously by two threads: mutex 0x1fff000080, recursion count 1, owner 2.
==867712==    at 0x48428B4: pthread_mutex_lock (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_drd-amd64-linux.so)
==867712==    by 0x10B917: __gthread_mutex_lock(pthread_mutex_t*) (in /home/nikratio/consulting/valve/steamfs/valgrind_test)
==867712==    by 0x10BB5F: std::mutex::lock() (in <redacted>/valgrind_test)
==867712==    by 0x10C633: std::lock_guard<std::mutex>::lock_guard(std::mutex&) (in <redacted>/valgrind_test)
For helgrind, we get a similar error:

==868848== Thread #1 unlocked lock at 0x1FFF000070 currently held by thread #2
==868848==    at 0x483D156: ??? (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_helgrind-amd64-linux.so)
==868848==    by 0x10B946: __gthread_mutex_unlock(pthread_mutex_t*) (in <redacted>/valgrind_test)
==868848==    by 0x10BB8D: std::mutex::unlock() (in <redacted>/valgrind_test)
==868848==    by 0x10C652: std::lock_guard<std::mutex>::~lock_guard() (in <redacted>/valgrind_test)
With Valgrind 3.18.1, the error becomes

==872160== Helgrind, a thread error detector
==872160== Copyright (C) 2007-2017, and GNU GPL'd, by OpenWorks LLP et al.
==872160== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==872160== Command: ./valgrind_test
==872160==
[...]
==872160== Thread #1 unlocked lock at 0x1FFF000090 currently held by thread #2
==872160==    at 0x4841D06: ??? (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==872160==    by 0x10B946: __gthread_mutex_unlock(pthread_mutex_t*) (in <redacted>/valgrind_test)
==872160==    by 0x10BB8D: std::mutex::unlock() (in <redacted>/valgrind_test)
==872160==    by 0x10C652: std::lock_guard<std::mutex>::~lock_guard() (in <redacted>/valgrind_test)

and

==872482== drd, a thread error detector
==872482== Copyright (C) 2006-2020, and GNU GPL'd, by Bart Van Assche.
==872482== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==872482== Command: ./valgrind_test
==872482==
Other thread: waiting for notify
==872482== The impossible happened: mutex is locked simultaneously by two threads: mutex 0x1fff000090, recursion count 1, owner 2.
==872482==    at 0x484B478: pthread_mutex_lock (in /usr/libexec/valgrind/vgpreload_drd-amd64-linux.so)
==872482==    by 0x10B917: __gthread_mutex_lock(pthread_mutex_t*) (in <redacted>/valgrind_test)
==872482==    by 0x10BB5F: std::mutex::lock() (in <redacted>/valgrind_test)
==872482==    by 0x10C633: std::lock_guard<std::mutex>::lock_guard(std::mutex&) (in <redacted>/valgrind_test)
I'm not sure if it matters much, but which version of GCC is this with?
I was able to reproduce with:

> clang++ --version
Debian clang version 11.0.1-2
Target: x86_64-pc-linux-gnu
Thread model: posix

and

> g++ --version
g++ (Debian 10.2.1-6) 10.2.1 20210110

and

> ~/tmp/clang+llvm-13.0.0-x86_64-linux-gnu-ubuntu-20.04/bin/clang++ --version
clang version 13.0.0 (https://github.com/llvm/llvm-project/ 24c8eaec9467b2aaf70b0db33a4e4dd415139a50)
Target: x86_64-unknown-linux-gnu
Thread model: posix
I'm trying to understand how this is working.

Main thread:
  Create 'state' on the stack, initializing the mutex and CV
  Create a thread, other_thread, via std::async
  Wait for 1 second
  Lock mutex
  2 vector push_backs
  Unlock mutex
  CV notify
  Cleanup

other_thread:
  Lock mutex
  Wait for 3 seconds on the CV, which also unlocks the mutex
  CV notified
  Break out and fall off the end of the thread function

I don't get the same errors (GCC 9.3 on RHEL7).

With DRD (I've trimmed loads of stack):

==494== Command: ./valgrind_test
==494==
Other thread: waiting for notify
==494== Probably a race condition: condition variable 0x1ffeffeb00 has been signaled but the associated mutex 0x1ffeffeac0 is not locked by the signalling thread.
   by 0x4027CE: main (valgrind_test.cpp:53)
cond 0x1ffeffeaf0 was first observed at:
   by 0x4025A7: other_thread(state*) (valgrind_test.cpp:22)
mutex 0x1ffeffeab0 was first observed at:
   by 0x402572: other_thread(state*) (valgrind_test.cpp:20)

With Helgrind:

==677== Thread #1: pthread_cond_{signal,broadcast}: dubious: associated lock is not held by any thread
==677==    at 0x4036146: pthread_cond_broadcast_WRK (hg_intercepts.c:1501)
==677==    by 0x4AEA0E8: __gthread_cond_broadcast (gthr-default.h:853)
==677==    by 0x4AEA0E8: std::condition_variable::notify_all() (condition_variable.cc:73)
==677==    by 0x402778: main (valgrind_test.cpp:48)

If I comment out the braces around the lock_guard and push_backs in main, I get no error.
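For reference, here is a minimal sketch of what the attached testcase presumably looks like, reconstructed from the description and stack traces above. The names 'state' and 'other_thread' come from the traces; everything else is an assumption about the attachment, not its actual contents.

#include <chrono>
#include <condition_variable>
#include <future>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

struct state {
    std::mutex mutex;
    std::condition_variable cv;
    std::vector<int> data;
};

static void other_thread(state* s) {
    std::unique_lock<std::mutex> lock(s->mutex);
    std::cout << "Other thread: waiting for notify" << std::endl;
    // On toolchains where libstdc++ uses pthread_cond_clockwait
    // (glibc >= 2.30), this wait is the call that goes uninstrumented.
    s->cv.wait_for(lock, std::chrono::seconds(3));
}

int main() {
    state s;
    auto fut = std::async(std::launch::async, other_thread, &s);
    std::this_thread::sleep_for(std::chrono::seconds(1));
    {
        std::lock_guard<std::mutex> guard(s.mutex);
        s.data.push_back(1);
        s.data.push_back(2);
    }
    s.cv.notify_all();
    fut.wait();
    return 0;
}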
I will try tonight with various combinations of clang, GCC, Fedora and FreeBSD.
At lunch time even.

Reproduced with Fedora 34 and g++ 11.2.1.

If I run the executable through ltrace I see

paulf> /usr/bin/ltrace ./valgrind_test 2>&1 | c++filt | grep pthread
pthread_mutex_lock(0x7ffd797174b0, 0x7ffd797174b0, 0x4022d0, 0x4043a0) = 0
pthread_mutex_unlock(0x7ffd797174b0, 0x7ffd797173f8, 0x4022d0, 5582) = 0

Seems strange to me - I'm expecting 2 locks and 2 unlocks.

and

paulf> nm -C valgrind_test | grep pthread
                 U pthread_cond_clockwait@GLIBC_2.30
                 U pthread_create@GLIBC_2.2.5
                 w __pthread_key_create@GLIBC_2.2.5
                 w pthread_mutex_lock@GLIBC_2.2.5
                 w pthread_mutex_unlock@GLIBC_2.2.5
                 w pthread_once@GLIBC_2.2.5
00000000004029a1 t __gthread_mutex_lock(pthread_mutex_t*)
00000000004029d0 t __gthread_mutex_unlock(pthread_mutex_t*)

I'm a bit suspicious of those weak symbols and __gthread wrappers. Next thing is to check when these errors are not present.
g++ 9.3 gives

nm -C valgrind_test | grep pthread
00000000004029b2 t __gthread_mutex_lock(pthread_mutex_t*)
00000000004029e1 t __gthread_mutex_unlock(pthread_mutex_t*)
0000000000402a10 t __gthread_cond_timedwait(pthread_cond_t*, pthread_mutex_t*, timespec const*)
                 w __pthread_key_create@@GLIBC_2.2.5
                 w pthread_cond_timedwait@@GLIBC_2.3.2
                 U pthread_create@@GLIBC_2.2.5
                 w pthread_mutex_lock@@GLIBC_2.2.5
                 w pthread_mutex_unlock@@GLIBC_2.2.5
                 w pthread_once@@GLIBC_2.2.5
Looks like the difference is pthread_cond_timedwait vs pthread_cond_clockwait (the latter is not intercepted by DRD/Helgrind). If this is the case then adding an intercept should be fairly simple.
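If that is right, the trigger is purely which wait overload the program ends up in. A hedged sketch (assuming a libstdc++ built against glibc >= 2.30 with pthread_cond_clockwait support; the exact lowering depends on how libstdc++ was configured):

#include <chrono>
#include <condition_variable>
#include <mutex>

int main() {
    std::mutex m;
    std::condition_variable cv;
    std::unique_lock<std::mutex> lk(m);

    // Relative and steady_clock waits can lower to pthread_cond_clockwait
    // (CLOCK_MONOTONIC), which DRD/Helgrind did not intercept:
    cv.wait_for(lk, std::chrono::milliseconds(1));

    // A system_clock deadline still goes through pthread_cond_timedwait,
    // which is intercepted, so this variant should not reproduce the bug:
    cv.wait_until(lk, std::chrono::system_clock::now() + std::chrono::milliseconds(1));
    return 0;
}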
Created attachment 143599 [details]
patch for pthread_cond_clockwait helgrind/DRD intercepts

This seems to resolve the problems seen in this testcase. However, I see that there are several 'clock' pthread functions on the way, so it may be better to implement them all in one go. See
https://sourceware.org/git/?p=glibc.git;a=log;h=65dd7e9ce36ca8485a68ed41ac731188b6473dd2
Thanks! I have compiled with the patch and it fixes the issues for me as well. However, I am really confused as to why this works. Helgrind and DRD were complaining about mutexes being released by the wrong thread, and mutexes being locked twice. How can this be caused by Valgrind not being aware of pthread_cond_clockwait()? I would understand if the warnings were about data races - but this?
If I understand correctly, pthread_cond_timedwait (and its clock variant) takes a locked mutex, unlocks it while waiting, and re-locks it before returning. So by failing to intercept pthread_cond_clockwait, DRD and Helgrind were not only missing a CV wait but also a mutex unlock and re-lock.
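In direct pthread terms (a minimal sketch; it needs glibc >= 2.30 for pthread_cond_clockwait, and the usual predicate loop is omitted for brevity):

#include <pthread.h>
#include <time.h>

static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

void waiter() {
    struct timespec deadline;
    clock_gettime(CLOCK_MONOTONIC, &deadline);
    deadline.tv_sec += 3;

    pthread_mutex_lock(&mu);   // the tool records: this thread owns mu
    // pthread_cond_clockwait atomically releases mu while blocked and
    // re-acquires it before returning. With no intercept, DRD/Helgrind
    // never see that unlock/relock, so another thread locking mu looks
    // like "locked simultaneously by two threads", and the subsequent
    // unlock looks like it comes from a thread that doesn't hold the lock.
    pthread_cond_clockwait(&cv, &mu, CLOCK_MONOTONIC, &deadline);
    pthread_mutex_unlock(&mu);
}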
My plan is to add this patch plus the testcase (which Bart has put into vgtest format). I need to do a little bit of work to get the Valgrind configure script to test for the C++ features used in the testcase. After that I'll look at the other 'clock' pthread APIs.

A quick grep of libstdc++ shows the current usage:

include/bits/std_mutex.h:  pthread_cond_clockwait(&_M_cond, __m.native_handle(), __clock,
include/std/mutex:         { return !pthread_mutex_clocklock(&_M_mutex, clockid, &__ts); }
include/std/mutex:         { return !pthread_mutex_clocklock(&_M_mutex, clockid, &__ts); }
include/std/shared_mutex:  int __ret = pthread_rwlock_clockwrlock(&_M_rwlock, CLOCK_MONOTONIC,
include/std/shared_mutex:  int __ret = pthread_rwlock_clockrdlock(&_M_rwlock, CLOCK_MONOTONIC,

That's 3 more functions: pthread_mutex_clocklock, pthread_rwlock_clockwrlock and pthread_rwlock_clockrdlock.
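For reference, a hedged sketch of C++ constructs that can reach those three functions (whether they actually do depends on the glibc/libstdc++ combination in use):

#include <chrono>
#include <mutex>
#include <shared_mutex>

int main() {
    using namespace std::chrono_literals;

    std::timed_mutex tm;
    if (tm.try_lock_for(10ms))           // may lower to pthread_mutex_clocklock
        tm.unlock();

    std::shared_timed_mutex sm;
    if (sm.try_lock_for(10ms))           // may lower to pthread_rwlock_clockwrlock
        sm.unlock();
    if (sm.try_lock_shared_for(10ms))    // may lower to pthread_rwlock_clockrdlock
        sm.unlock_shared();
    return 0;
}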
Fixed with commit 9abfed23c0d430aafb85de6397d171316c982792.

I'll look into the other clock APIs and more regtests this weekend.