Bug 445504 - Using C++ condition_variable results in bogus "mutex is locked simultaneously by two threads" warning
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: drd
Version: unspecified
Platform: Debian stable Linux
Priority: NOR  Severity: normal
Target Milestone: ---
Assignee: Bart Van Assche
Reported: 2021-11-14 21:30 UTC by Nikolaus
Modified: 2021-11-19 07:59 UTC
CC: 1 user



Attachments
Testcase (1.07 KB, text/x-c++src)
2021-11-14 21:30 UTC, Nikolaus
Details
patch for pthread_cond_clockwait helgrind/DRD intercepts (4.82 KB, patch)
2021-11-15 20:59 UTC, Paul Floyd
Details

Description Nikolaus 2021-11-14 21:30:08 UTC
Created attachment 143555 [details]
Testcase

$ g++ -pthread -Wall -o valgrind_test valgrind_test.cpp && valgrind --tool=drd  ./valgrind_test
==867712== drd, a thread error detector
==867712== Copyright (C) 2006-2017, and GNU GPL'd, by Bart Van Assche.
==867712== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==867712== Command: ./valgrind_test
==867712== 
Other thread: waiting for notify
==867712== The impossible happened: mutex is locked simultaneously by two threads: mutex 0x1fff000080, recursion count 1, owner 2.
==867712==    at 0x48428B4: pthread_mutex_lock (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_drd-amd64-linux.so)
==867712==    by 0x10B917: __gthread_mutex_lock(pthread_mutex_t*) (in /home/nikratio/consulting/valve/steamfs/valgrind_test)
==867712==    by 0x10BB5F: std::mutex::lock() (in <redacted>/valgrind_test)
==867712==    by 0x10C633: std::lock_guard<std::mutex>::lock_guard(std::mutex&) (in <redacted>/valgrind_test)
Comment 1 Nikolaus 2021-11-14 21:31:07 UTC
For helgrind, we get a similar error:

==868848== Thread #1 unlocked lock at 0x1FFF000070 currently held by thread #2
==868848==    at 0x483D156: ??? (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_helgrind-amd64-linux.so)
==868848==    by 0x10B946: __gthread_mutex_unlock(pthread_mutex_t*) (in <redacted>/valgrind_test)
==868848==    by 0x10BB8D: std::mutex::unlock() (in <redacted>/valgrind_test)
==868848==    by 0x10C652: std::lock_guard<std::mutex>::~lock_guard() (in <redacted>/valgrind_test)
Comment 2 Nikolaus 2021-11-14 21:34:58 UTC
With Valgrind 3.18.1, the error becomes

==872160== Helgrind, a thread error detector
==872160== Copyright (C) 2007-2017, and GNU GPL'd, by OpenWorks LLP et al.
==872160== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==872160== Command: ./valgrind_test
==872160== 
[...]
==872160== Thread #1 unlocked lock at 0x1FFF000090 currently held by thread #2
==872160==    at 0x4841D06: ??? (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==872160==    by 0x10B946: __gthread_mutex_unlock(pthread_mutex_t*) (in <redacted>/valgrind_test)
==872160==    by 0x10BB8D: std::mutex::unlock() (in <redacted>/valgrind_test)
==872160==    by 0x10C652: std::lock_guard<std::mutex>::~lock_guard() (in <redacted>/valgrind_test)


and

==872482== drd, a thread error detector
==872482== Copyright (C) 2006-2020, and GNU GPL'd, by Bart Van Assche.
==872482== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==872482== Command: ./valgrind_test
==872482== 
Other thread: waiting for notify
==872482== The impossible happened: mutex is locked simultaneously by two threads: mutex 0x1fff000090, recursion count 1, owner 2.
==872482==    at 0x484B478: pthread_mutex_lock (in /usr/libexec/valgrind/vgpreload_drd-amd64-linux.so)
==872482==    by 0x10B917: __gthread_mutex_lock(pthread_mutex_t*) (in <redacted>/valgrind_test)
==872482==    by 0x10BB5F: std::mutex::lock() (in <redacted>/valgrind_test)
==872482==    by 0x10C633: std::lock_guard<std::mutex>::lock_guard(std::mutex&) (in <redacted>/valgrind_test)
Comment 3 Paul Floyd 2021-11-15 09:14:22 UTC
I'm not sure if it matters much, but which version of GCC is this with?
Comment 4 Nikolaus 2021-11-15 10:02:45 UTC
I was able to reproduce with:

> clang++ --version
Debian clang version 11.0.1-2
Target: x86_64-pc-linux-gnu
Thread model: posix

and

> g++ --version
g++ (Debian 10.2.1-6) 10.2.1 20210110

and

> ~/tmp/clang+llvm-13.0.0-x86_64-linux-gnu-ubuntu-20.04/bin/clang++ --version
clang version 13.0.0 (https://github.com/llvm/llvm-project/ 24c8eaec9467b2aaf70b0db33a4e4dd415139a50)
Target: x86_64-unknown-linux-gnu
Thread model: posix
Comment 5 Paul Floyd 2021-11-15 10:29:54 UTC
I'm trying to understand how this is working.

Main thread:
  Create 'state' on the stack, initialize mutex and CV
  Create a thread, other_thread, via std::async
  Wait for 1 second

other_thread:
  Lock mutex
  Wait for 3 seconds on the CV, which also unlocks the mutex

Main thread:
  Lock mutex
  2 vector push_backs
  Unlock mutex
  CV notify

other_thread:
  CV notified
  Break and fall off the end of the thread function

Main thread:
  Cleanup

I don't get the same errors (GCC 9.3 on RHEL7)

With DRD (I've trimmed loads of stack):
==494== Command: ./valgrind_test                                                                                                                            
==494==                                                                                                                                                     
Other thread: waiting for notify                                                                                                                            
==494== Probably a race condition: condition variable 0x1ffeffeb00 has been signaled but the associated mutex 0x1ffeffeac0 is not locked by the signalling thread.     
    by 0x4027CE: main (valgrind_test.cpp:53)    
 cond 0x1ffeffeaf0 was first observed at:    
    by 0x4025A7: other_thread(state*) (valgrind_test.cpp:22)     
mutex 0x1ffeffeab0 was first observed at:
    by 0x402572: other_thread(state*) (valgrind_test.cpp:20)



With Helgrind:
==677== Thread #1: pthread_cond_{signal,broadcast}: dubious: associated lock is not held by any thread
==677==    at 0x4036146: pthread_cond_broadcast_WRK (hg_intercepts.c:1501)
==677==    by 0x4AEA0E8: __gthread_cond_broadcast (gthr-default.h:853)
==677==    by 0x4AEA0E8: std::condition_variable::notify_all() (condition_variable.cc:73)
==677==    by 0x402778: main (valgrind_test.cpp:48)

If I remove the braces around the lock_guard and push_backs in main, then I get no error.
Comment 6 Paul Floyd 2021-11-15 10:31:01 UTC
I will try tonight with various combinations of clang, GCC, Fedora and FreeBSD.
Comment 7 Paul Floyd 2021-11-15 12:46:28 UTC
At lunch time even. Reproduced with Fedora 34 and g++ 11.2.1

If I run the executable through ltrace I see

paulf> /usr/bin/ltrace ./valgrind_test 2>&1 | c++filt | grep pthread
pthread_mutex_lock(0x7ffd797174b0, 0x7ffd797174b0, 0x4022d0, 0x4043a0) = 0
pthread_mutex_unlock(0x7ffd797174b0, 0x7ffd797173f8, 0x4022d0, 5582) = 0

Seems strange to me - I'm expecting 2 locks and 2 unlocks.

and

paulf> nm -C valgrind_test | grep pthread
                 U pthread_cond_clockwait@GLIBC_2.30
                 U pthread_create@GLIBC_2.2.5
                 w __pthread_key_create@GLIBC_2.2.5
                 w pthread_mutex_lock@GLIBC_2.2.5
                 w pthread_mutex_unlock@GLIBC_2.2.5
                 w pthread_once@GLIBC_2.2.5
00000000004029a1 t __gthread_mutex_lock(pthread_mutex_t*)
00000000004029d0 t __gthread_mutex_unlock(pthread_mutex_t*)

I'm a bit suspicious of those weak symbols and _gthread wrappers. Next thing is to check when these errors are not present.
Comment 8 Paul Floyd 2021-11-15 13:50:49 UTC
g++ 9.3 gives

nm -C valgrind_test | grep pthread
00000000004029b2 t __gthread_mutex_lock(pthread_mutex_t*)
00000000004029e1 t __gthread_mutex_unlock(pthread_mutex_t*)
0000000000402a10 t __gthread_cond_timedwait(pthread_cond_t*, pthread_mutex_t*, timespec const*)
                 w __pthread_key_create@@GLIBC_2.2.5
                 w pthread_cond_timedwait@@GLIBC_2.3.2
                 U pthread_create@@GLIBC_2.2.5
                 w pthread_mutex_lock@@GLIBC_2.2.5
                 w pthread_mutex_unlock@@GLIBC_2.2.5
                 w pthread_once@@GLIBC_2.2.5
Comment 9 Paul Floyd 2021-11-15 14:03:38 UTC
Looks like the difference is pthread_cond_timedwait vs pthread_cond_clockwait (the latter is not intercepted by DRD/Helgrind).

If this is the case then adding an intercept should be fairly simple.
Comment 10 Paul Floyd 2021-11-15 20:59:02 UTC
Created attachment 143599 [details]
patch for pthread_cond_clockwait helgrind/DRD intercepts

This seems to resolve the problems seen in this testcase.

However, I see that there are several 'clock' pthread functions on the way, so it may be better to implement them all in one go.

See

https://sourceware.org/git/?p=glibc.git;a=log;h=65dd7e9ce36ca8485a68ed41ac731188b6473dd2
Comment 11 Nikolaus 2021-11-16 13:51:50 UTC
Thanks! I have compiled with the patch and it fixes the issues for me as well.

However, I am really confused as to why this works. Helgrind and DRD were complaining about mutexes being released by the wrong thread, and mutexes being locked twice. How can this be caused by Valgrind not being aware of pthread_cond_timedwait()? I would understand if the warnings were about data races - but this?
Comment 12 Paul Floyd 2021-11-16 13:59:44 UTC
If I understand correctly, pthread_cond_timedwait (and the clock variant) takes a locked mutex and unlocks it while waiting. So by failing to intercept pthread_cond_clockwait, DRD and Helgrind were not only missing a CV wait but also a mutex unlock.
Comment 13 Paul Floyd 2021-11-18 09:46:03 UTC
My plan is to add this patch plus the testcase (which Bart has put into vgtest format). I need to do a little bit of work to get the Valgrind configure script to test for the C++ features used in the testcase.

After that I'll look at the other 'clock' pthread APIs. A quick grep of libstdc++ shows the current usage:

include/bits/std_mutex.h:      pthread_cond_clockwait(&_M_cond, __m.native_handle(), __clock,
include/std/mutex:      { return !pthread_mutex_clocklock(&_M_mutex, clockid, &__ts); }
include/std/mutex:      { return !pthread_mutex_clocklock(&_M_mutex, clockid, &__ts); }
include/std/shared_mutex:       int __ret = pthread_rwlock_clockwrlock(&_M_rwlock, CLOCK_MONOTONIC,
include/std/shared_mutex:       int __ret = pthread_rwlock_clockrdlock(&_M_rwlock, CLOCK_MONOTONIC,

That's 3 more functions:
pthread_mutex_clocklock, pthread_rwlock_clockwrlock, pthread_rwlock_clockrdlock
Comment 14 Paul Floyd 2021-11-19 07:58:21 UTC
Fixed with
commit 9abfed23c0d430aafb85de6397d171316c982792

I'll look into the other clock APIs and more regtests this weekend.