Bug 392331

Summary: Spurious lock not held error from inside pthread_cond_timedwait
Product: [Developer tools] valgrind Reporter: faminebadger
Component: helgrindAssignee: Julian Seward <jseward>
Status: RESOLVED FIXED    
Severity: normal CC: drahflow, philippe.waroquiers, pjfloyd
Priority: NOR    
Version First Reported In: 3.13.0   
Target Milestone: ---   
Platform: Gentoo Packages   
OS: Linux   
Latest Commit: Version Fixed/Implemented In:
Sentry Crash Report:

Description faminebadger 2018-03-25 19:32:30 UTC
I'm getting the following error from my program:

 ==18784== Thread #4: pthread_cond_{signal,broadcast}: dubious: associated lock is not held by any thread
 ==18784==    at 0x4C335B1: pthread_cond_signal_WRK (hg_intercepts.c:1384)
 ==18784==    by 0x5CFBC3D: pthread_cond_timedwait@@GLIBC_2.3.2 (in /lib64/libpthread-2.26.so)
 ==18784==    by 0x4C36F54: pthread_cond_timedwait_WRK.constprop.1 (hg_intercepts.c:1295)
 ==18784==    by 0x110F6C: semaphore_wait (semaphores.c:113)
 ==18784==    by 0x10C983: exchange_thread (exchange_thread.c:337)
 ==18784==    by 0x4C316A6: mythread_wrapper (hg_intercepts.c:389)
 ==18784==    by 0x5CF48E9: start_thread (in /lib64/libpthread-2.26.so)
 ==18784==    by 0x600F61E: clone (in /lib64/libc-2.26.so)

However, clearly my semaphore_wait function is locking the mutex:

 111:        pthread_mutex_lock( &sem->mx );
 112:        if( !sem->data )
 113:            pthread_cond_wait( &sem->cd, &sem->mx );

Note, I've verified pthread_mutex_lock is returning success (which was the first thought I had).

The second thing I noticed is that this is a warning about pthread_cond_*SIGNAL* - which is the opposite side of the condition - I'm doing a wait here.

It therefore looks like glibc is internally calling signal from wait, and valgrind is reporting this as an error.
Comment 1 Philippe Waroquiers 2018-03-27 22:00:08 UTC
At least simple usage of pthread_cond_wait and lock is verified
by the regression tests.

Do you have a small compilable reproducer ?
Comment 2 faminebadger 2018-03-29 20:12:45 UTC
> Do you have a small compilable reproducer ?

I tried to make a cut-down version, but then it didn't reproduce.  Such is the nature of debugging races I guess.

However, it's quite clear from the glibc source code that pthread_cond_wait_common (the common implementation for both pthread_cond_wait and pthread_cond_timedwait) can call pthread_cond_signal after releasing the lock by calling __condvar_cancel_waiting, which contains:

>  __condvar_release_lock (cond, private);
>
>  if (consumed_signal)
>    {
>      /* We effectively consumed a signal even though we didn't want to.
>	 Therefore, we need to send a replacement signal.
>	 If we would want to optimize this, we could do what
>	 pthread_cond_signal does right in the critical section above.  */
>      __pthread_cond_signal (cond);
>    }
>}

It even implicitly says it's outside of the critical section.

This suggests some interaction between waiting and cancelling that wait via a signal or interrupt.

Either way though, I don't think this should be flagged as an error by valgrind, given it's purely internal to glibc, and is presumably working as intended.
Comment 3 Jens-W. Schicke-Uffmann 2022-03-27 22:11:42 UTC
I have a reproducer for this warning, which also suggests is spurious (given the source of the example):

Taking the example from https://en.cppreference.com/w/cpp/thread/condition_variable:

#include <iostream>
#include <string>
#include <thread>
#include <mutex>
#include <condition_variable>
 
std::mutex m;
std::condition_variable cv;
std::string data;
bool ready = false;
bool processed = false;
 
void worker_thread()
{
    // Wait until main() sends data
    std::unique_lock lk(m);
    cv.wait(lk, []{return ready;});
 
    // after the wait, we own the lock.
    std::cout << "Worker thread is processing data\n";
    data += " after processing";
 
    // Send data back to main()
    processed = true;
    std::cout << "Worker thread signals data processing completed\n";
 
    // Manual unlocking is done before notifying, to avoid waking up
    // the waiting thread only to block again (see notify_one for details)
    lk.unlock();
    cv.notify_one();
}
 
int main()
{
    std::thread worker(worker_thread);
 
    data = "Example data";
    // send data to the worker thread
    {
        std::lock_guard lk(m);
        ready = true;
        std::cout << "main() signals data ready for processing\n";
    }
    cv.notify_one();
 
    // wait for the worker
    {
        std::unique_lock lk(m);
        cv.wait(lk, []{return processed;});
    }
    std::cout << "Back in main(), data = " << data << '\n';
 
    worker.join();
}

... compiling and running via
% g++-11 -O4 -ggdb test.c++ -lpthread
% valgrind --error-exitcode=1 --exit-on-first-error=yes --tool=helgrind ./a.out
... gives:
[...]
==30543== ---Thread-Announcement------------------------------------------
==30543== 
==30543== Thread #1 is the program's root thread
==30543== 
==30543== ----------------------------------------------------------------
==30543== 
==30543== Thread #1: pthread_cond_{signal,broadcast}: dubious: associated lock is not held by any thread
==30543==    at 0x4847EF6: ??? (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==30543==    by 0x10A241: main (test2.c++:44)
Comment 4 Paul Floyd 2022-03-29 10:16:11 UTC
This is rather annoying, especially as it is a bog standard cppreference example and the sort of thing that is likely to occur in any C++11 or later threaded code.
Comment 5 Paul Floyd 2022-04-28 14:04:54 UTC
This message is generated from

static void evh__HG_PTHREAD_COND_SIGNAL_PRE ( ThreadId tid, void* cond )

Which looks like the wrapper function called before a pthread_cond_signal

And there is the comment

   // Hmm.  POSIX doesn't actually say that it's an error to call 
   // pthread_cond_signal with the associated mutex being unlocked.
   // Although it does say that it should be "if consistent scheduling
   // is desired."  For that reason, print "dubious" if the lock isn't
   // held by any thread.  Skip the "dubious" if it is held by some
   // other thread; that sounds straight-out wrong.

And the corresponding doc.

https://pubs.opengroup.org/onlinepubs/9699919799/

I'm not sure that we want to change this. I see two other possibilities
- user suppression
- add a command line option, something like --check_cond_signal_mutex
Comment 6 Paul Floyd 2022-04-28 14:09:11 UTC
Also TSan does not compain about this

 g++ -g -o 392331 392331.cpp -pthread -fsanitize=thread
./392331
main() signals data ready for processing
Worker thread is processing data
Worker thread signals data processing completed
Back in main(), data = Example data after processing
Comment 7 Jens-W. Schicke-Uffmann 2022-04-29 20:58:27 UTC
Regarding user suppression: It would already be very helpful if it could get its own category and would not be reported as Helgrind:Misc

Currently, I'm having a suppression which is
{
  <this bug URL>
  Helgrind:Misc
  ...
  fun:main
}
(no chance in being more specific due to a template function being inlined in a lot of places).

So any other Helgrind:Misc problem are now silenced as well.
Comment 8 Paul Floyd 2022-12-25 20:52:22 UTC
The commit below makes a separate category for this kind of error "Dubious". Please reopen if you also need a command line option.

commit 7d0389956e5e6ff182fdf4c2a3a9f8a202a927d4 (HEAD -> master, origin/master, origin/HEAD)
Author: Paul Floyd <pjfloyd@wanadoo.fr>
Date:   Sun Dec 25 21:43:36 2022 +0100

    Bug 392331  - Spurious lock not held error from inside pthread_cond_timedwait
    
    Added a "Dubious" error category to cover this kind of error.