Created attachment 97908 [details]
Helgrind bug reproducer

I ran into what I think might be a false data race detection in helgrind with
one of our products. A reproducer is attached. Helgrind says the following
when the reproducer is run on it:

==27043== Helgrind, a thread error detector
==27043== Copyright (C) 2007-2013, and GNU GPL'd, by OpenWorks LLP et al.
==27043== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==27043== Command: ./helgrind_bug_reproducer
==27043== Parent PID: 13155
==27043==
==27043== ---Thread-Announcement------------------------------------------
==27043==
==27043== Thread #6 was created
==27043==    at 0x566F43E: clone (clone.S:74)
==27043==    by 0x535E199: do_clone.constprop.3 (createthread.c:75)
==27043==    by 0x535F8BA: create_thread (createthread.c:245)
==27043==    by 0x535F8BA: pthread_create@@GLIBC_2.2.5 (pthread_create.c:611)
==27043==    by 0x4C30E0D: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==27043==    by 0x4012B8: main (helgrind_bug_reproducer.cpp:150)
==27043==
==27043== ---Thread-Announcement------------------------------------------
==27043==
==27043== Thread #10 was created
==27043==    at 0x566F43E: clone (clone.S:74)
==27043==    by 0x535E199: do_clone.constprop.3 (createthread.c:75)
==27043==    by 0x535F8BA: create_thread (createthread.c:245)
==27043==    by 0x535F8BA: pthread_create@@GLIBC_2.2.5 (pthread_create.c:611)
==27043==    by 0x4C30E0D: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==27043==    by 0x4012B8: main (helgrind_bug_reproducer.cpp:150)
==27043==
==27043== ---Thread-Announcement------------------------------------------
==27043==
==27043== Thread #4 was created
==27043==    at 0x566F43E: clone (clone.S:74)
==27043==    by 0x535E199: do_clone.constprop.3 (createthread.c:75)
==27043==    by 0x535F8BA: create_thread (createthread.c:245)
==27043==    by 0x535F8BA: pthread_create@@GLIBC_2.2.5 (pthread_create.c:611)
==27043==    by 0x4C30E0D: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==27043==    by 0x4012B8: main (helgrind_bug_reproducer.cpp:150)
==27043==
==27043== ----------------------------------------------------------------
==27043==
==27043== Lock at 0x6442210 was first observed
==27043==    at 0x4C321AA: pthread_mutex_init (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==27043==    by 0x401881: A::B::B(bool, unsigned long) (helgrind_bug_reproducer.cpp:97)
==27043==    by 0x401542: A::method1(int) (helgrind_bug_reproducer.cpp:23)
==27043==    by 0x40121A: thread(void*) (helgrind_bug_reproducer.cpp:137)
==27043==    by 0x4C30FA6: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==27043==    by 0x535F181: start_thread (pthread_create.c:312)
==27043==    by 0x566F47C: clone (clone.S:111)
==27043==  Address 0x6442210 is 0 bytes inside a block of size 40 alloc'd
==27043==    at 0x4C2C500: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==27043==    by 0x401852: A::B::B(bool, unsigned long) (helgrind_bug_reproducer.cpp:94)
==27043==    by 0x401542: A::method1(int) (helgrind_bug_reproducer.cpp:23)
==27043==    by 0x40121A: thread(void*) (helgrind_bug_reproducer.cpp:137)
==27043==    by 0x4C30FA6: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==27043==    by 0x535F181: start_thread (pthread_create.c:312)
==27043==    by 0x566F47C: clone (clone.S:111)
==27043==  Block was alloc'd by thread #4
==27043==
==27043== Lock at 0x605190 was first observed
==27043==    at 0x4C2F9F5: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==27043==    by 0x401780: A::A() (helgrind_bug_reproducer.cpp:74)
==27043==    by 0x40131E: __static_initialization_and_destruction_0(int, int) (helgrind_bug_reproducer.cpp:123)
==27043==    by 0x401347: _GLOBAL__sub_I_a (helgrind_bug_reproducer.cpp:154)
==27043==    by 0x402D0C: __libc_csu_init (in /home/ari/work/TCSG/helgrind_bug_reproducer)
==27043==    by 0x5596E54: (below main) (libc-start.c:246)
==27043==  Address 0x605190 is 48 bytes inside data symbol "a"
==27043==
==27043== Possible data race during write of size 8 at 0x64421B8 by thread #6
==27043== Locks held: 1, at address 0x6442210
==27043==    at 0x4015D4: A::method1(int) (helgrind_bug_reproducer.cpp:36)
==27043==    by 0x40121A: thread(void*) (helgrind_bug_reproducer.cpp:137)
==27043==    by 0x4C30FA6: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==27043==    by 0x535F181: start_thread (pthread_create.c:312)
==27043==    by 0x566F47C: clone (clone.S:111)
==27043==
==27043== This conflicts with a previous read of size 8 by thread #10
==27043== Locks held: 2, at address 0x605190 (and 1 that can't be shown)
==27043==    at 0x4016CA: A::method2(int) (helgrind_bug_reproducer.cpp:61)
==27043==    by 0x401248: thread(void*) (helgrind_bug_reproducer.cpp:139)
==27043==    by 0x4C30FA6: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==27043==    by 0x535F181: start_thread (pthread_create.c:312)
==27043==    by 0x566F47C: clone (clone.S:111)
==27043==  Address 0x64421b8 is 8 bytes inside a block of size 32 alloc'd
==27043==    at 0x4C2C500: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==27043==    by 0x40152D: A::method1(int) (helgrind_bug_reproducer.cpp:23)
==27043==    by 0x40121A: thread(void*) (helgrind_bug_reproducer.cpp:137)
==27043==    by 0x4C30FA6: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==27043==    by 0x535F181: start_thread (pthread_create.c:312)
==27043==    by 0x566F47C: clone (clone.S:111)
==27043==  Block was alloc'd by thread #4
==27043==
==27043==
==27043== For counts of detected and suppressed errors, rerun with: -v
==27043== Use --history-level=approx or =none to gain increased speed, at
==27043== the cost of reduced accuracy of conflicting-access information
==27043== ERROR SUMMARY: 204 errors from 1 contexts (suppressed: 1902273 from 238)

The reason why I am so puzzled by this is that both on line 36 and line 61
the mutex for the element is held.
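Since the reproducer itself is only in the attachment, here is a rough,
hypothetical sketch of the access pattern being described (not the actual
code; the names B_lock, B_cond and var2 are guessed from the symbols in the
trace and from the replies below). Both the write at line 36 and the read at
line 61 are performed while holding the mutex that belongs to the element
being accessed:

#include <pthread.h>

struct B {
    pthread_mutex_t B_lock;          // per-element mutex ("the mutex for the element")
    pthread_cond_t  B_cond;
    long            var2;
    B() : var2(0) {
        pthread_mutex_init(&B_lock, 0);
        pthread_cond_init(&B_cond, 0);
    }
};

// Writer side, roughly what line 36 is described as doing.
void write_path(B &b)
{
    pthread_mutex_lock(&b.B_lock);
    b.var2++;                        // write, element mutex held
    pthread_mutex_unlock(&b.B_lock);
}

// Reader side, roughly what line 61 is described as doing (the real code
// additionally holds the global A lock, as the "Locks held: 2" line shows).
long read_path(B &b)
{
    pthread_mutex_lock(&b.B_lock);
    long v = b.var2;                 // read, element mutex held
    pthread_mutex_unlock(&b.B_lock);
    return v;
}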
(In reply to Ari Sundholm from comment #0)
> The reason why I am so puzzled by this is that both on line 36 and line 61
> the mutex for the element is held.

Are you sure the lock is always held on line 36 (that does b.var2++;)?
pthread_cond_wait will release the lock. So, on the next loop, b.B_lock is
not held anymore.
(In reply to Philippe Waroquiers from comment #1)
> (In reply to Ari Sundholm from comment #0)
> > The reason why I am so puzzled by this is that both on line 36 and line 61
> > the mutex for the element is held.
>
> Are you sure the lock is always held on line 36 (that does b.var2++;) ?

Yes.

> pthread_cond_wait will release the lock.
> So, on the next loop, b.B_lock is not held anymore.

This is not the case: per the standard semantics of condition variables,
pthread_cond_wait re-acquires the mutex before returning to the caller. The
relevant manpage describes the semantics of the function as follows:

======================
pthread_cond_wait atomically unlocks the mutex (as per pthread_unlock_mutex)
and waits for the condition variable cond to be signaled. The thread
execution is suspended and does not consume any CPU time until the condition
variable is signaled. The mutex must be locked by the calling thread on
entrance to pthread_cond_wait. Before returning to the calling thread,
pthread_cond_wait re-acquires mutex (as per pthread_lock_mutex).

Unlocking the mutex and suspending on the condition variable is done
atomically. Thus, if all threads always acquire the mutex before signaling
the condition, this guarantees that the condition cannot be signaled (and
thus ignored) between the time a thread locks the mutex and the time it
waits on the condition variable.
======================
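To illustrate the quoted semantics, here is a minimal sketch of the standard
wait pattern (illustrative only, not the attached reproducer; the names
ready, counter, lock and cond are made up). Because pthread_cond_wait
re-acquires the mutex before it returns, the increment after the while loop
is always executed with the lock held:

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool ready = false;
static long counter = 0;

void *waiter(void *)
{
    pthread_mutex_lock(&lock);
    while (!ready) {
        // Atomically releases 'lock' and blocks; by the time this call
        // returns, the mutex has been re-acquired by this thread.
        pthread_cond_wait(&cond, &lock);
    }
    counter++;                        // still protected by 'lock' here
    pthread_mutex_unlock(&lock);
    return 0;
}

void *signaller(void *)
{
    pthread_mutex_lock(&lock);
    ready = true;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
    return 0;
}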
(In reply to Ari Sundholm from comment #2)
> This is not the case, as, per standard semantics for condition variables,
> pthread_cond_wait re-acquires the mutex before returning to the caller.

Yes, you are correct, I stopped reading the pthread_cond_wait manual too
quickly :(.

So, here is another (desperate?) attempt to explain the helgrind behaviour.

I modified the program to comment out the destroy and delete of B_cond and
B_lock, so the condition variables and locks are never deleted/destroyed.
This makes it possible to still show which lock was held, avoiding the
"(and 1 that can't be shown)". With this change, it turns out that the read
was in fact done with the A lock and one B_lock held, but not the same
B_lock as the one held by the write operation.

This leads to the hypothesis that the memory of a B might be freed and then
reallocated fast enough (and given another lock) that helgrind effectively
reports some kind of race condition. Or, to ask it the other way around: how
does the program ensure that no thread is waiting on a lock of B while, in
parallel, another thread is delete-ing b? For sure, the "delete &b;" is done
after the unlock of B_lock, but I guess the intent is that the A lock ensures
there is no race on B while inserting or deleting a B.

In any case, here is a little bit of tracing produced by --trace-malloc=yes
(this is the modified version that does not destroy B_cond/B_lock):

--21045-- _ZdlPv(0x4B780E8)
--21045-- _Znwj(16) = 0x4B780E8      <<<<< allocation of a B
--21045-- _Znwj(24) = 0x4B7DB58      <<<<< allocation of new lock for B
--21045-- _Znwj(48) = 0x4B7DBA0
--21045-- _Znwj(24) = 0x4B7DC00
--21045-- _ZdlPv(0x4B7DC00)
--21045-- _ZdlPv(0x4B780E8)          <<<<< free previous B
--21045-- calloc(18,8) = 0x4B7DC00
--21045-- _Znwj(16) = 0x4B780E8      <<<<< allocation of a B (reallocating just freed block)
--21045-- _Znwj(24) = 0x4B7DCC0      <<<<< allocation of new lock for B
--21045-- _Znwj(48) = 0x4B7DD08
--21045-- _Znwj(24) = 0x4B7DD68

When using the helgrind option --free-is-write=yes, helgrind reports a race
condition between a write and a previous write, the previous write being the
delete/free operation:

==21164== Possible data race during write of size 4 at 0x4B783EC by thread #7
==21164== Locks held: 1, at address 0x4B78698
==21164==    at 0x8049178: A::method1(int) (helgrind_bug_reproducer.orig.cpp:36)
==21164==    by 0x8048E06: thread(void*) (helgrind_bug_reproducer.orig.cpp:137)
==21164==    by 0x402DA76: mythread_wrapper (hg_intercepts.c:389)
==21164==    by 0x405BEFA: start_thread (pthread_create.c:309)
==21164==    by 0x42B2EDD: clone (clone.S:129)
==21164==
==21164== This conflicts with a previous write of size 4 by thread #11
==21164== Locks held: 1, at address 0x804CBD8
==21164==    at 0x402A868: operator delete(void*) (vg_replace_malloc.c:576)
==21164==    by 0x80492AD: A::method2(int) (helgrind_bug_reproducer.orig.cpp:65)
==21164==    by 0x8048E30: thread(void*) (helgrind_bug_reproducer.orig.cpp:139)
==21164==    by 0x402DA76: mythread_wrapper (hg_intercepts.c:389)
==21164==    by 0x405BEFA: start_thread (pthread_create.c:309)
==21164==    by 0x42B2EDD: clone (clone.S:129)
==21164==  Address 0x4b783ec is 4 bytes inside a block of size 16 alloc'd
==21164==    at 0x40299F6: operator new(unsigned int) (vg_replace_malloc.c:328)
==21164==    by 0x80490D3: A::method1(int) (helgrind_bug_reproducer.orig.cpp:23)
==21164==    by 0x8048E06: thread(void*) (helgrind_bug_reproducer.orig.cpp:137)
==21164==    by 0x402DA76: mythread_wrapper (hg_intercepts.c:389)
==21164==    by 0x405BEFA: start_thread (pthread_create.c:309)
==21164==    by 0x42B2EDD: clone (clone.S:129)
==21164==  Block was alloc'd by thread #9

So, we have two indications that the problem might be related to the way the
memory of the B objects is allocated/freed/reallocated. That might be a bug
in the way helgrind handles allocate/free/reallocate. It might also be a real
race in the program, related to memory management, although it is not very
clear how this could happen with the A lock held. I am also wondering about
the limitations helgrind has regarding condition variables (see the helgrind
user manual).

So, this helgrind error is not very clear to me; it looks like we need a real
expert to look at this :)
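To make the lifetime question above concrete, here is a small hypothetical
sketch (again illustrative only, not the attached reproducer; the struct
layout and function names are invented). The point is that the A lock taken
by the deleting thread says nothing about a thread that is currently blocked
inside pthread_cond_wait on b->B_cond: that waiter has temporarily released
b->B_lock and holds nothing the deleter checks. Deleting b in that window
frees memory (and a lock) that the waiter will touch again once woken, and a
subsequent 'new B' can reuse the same address with a brand new B_lock, which
would match the --trace-malloc output above:

#include <pthread.h>

struct B {                            // mutex/cond assumed initialised when the B was created
    pthread_mutex_t B_lock;
    pthread_cond_t  B_cond;
    bool            ready;
    long            var2;
};

static pthread_mutex_t A_lock = PTHREAD_MUTEX_INITIALIZER;

// Runs in one thread; corresponds roughly to the waiting/writing side.
void waiter(B *b)
{
    pthread_mutex_lock(&b->B_lock);
    while (!b->ready)
        pthread_cond_wait(&b->B_cond, &b->B_lock);  // B_lock released while blocked
    b->var2++;                        // may touch memory already freed by the other thread
    pthread_mutex_unlock(&b->B_lock);
}

// Runs in another thread; corresponds roughly to the deleting side.
void deleter(B *b)
{
    pthread_mutex_lock(&A_lock);      // does not exclude a waiter already parked
                                      // inside pthread_cond_wait on b->B_cond
    pthread_mutex_lock(&b->B_lock);   // succeeds: the waiter released it
    b->ready = true;
    pthread_cond_signal(&b->B_cond);
    pthread_mutex_unlock(&b->B_lock);
    delete b;                         // frees b while the woken waiter still has to
                                      // re-acquire b->B_lock and run b->var2++
    pthread_mutex_unlock(&A_lock);
}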