Bug 216837 - drd/tests/pth_cond_destroy_busy fails on OS X
Status: ASSIGNED
Product: valgrind
Classification: Developer tools
Component: vex
Version: 3.5 SVN
Platform: Unlisted Binaries (macOS)
Priority: NOR   Severity: normal
Target Milestone: ---
Assignee: Rhys Kidd
Duplicates: 232493 244677
 
Reported: 2009-11-30 17:56 UTC by Alexander Potapenko
Modified: 2020-11-30 20:01 UTC
CC List: 11 users



Attachments
A reproducer of the VEX threading instrumentation problem on Mac OS (295.71 KB, text/plain)
2009-11-30 17:56 UTC, Alexander Potapenko
The log of running Helgrind on the supplied program. (3.93 KB, text/plain)
2009-11-30 17:57 UTC, Alexander Potapenko
The log of running DRD -v on the supplied program. (10.88 KB, text/plain)
2009-11-30 18:03 UTC, Alexander Potapenko
The log of running Helgrind -v on the supplied program. (10.95 KB, text/plain)
2009-11-30 18:04 UTC, Alexander Potapenko
A smaller reproducer -- (only 1400 lines instead of 9k) (38.00 KB, text/plain)
2009-12-01 17:15 UTC, Alexander Potapenko
A 72-line reproducer exploiting NSOperationQueue (1.12 KB, application/x-freemind)
2009-12-09 18:21 UTC, Alexander Potapenko
A small attachment that uses only NSOperationQueue (1.12 KB, text/plain)
2009-12-23 16:55 UTC, Alexander Potapenko
A patch that adds the workq_task_start trackable event. (2.56 KB, patch)
2010-07-28 15:35 UTC, Alexander Potapenko

Description Alexander Potapenko 2009-11-30 17:56:01 UTC
Created attachment 38721 [details]
A reproducer of the VEX threading instrumentation problem on Mac OS

(My uname -a says:
Darwin mcgrind 9.7.0 Darwin Kernel Version 9.7.0: Tue Mar 31 22:52:17 PDT 2009; root:xnu-1228.12.14~1/RELEASE_I386 i386)

To reproduce the problem, compile the attached source file using:
  g++ worker_vex_test.mm   -framework Foundation

and run Helgrind and/or DRD with the runtime assertions enabled on the resulting binary (see worker_vex_test.{helgrind,drd}.log for my logs).

For some reason both tools end up trying to operate on an uninitialized thread, i.e. either VG_(running_tid) returns an incorrect result or the thread creation is instrumented incorrectly.

The reproducer is quite large (it is part of the Chromium source code, taken from www.chromium.org); I'm working on a smaller one. There are a number of Chromium tests that hit the same issue, but they're far more complicated.
Comment 1 Alexander Potapenko 2009-11-30 17:57:41 UTC
Created attachment 38722 [details]
The log of running Helgrind on the supplied program.

The Helgrind log.
Comment 2 Alexander Potapenko 2009-11-30 18:03:00 UTC
Created attachment 38723 [details]
The log of running DRD -v on the supplied program.

Note the "drd_pre_thread_create", "drd_post_thread_create" and "leaving drd_post_thread_create" lines in the log. I've patched DRD a little so that it prints "drd_pre_thread_create" and "drd_post_thread_create" entering the corresponding functions and "leaving drd_post_thread_create" leaving leaving drd_post_thread_create(). This change does not affect DRD's functionality.
Comment 3 Alexander Potapenko 2009-11-30 18:04:13 UTC
Created attachment 38724 [details]
The log of running Helgrind -v on the supplied program.

(replaced the previous Helgrind log with a more verbose one)
Comment 4 Alexander Potapenko 2009-12-01 17:15:55 UTC
Created attachment 38744 [details]
A smaller reproducer -- (only 1400 lines instead of 9k)

Added a smaller and more readable reproducer.
If someone can verify that the bug is reproducible for them, I'd appreciate it.
Thanks!
Comment 5 Konstantin Serebryany 2009-12-04 18:00:31 UTC
This also affects ThreadSanitizer in the same way.
Comment 6 Alexander Potapenko 2009-12-09 18:21:11 UTC
Created attachment 38949 [details]
A 72-line reproducer exploiting NSOperationQueue

Another reproducer that uses only NSOperationQueue.

Build: $ g++ nsop.mm -o nsop -framework Foundation
Comment 7 Alexander Potapenko 2009-12-23 16:55:15 UTC
Created attachment 39290 [details]
A small attachment that uses only NSOperationQueue



The attached code uses NSOperationQueue from the Foundation
framework. To build it, just run:

 $ g++ nsop.mm -o nsop -framework Foundation

If I run ./nsop natively, I get something like:
 2009-12-09 19:58:33.021 nsop[64614:10b] *** _NSAutoreleaseNoPool():
Object 0x106580 of class __NSFastEnumerationEnumerator autoreleased
with no pool in place - just leaking
 Stack: (0x92a01f0f 0x9290e442 0x929a41d2 0x929a4597)
 2009-12-09 19:58:33.022 nsop[64614:10b] *** _NSAutoreleaseNoPool():
Object 0x10c490 of class NSCFSet autoreleased with no pool in place -
just leaking
 Stack: (0x92a01f0f 0x9290e442 0x92926e63 0x92926b3c 0x929267b8
0x92926493 0x92926242 0x92925e3e 0x929a420f 0x929a4597)
 2009-12-09 19:58:33.022 nsop[64614:10b] *** _NSAutoreleaseNoPool():
Object 0x10ce40 of class NSCFSet autoreleased with no pool in place -
just leaking
 Stack: (0x92a01f0f 0x9290e442 0x92926e63 0x92926b3c 0x929267b8
0x92926493 0x92926242 0x92925e3e 0x929a420f 0x929a4597)
 Printer::Print()
 BYE
 Task::Run()

But neither DRD nor Helgrind can cope with this binary:

 $ inst/bin/valgrind  --tool=drd  /Users/glider/src/worker_vex_test/nsop 2>&1
 ==68403== drd, a thread error detector
 ...
 --68403-- /Users/glider/src/worker_vex_test/nsop:
 --68403-- dSYM directory is missing; consider using --dsymutil=yes
 2009-12-09 20:02:14.368 nsop[68403:50b] *** _NSAutoreleaseNoPool():
Object 0x1cb4d40 of class __NSFastEnumerationEnumerator autoreleased
with no pool in place - just leaking
 Stack: (0x4d1f0f 0x3de442 0x4741d2 0x474597)
 2009-12-09 20:02:14.617 nsop[68403:50b] *** _NSAutoreleaseNoPool():
Object 0x1cbcff0 of class NSCFSet autoreleased with no pool in place -
just leaking
 Stack: (0x4d1f0f 0x3de442 0x3f6e63 0x3f6b3c 0x3f67b8 0x3f6493
0x3f6242 0x3f5e3e 0x47420f 0x474597)
 2009-12-09 20:02:14.659 nsop[68403:50b] *** _NSAutoreleaseNoPool():
Object 0x1cc12e0 of class NSCFSet autoreleased with no pool in place -
just leaking
 Stack: (0x4d1f0f 0x3de442 0x3f6e63 0x3f6b3c 0x3f67b8 0x3f6493
0x3f6242 0x3f5e3e 0x47420f 0x474597)

 drd: drd_thread.c:584 (vgDrd_thread_set_vg_running_tid): Assertion
'vg_tid != VG_INVALID_THREADID' failed.
 ==68403==    at 0xF009DCBD: ???
 ==68403==    by 0xF009DF71: ???
 ...

 sched status:
   running_tid=0

 Thread 1: status = VgTs_WaitSys
 ==68403==    at 0x7DDC62: ioctl (in /usr/lib/libSystem.B.dylib)
 ==68403==    by 0x7EB00A: __smakebuf (in /usr/lib/libSystem.B.dylib)
 ==68403==    by 0x7EAF0C: __swsetup (in /usr/lib/libSystem.B.dylib)
 ==68403==    by 0x7BAF8D: __sfvwrite (in /usr/lib/libSystem.B.dylib)
 ==68403==    by 0x82618F: puts (in /usr/lib/libSystem.B.dylib)
 ==68403==    by 0x1D0E: Printer::Print() (in
/Users/glider/src/worker_vex_test/nsop)
 ==68403==    by 0x1BAD: main (in /Users/glider/src/worker_vex_test/nsop)

 Thread 2: status = VgTs_Init
 ==68403==    at 0x81E2A4: start_wqthread (in /usr/lib/libSystem.B.dylib)


Looks like Valgrind tries to execute code in a previously unknown
thread without calling the pre_thread_ll_create method for this
thread.
Comment 8 Julian Seward 2010-01-22 13:59:33 UTC
> Looks like Valgrind tries to execute code in a previously unknown
> thread without calling the pre_thread_ll_create method for this
> thread.

Yes.  So this has something to do with the wqthread_hijack nastiness,
which is a MacOS-specific thread creation intercept hook.  I'll take
a look.
Comment 9 Julian Seward 2010-03-01 18:48:40 UTC
I can reproduce it with the most recent test case, yes.

The patch below stops Helgrind asserting, by sending it the
thread_ll_create event to tell it the thread has come into existence.
DRD still asserts, for reasons which I'm not sure about.  I think it's
something to do with the thread_first_insn event, which DRD listens
for but Helgrind ignores.  Bart, can you have a look?

Note however that this is still way wrong, as per comments in the
patch, because we need to know the parent thread, but we don't.  Maybe
GregP knows a way to figure out what the parent thread is here.


Index: coregrind/m_syswrap/syswrap-x86-darwin.c
===================================================================
--- coregrind/m_syswrap/syswrap-x86-darwin.c    (revision 11058)
+++ coregrind/m_syswrap/syswrap-x86-darwin.c    (working copy)
@@ -420,6 +420,16 @@
       vex = &tst->arch.vex;
       allocstack(tst->tid);
       LibVEX_GuestX86_initialise(vex);
+      /* Tell threading tools the new thread exists.  FIXME: we need
+         to know the identity (tid) of the parent thread, in order
+         that threading tools can make a dependency edge from it to
+         this thread.  But we don't know what the parent thread is.
+         Hence pass 1 (the root thread).  This is completely wrong in
+         general, and could cause large numbers of false races to be
+         reported.  In fact, it's positively dangerous; we don't even
+         know if thread 1 is still alive, and the thread checkers are
+         likely to assert if it isn't. */
+      VG_TRACK(pre_thread_ll_create, 1/*BOGUS*/, tst->tid);
    }

    // Set thread's registers
@@ -461,8 +471,12 @@
       tst->client_stack_highest_word = stack+stacksize;
       tst->client_stack_szB = stacksize;

-      // GrP fixme scheduler lock?!
-
+      // tell the tool that we are at a point after the new thread
+      // has its registers set up (so we can take a stack snapshot),
+      // but before it has executed any instructions (or, really,
+      // before it has done any memory references).
+      VG_TRACK(pre_thread_first_insn, tst->tid);
+
       // pthread structure
       ML_(notify_core_and_tool_of_mmap)(
             stack+stacksize, pthread_structsize,
Comment 10 Bart Van Assche 2010-03-01 20:43:27 UTC
(In reply to comment #9)
> I can reproduce it with the most recent test case, yes.
> 
> The patch below stops Helgrind asserting, by sending it the
> thread_ll_create event to tell it the thread has come into existence.
> DRD still asserts, for reasons which I'm not sure about.  I think it's
> something to do with the thread_first_insn event, which DRD listens
> for but Helgrind ignores.  Bart, can you have a look?

As you can see below, DRD sometimes reports an assertion failure, at least for the test program that uses only NSOperationQueue. The reported assertion failure means that one of the following Valgrind client requests has been invoked by the Valgrind core with VG_(get_running_tid)() == VG_INVALID_THREADID, which is an inconsistency in the Valgrind core:
* post_mem_write
* new_mmap
* new_mem_startup
* new_mem_stack_signal

$ ./vg-in-place --tool=drd --read-var-info=yes --trace-fork-join=yes ~/valgrind-bug-216837/NSOperationQueueOnly
==19250== drd, a thread error detector
==19250== Copyright (C) 2006-2009, and GNU GPL'd, by Bart Van Assche.
==19250== Using Valgrind-3.6.0.SVN and LibVEX; rerun with -h for copyright info
==19250== Command: /Users/bart/valgrind-bug-216837/NSOperationQueueOnly
==19250==
--19250-- drd_pre_thread_create creator = 0, created = 1
--19250-- drd_post_thread_create created = 1
2010-03-01 11:34:54.771 NSOperationQueueOnly[19250:50b] *** _NSAutoreleaseNoPool(): Object 0x1bf92d0 of class __NSFastEnumerationEnumerator autoreleased with no pool in place - just leaking
Stack: (0x4ce12f 0x3daec2 0x470512 0x4708d7)
2010-03-01 11:34:55.048 NSOperationQueueOnly[19250:50b] *** _NSAutoreleaseNoPool(): Object 0x1c01400 of class NSCFSet autoreleased with no pool in place - just leaking
Stack: (0x4ce12f 0x3daec2 0x3f35b3 0x3f328c 0x3f2f08 0x3f2be3 0x3f2992 0x3f258e 0x47054f 0x4708d7)
2010-03-01 11:34:55.103 NSOperationQueueOnly[19250:50b] *** _NSAutoreleaseNoPool(): Object 0x1c05a40 of class NSCFSet autoreleased with no pool in place - just leaking
Stack: (0x4ce12f 0x3daec2 0x3f35b3 0x3f328c 0x3f2f08 0x3f2be3 0x3f2992 0x3f258e 0x47054f 0x4708d7)
--19250-- drd_pre_thread_create creator = 1, created = 2
--19250-- drd_post_thread_create created = 2

drd: drd_thread.c:584 (vgDrd_thread_set_vg_running_tid): Assertion 'vg_tid != VG_INVALID_THREADID' failed.
==19250==    at 0xF009E0D5: ???
==19250==    by 0xF009E298: ???
==19250==    by 0xF008B985: ???
==19250==    by 0xF00940D8: ???
==19250==    by 0xF0125F7A: ???

sched status:
  running_tid=0

Thread 1: status = VgTs_WaitSys
==19250==    at 0x7D73E2: ioctl (in /usr/lib/libSystem.B.dylib)
==19250==    by 0x7E5386: __smakebuf (in /usr/lib/libSystem.B.dylib)
==19250==    by 0x7E5288: __swsetup (in /usr/lib/libSystem.B.dylib)
==19250==    by 0x7B46ED: __sfvwrite (in /usr/lib/libSystem.B.dylib)
==19250==    by 0x820510: puts (in /usr/lib/libSystem.B.dylib)
==19250==    by 0x1D2E: Printer::Print() (NSOperationQueueOnly.mm:7)
==19250==    by 0x1CB7: main (NSOperationQueueOnly.mm:68)

Thread 2: status = VgTs_Init
==19250==    at 0x8185D8: start_wqthread (in /usr/lib/libSystem.B.dylib)
Comment 11 Alexander Potapenko 2010-03-02 11:29:02 UTC
> Note however that this is still way wrong, as per comments in the
> patch, because we need to know the parent thread, but we don't.  Maybe
> GregP knows a way to figure out what the parent thread is here.
> 

A possible solution would be to track each workqueue in the threading tool itself and keep a list of the parent threads of the tasks put into the queue (I'm assuming workqueues are real queues, aren't they?)
Comment 12 Alexander Potapenko 2010-03-02 17:47:25 UTC
I've experimented with several tasks posted into the same workqueue.
Looks like they can be executed in a random order (not necessarily the creation order).

Also, workq_ops doesn't take any parameter identifying the NSOperationQueue object that was used to post the task, i.e. IIUC all the tasks end up in the same global queue.
Comment 13 Alexander Potapenko 2010-03-02 18:49:40 UTC
I've also modified nsop.mm to generate ten tasks on two workqueues, queue1 and queue2. Each of the tasks printed its ID (even for queue1, odd for queue2), and I got the following log running ThreadSanitizer with the --trace-syscalls=yes flag (grepped for "workq_\|Task"):

SYSCALL[73641,1](unix:367) workq_open()[sync] --> Success(0x16f:0x0) 
SYSCALL[73641,1](unix:368) workq_ops( 1(QUEUE_ADD), 0x21a50000, 0 )TID: 1
SYSCALL[73641,1](unix:368) workq_ops( 1(QUEUE_ADD), 0x21a50018, 0 )TID: 1
SYSCALL[73641,1](unix:368) workq_ops( 1(QUEUE_ADD), 0x21a50030, 0 )TID: 1
SYSCALL[73641,1](unix:368) workq_ops( 1(QUEUE_ADD), 0x21a50048, 0 )TID: 1
SYSCALL[73641,1](unix:368) workq_ops( 1(QUEUE_ADD), 0x21a50060, 0 )TID: 1
SYSCALL[73641,1](unix:368) workq_ops( 1(QUEUE_ADD), 0x21a50078, 0 )TID: 1
SYSCALL[73641,1](unix:368) workq_ops( 1(QUEUE_ADD), 0x21a50090, 0 )TID: 1
SYSCALL[73641,1](unix:368) workq_ops( 1(QUEUE_ADD), 0x21a500a8, 0 )TID: 1
SYSCALL[73641,1](unix:368) workq_ops( 1(QUEUE_ADD), 0x21a500c0, 0 )TID: 1
SYSCALL[73641,1](unix:368) workq_ops( 1(QUEUE_ADD), 0x21a500d8, 0 )TID: 1
SYSCALL[73641,2](unix:368) workq_ops( 2(QUEUE_REMOVE), 0x21a50000, 0 )TID: 2
SYSCALL[73641,2](unix:368) workq_ops( 4(THREAD_RETURN), 0x0, 0 )TID: 2
SYSCALL[73641,5](unix:368) workq_ops( 2(QUEUE_REMOVE), 0x21a50018, 0 )TID: 5
SYSCALL[73641,5](unix:368) workq_ops( 4(THREAD_RETURN), 0x0, 0 )TID: 5
SYSCALL[73641,4](unix:368) workq_ops( 2(QUEUE_REMOVE), 0x21a50078, 0 )TID: 4
SYSCALL[73641,4](unix:368) workq_ops( 4(THREAD_RETURN), 0x0, 0 )TID: 4
SYSCALL[73641,6](unix:368) workq_ops( 2(QUEUE_REMOVE), 0x21a50060, 0 )TID: 6
SYSCALL[73641,6](unix:368) workq_ops( 4(THREAD_RETURN), 0x0, 0 )TID: 6
SYSCALL[73641,3](unix:368) workq_ops( 2(QUEUE_REMOVE), 0x21a50048, 0 )TID: 3
SYSCALL[73641,3](unix:368) workq_ops( 4(THREAD_RETURN), 0x0, 0 )TID: 3
SYSCALL[73641,7](unix:368) workq_ops( 2(QUEUE_REMOVE), 0x21a50030, 0 )TID: 7
SYSCALL[73641,7](unix:368) workq_ops( 4(THREAD_RETURN), 0x0, 0 )TID: 7
SYSCALL[73641,8](unix:368) workq_ops( 2(QUEUE_REMOVE), 0x21a50090, 0 )TID: 8
SYSCALL[73641,8](unix:368) workq_ops( 4(THREAD_RETURN), 0x0, 0 )TID: 8
SYSCALL[73641,2](unix:368) workq_ops() starting new workqueue item
SYSCALL[73641,2](unix:368) workq_ops( 2(QUEUE_REMOVE), 0x21a500a8, 0 )TID: 2
SYSCALL[73641,2](unix:368) workq_ops( 4(THREAD_RETURN), 0x0, 0 )TID: 2
SYSCALL[73641,10](unix:368) workq_ops( 2(QUEUE_REMOVE), 0x21a500c0, 0 )TID: 10
SYSCALL[73641,10](unix:368) workq_ops( 4(THREAD_RETURN), 0x0, 0 )TID: 10
SYSCALL[73641,9](unix:368) workq_ops( 2(QUEUE_REMOVE), 0x21a500d8, 0 )TID: 9
SYSCALL[73641,9](unix:368) workq_ops( 4(THREAD_RETURN), 0x0, 0 )TID: 9
Inside Task::Run(), id=0
Inside Task::Run(), id=1
Inside Task::Run(), id=5
Inside Task::Run(), id=4
Inside Task::Run(), id=3
Inside Task::Run(), id=2
Inside Task::Run(), id=6
Inside Task::Run(), id=7
Inside Task::Run(), id=8
Inside Task::Run(), id=9

That is, by keeping a list of shadow pthread_workitem_t's used in workq_ops, we are able to track the parent ID within Valgrind and pass it to the tools after the thread starts execution.
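
A minimal, self-contained sketch of this bookkeeping (plain C, not actual
Valgrind code; the names and the fixed-size table are illustrative
assumptions): record the enqueuing thread on QUEUE_ADD and recover it when
a worker picks the item up, so the core could pass a real parent tid to
the tools.

#include <assert.h>
#include <stdio.h>

typedef unsigned ThreadId;
enum { MAX_SHADOW = 128, INVALID_TID = 0 };

/* Shadow table: one entry per workitem currently sitting in a queue. */
static struct { void *workitem; ThreadId parent; } shadow[MAX_SHADOW];
static int n_shadow;

/* On workq_ops(1 (QUEUE_ADD), item, _): remember who enqueued it. */
static void shadow_queue_add(void *item, ThreadId enqueuer)
{
   assert(n_shadow < MAX_SHADOW);
   shadow[n_shadow].workitem = item;
   shadow[n_shadow].parent   = enqueuer;
   n_shadow++;
}

/* When a wqthread starts running 'item': look up and retire the entry. */
static ThreadId shadow_task_start(void *item)
{
   for (int i = 0; i < n_shadow; i++)
      if (shadow[i].workitem == item) {
         ThreadId parent = shadow[i].parent;
         shadow[i] = shadow[--n_shadow];   /* each item runs only once */
         return parent;
      }
   return INVALID_TID;   /* unknown item: caller must fall back */
}

int main(void)
{
   int item1, item2;
   shadow_queue_add(&item1, 1);
   shadow_queue_add(&item2, 1);
   printf("parent of item2: %u\n", shadow_task_start(&item2));   /* 1 */
   printf("parent of item1: %u\n", shadow_task_start(&item1));   /* 1 */
   return 0;
}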
Comment 14 Julian Seward 2010-03-02 19:04:21 UTC
(In reply to comment #13)
> That is, keeping a list of shadow pthread_workitem_t's used in
> workq_ops we are able to track the parent ID within Valgrind and
> pass it to the clients after the thread starts execution.

Do you mean, it should in theory be possible, or do you have a
patch that does this now?
Comment 15 Alexander Potapenko 2010-03-03 10:16:52 UTC
I'll try making a patch soon.
Comment 16 Alexander Potapenko 2010-03-04 11:37:47 UTC
The patch below adds the track_pre_wqthread_ll_create function to the tooliface (it also includes your patch). This was enough for me to (probably) add support for wqthreads in ThreadSanitizer.

To handle wqthreads properly, a tool doesn't need to know the parent TID of the worker thread. Instead, it should draw a happens-before arc between the point where the task is put into the queue (this is "workq_ops(1, workitem)") and the point where the task actually starts (surprisingly, this is not "workq_ops(2, workitem)", which is why we have to introduce a call to track_pre_wqthread_ll_create into wqthread_hijack). So the tool should intercept workq_ops in order to place SIGNAL(workitem), and define the track_pre_wqthread_ll_create handler to place WAIT(workitem).

I'm not entirely sure where the VG_TRACK(pre_wqthread_ll_create) call belongs inside wqthread_hijack, but the current place looks good.
Also, the name "pre_wqthread_ll_create" was picked arbitrarily and doesn't reflect its purpose well; maybe something like "pre_workqueue_task_start" would be better.
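
To illustrate the SIGNAL/WAIT pairing described above, here is a minimal,
self-contained model (plain C, not Valgrind or ThreadSanitizer code; the
logical-clock representation and all names are illustrative assumptions).
SIGNAL stores the enqueuer's logical clock under the workitem address;
WAIT joins it into the worker's clock, which is exactly the happens-before
arc the tool needs:

#include <stdio.h>

enum { MAX_ITEMS = 64 };
static const void   *item_addr[MAX_ITEMS];
static unsigned long item_clock[MAX_ITEMS];
static int n_items;

/* SIGNAL(workitem): at workq_ops(1, workitem), in the enqueuing thread. */
static void hb_signal(const void *workitem, unsigned long enqueuer_clock)
{
   if (n_items < MAX_ITEMS) {
      item_addr[n_items]  = workitem;
      item_clock[n_items] = enqueuer_clock;
      n_items++;
   }
}

/* WAIT(workitem): at task start (the new trackable event), in the worker.
   Returns the worker's clock advanced past the enqueuer's. */
static unsigned long hb_wait(const void *workitem, unsigned long worker_clock)
{
   for (int i = 0; i < n_items; i++)
      if (item_addr[i] == workitem && item_clock[i] > worker_clock)
         worker_clock = item_clock[i];
   return worker_clock;
}

int main(void)
{
   int task;
   hb_signal(&task, 7);                 /* enqueuer at clock 7 */
   printf("%lu\n", hb_wait(&task, 3));  /* worker: 3 -> 7 */
   return 0;
}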


Index: include/pub_tool_tooliface.h
===================================================================
--- include/pub_tool_tooliface.h	(revision 11055)
+++ include/pub_tool_tooliface.h	(working copy)
@@ -656,6 +656,7 @@
       ll_exit (in the child's context)
 */
 void VG_(track_pre_thread_ll_create) (void(*f)(ThreadId tid, ThreadId child));
+void VG_(track_pre_wqthread_ll_create) (void(*f)(ThreadId tid, Addr workitem));
 void VG_(track_pre_thread_first_insn)(void(*f)(ThreadId tid));
 void VG_(track_pre_thread_ll_exit)   (void(*f)(ThreadId tid));
 
Index: coregrind/m_tooliface.c
===================================================================
--- coregrind/m_tooliface.c	(revision 11055)
+++ coregrind/m_tooliface.c	(working copy)
@@ -408,6 +408,7 @@
 DEF0(track_stop_client_code,      ThreadId, ULong)
 
 DEF0(track_pre_thread_ll_create,  ThreadId, ThreadId)
+DEF0(track_pre_wqthread_ll_create, ThreadId, Addr)
 DEF0(track_pre_thread_first_insn, ThreadId)
 DEF0(track_pre_thread_ll_exit,    ThreadId)
 
Index: coregrind/pub_core_tooliface.h
===================================================================
--- coregrind/pub_core_tooliface.h	(revision 11055)
+++ coregrind/pub_core_tooliface.h	(working copy)
@@ -226,6 +226,7 @@
    void (*track_stop_client_code) (ThreadId, ULong);
 
    void (*track_pre_thread_ll_create)(ThreadId, ThreadId);
+   void (*track_pre_wqthread_ll_create)(ThreadId, Addr);
    void (*track_pre_thread_first_insn)(ThreadId);
    void (*track_pre_thread_ll_exit)  (ThreadId);
Index: coregrind/m_syswrap/syswrap-x86-darwin.c
===================================================================
--- coregrind/m_syswrap/syswrap-x86-darwin.c	(revision 11055)
+++ coregrind/m_syswrap/syswrap-x86-darwin.c	(working copy)
@@ -400,7 +400,6 @@
       set the mask correctly when we finally get there. */
    VG_(sigfillset)(&blockall);
    VG_(sigprocmask)(VKI_SIG_SETMASK, &blockall, NULL);
-
    if (reuse) {
       // This thread already exists; we're merely re-entering 
       // after leaving via workq_ops(WQOPS_THREAD_RETURN). 
@@ -420,6 +419,16 @@
       vex = &tst->arch.vex;
       allocstack(tst->tid);
       LibVEX_GuestX86_initialise(vex);
+      /* Tell threading tools the new thread exists.  FIXME: we need
+         to know the identity (tid) of the parent thread, in order
+         that threading tools can make a dependency edge from it to
+         this thread.  But we don't know what the parent thread is.
+         Hence pass 1 (the root thread).  This is completely wrong in
+         general, and could cause large numbers of false races to be
+         reported.  In fact, it's positively dangerous; we don't even
+         know if thread 1 is still alive, and the thread checkers are
+         likely to assert if it isn't. */
+      VG_TRACK(pre_thread_ll_create, 1/*BOGUS*/, tst->tid);
    }
         
    // Set thread's registers
@@ -437,6 +446,7 @@
    stacksize = 512*1024;  // wq stacks are always DEFAULT_STACK_SIZE
    stack = VG_PGROUNDUP(sp) - stacksize;
 
+   VG_TRACK(pre_wqthread_ll_create, tst->tid, workitem);
    if (reuse) {
        // Continue V's thread back in the scheduler. 
        // The client thread is of course in another location entirely.
@@ -461,7 +471,11 @@
       tst->client_stack_highest_word = stack+stacksize;
       tst->client_stack_szB = stacksize;
 
-      // GrP fixme scheduler lock?!
+      // tell the tool that we are at a point after the new thread
+      // has its registers set up (so we can take a stack snapshot),
+      // but before it has executed any instructions (or, really,
+      // before it has done any memory references).
+      VG_TRACK(pre_thread_first_insn, tst->tid);
       
       // pthread structure
       ML_(notify_core_and_tool_of_mmap)(
Comment 17 Bart Van Assche 2010-03-28 19:29:14 UTC
*** Bug 232493 has been marked as a duplicate of this bug. ***
Comment 18 Julian Seward 2010-07-22 11:20:22 UTC
(In reply to comment #16)
> The patch below adds the track_pre_wqthread_ll_create function to the tooliface
> (it also includes your patch). This was enough for me to (probably) add the
> support for wqthreads in ThreadSanitizer.

Alexander, what is the status of this patch now?  Now that you have
(I assume) used it for a few months, is it still correct?  Or do you
have a revised/modified version?

> (it also includes your patch)

Is this necessary?  As per the comments in my patch ..

> +      /* Tell threading tools the new thread exists.  FIXME: we need
> +         to know the identity (tid) of the parent thread, in order
> +         that threading tools can make a dependency edge from it to
> +         this thread.  But we don't know what the parent thread is.
> +         Hence pass 1 (the root thread).  This is completely wrong in
> +         general, and could cause large numbers of false races to be
> +         reported.  In fact, it's positively dangerous; we don't even
> +         know if thread 1 is still alive, and the thread checkers are
> +         likely to assert if it isn't. */
> +      VG_TRACK(pre_thread_ll_create, 1/*BOGUS*/, tst->tid);
>     }

.. it is completely incorrect; it just (sometimes) stops hg crashing,
so I don't want to commit your patch with this kludge in it.
Comment 19 Alexander Potapenko 2010-07-23 09:16:35 UTC
> Alexander, what is the status of this patch now?  Now that you have
> (I assume) used it for a few months, is it still correct?  Or do you
> have a revised/modified version?
Yes, we're using this interface function in ThreadSanitizer, and it is still correct.
> 
> Is this necessary?  As per the comments in my patch ..
> .. it is completely incorrect, it just (sometimes) stops hg crashing,
> so I don't want to commit your patch with this kludge in it.

Well, as long as we don't mind hg crashing, I can remove this from the patch (I can prepare a new one in a couple of days, as I have restricted internet access right now). In fact, we don't use this path in ThreadSanitizer anyway.
Comment 20 Alexander Potapenko 2010-07-28 15:35:18 UTC
Created attachment 49583 [details]
A patch that adds the workq_task_start trackable event.

To handle workqueue tasks correctly, the clients must (see the sketch after this list):
 -- intercept workq_ops(1 /* QUEUE_ADD */, workitem, _) to handle the task creation;
 -- define VG_(track_workq_task_start) to handle the start of the task execution (the handler receives the worker thread ID and the workitem address);
 -- draw a happens-before arc between these two events.
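
A hedged sketch of this wiring as a tool-side fragment (not a complete,
buildable tool; hb_signal()/hb_wait() stand in for the tool's own
happens-before machinery and are hypothetical, while
VG_(track_workq_task_start) is the registration function this patch adds):

#include "pub_tool_basics.h"      /* ThreadId, Addr */
#include "pub_tool_tooliface.h"   /* VG_(track_*) */

static void hb_signal(ThreadId tid, Addr obj) { /* tool-specific */ }
static void hb_wait  (ThreadId tid, Addr obj) { /* tool-specific */ }

/* Step 1: called from the tool's Darwin syscall handling whenever
   workq_ops(1 (QUEUE_ADD), workitem, _) is observed. */
static void saw_workq_queue_add(ThreadId tid, Addr workitem)
{
   hb_signal(tid, workitem);
}

/* Step 2: the new trackable event, fired from wqthread_hijack just
   before the worker thread executes the task. */
static void saw_workq_task_start(ThreadId worker_tid, Addr workitem)
{
   /* Step 3: close the happens-before arc opened at QUEUE_ADD. */
   hb_wait(worker_tid, workitem);
}

static void wire_up(void)   /* e.g. from the tool's pre_clo_init */
{
   VG_(track_workq_task_start)(saw_workq_task_start);
}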
Comment 21 Bart Van Assche 2011-03-06 18:55:51 UTC
*** Bug 244677 has been marked as a duplicate of this bug. ***
Comment 22 Rhys Kidd 2014-10-08 09:44:29 UTC
An even smaller reproducer is the following Valgrind regression test on Mac OS X 10.9 (darwin13):

$ perl tests/vg_regtest helgrind/tests/pth_cond_destroy_busy

drd: ./drd_thread.c:675 (void vgDrd_thread_set_vg_running_tid(const ThreadId)): Assertion 'vg_tid != VG_INVALID_THREADID' failed.
...
Comment 23 Rhys Kidd 2014-10-08 09:55:21 UTC
Sorry, that should be:

$ perl tests/vg_regtest drd/tests/pth_cond_destroy_busy 
drd: ./drd_thread.c:675 (void vgDrd_thread_set_vg_running_tid(const ThreadId)): Assertion 'vg_tid != VG_INVALID_THREADID' failed. 
...
Comment 24 Rhys Kidd 2015-07-19 07:22:09 UTC
As of r15417 the related helgrind/tests/pth_cond_destroy_busy test is not run on Darwin, so it has been removed as a blocker of BZ#344416.

However, the underlying test still fails and exposes a problem that should be fixed.
Comment 25 Halil Sen 2016-07-27 13:18:50 UTC
Are there any plans to fix this bug? A simple single-threaded "hello world" application produces the following output on OS X 10.11 with Valgrind 3.11:

valgrind --tool=drd ./bin/Debug/app

==73482== drd, a thread error detector
==73482== Copyright (C) 2006-2015, and GNU GPL'd, by Bart Van Assche.
==73482== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==73482== Command: ./bin/Debug/app
==73482== 

drd: ./drd_thread.c:749 (void vgDrd_thread_set_vg_running_tid(const ThreadId)): Assertion 'vg_tid != VG_INVALID_THREADID' failed.

host stacktrace:
==73482==    at 0x23802846E: ???
==73482==    by 0x23802888C: ???
==73482==    by 0x23802886A: ???
==73482==    by 0x23800CC52: ???
==73482==    by 0x2380150FF: ???
==73482==    by 0x2380311D5: ???
==73482==    by 0x23802F250: ???
==73482==    by 0x23802F193: ???

sched status:
  running_tid=0

Thread 1: status = VgTs_Init (lwpid 0)
Segmentation fault: 11
Comment 26 Rhys Kidd 2016-08-10 06:50:56 UTC
It just comes down to having sufficient time.
Put another way, I'd always be happy to review patches from others...