Bug 235681

Summary:	[PATCH] helgrind incorrectly flags valid pthread_cond_wait usage as recursive lock of non-recursive mutex
Product:	[Developer tools] valgrind	Reporter:	Dave Goodell <goodell>
Component:	helgrind	Assignee:	Rhys Kidd <rhyskidd>
Status:	REPORTED ---
Severity:	normal	CC:	rhyskidd
Priority:	NOR
Version First Reported In:	3.6 SVN
Target Milestone:	---
Platform:	Unlisted Binaries
OS:	macOS
Latest Commit:		Version Fixed/Implemented In:
Sentry Crash Report:
Attachments:	test program to cause the helgrind error a hack-ish patch to fix the bug

Description Dave Goodell 2010-04-29 01:15:54 UTC

Created attachment 43089 [details]
test program to cause the helgrind error

Using valgrind from SVN @r11106.  This is an OS X 10.5 machine.

"uname -a" output: Darwin mcswl176.mcs.anl.gov 9.8.0 Darwin Kernel Version 9.8.0: Wed Jul 15 16:55:01 PDT 2009; root:xnu-1228.15.4~1/RELEASE_I386 i386 i386 MacBookPro3,1 Darwin

When using helgrind on simple code that uses pthread_cond_wait (and related functionality) in a typical manner, it complains that the non-recursive mutex associated with the cond var is recursively locked.  The message is as follows:

--------8<--------
==58084== Thread #2 was created
==58084==    at 0x227FDE: __bsdthread_create (in /usr/lib/libSystem.B.dylib)
==58084==    by 0x17738: pthread_create_WRK (hg_intercepts.c:257)
==58084==    by 0x1780C: pthread_create (hg_intercepts.c:294)
==58084==    by 0x1D43: main (foo.c:49)
==58084==
==58084== Thread #2: Bug in libpthread: recursive write lock granted on mutex/wrlock which does not support recursion
==58084==    at 0x1825F: pthread_cond_wait* (hg_intercepts.c:694)
==58084==    by 0x1B34: run_fn (foo.c:23)
==58084==    by 0x1766C: mythread_wrapper (hg_intercepts.c:221)
==58084==    by 0x228154: _pthread_start (in /usr/lib/libSystem.B.dylib)
==58084==    by 0x228011: thread_start (in /usr/lib/libSystem.B.dylib)
==58084==
--------8<--------

A simple test case (foo.c) is attached that will elicit the error.

My understanding of helgrind's implementation is limited, but the error appears to be caused by two invocations of _VG_USERREQ__HG_PTHREAD_MUTEX_LOCK_POST.  One comes directly from the cond_wait wrapper (hg_intercepts.c:685).  But the other one comes just before, apparently from the wrapper for mutex_lock (hg_intercepts.c:496).  I think that the actual system pthread_cond_wait implementation is calling pthread_mutex_lock under the hood, which is intercepted by valgrind.  Oddly, even though it's listed in [1], my test case doesn't seem to cause pthread_cond_wait to call pthread_mutex_unlock.  I'm not sure why at this point.

This might be a Darwin-only bug.  A co-worker tested with 3.5.0 on Linux/x86_64 and didn't get this error from helgrind.

I've also attached a patch that makes the problem go away, although it's entirely possible that it's not the best fix.  Perhaps a configure-time or run-time test for whether cond_wait uses the mutex functions under the hood is better?  Also, I don't know why a similar guard isn't needed on the unlock_pre client request.

A similar problem probably exists for pthread_cond_timedwait, but I haven't tested it yet.

[1] http://www.opensource.apple.com/source/Libc/Libc-498/pthreads/pthread_cond.c

Comment 1 Dave Goodell 2010-04-29 01:17:31 UTC

Created attachment 43090 [details]
a hack-ish patch to fix the bug