Version: 2.1.0 (using KDE Devel) Installed from: Compiled sources Compiler: gcc 3.2.2 OS: Linux I'm trying to use valgrind to debug a program that uses an alternative memory allocator (a garbage collector): http://www.hpl.hp.com/personal/Hans_Boehm/gc/ I've added appropriate uses of VALGRIND_MALLOCLIKE_BLOCK and VALGRIND_FREELIKE_BLOCK and written suppressions for the code inside the memory allocator itself. However now I've run into a problem relating to the valgrind threading model and how this library uses threads. The symptom is this assertion failure: valgrind: vg_scheduler.c:3236 (scheduler_sanity): Assertion `cv == ((void *)0)' failed. That's when a thread is in state WaitMX. There's usually just one thread that this could refer to, and it looks like this: Thread 3: status = WaitMX, associated_mx = 0x40235540, associated_cv = 0x817B7C4 ==11440== at 0x40222BDB: __pthread_mutex_lock (vg_libpthread.c:944) ==11440== by 0x40224DEB: sem_post (vg_libpthread.c:2343) ==11440== by 0x8127FA3: GC_suspend_handler (pthread_stop_world.c:121) ==11440== by 0x401670CF: (within /usr/local/lib/valgrind/valgrind.so) This is in a signal handler. I believe that what's happened is that this thread was waiting on a condition variable, then got signaled with pthread_kill. In the signal handler, it still has an associated_mx and associated_cv. The signal handler calls sem_post (which is supposed to be async-signal safe). In valgrind, this is implemented using a mutex, which causes the thread to go into state WaitMX. I imagine that this may overwrite the associated_mx, which would be a secondary problem. The sem_post man page says: the sem_post function is async-signal safe and can therefore be called from signal handlers. This is the only thread synchronization function provided by POSIX threads that is async-signal safe. The pthread_mutex_lock man page says: The mutex functions are not async-signal safe. What this means is that they should not be called from a signal handler. In particular, calling pthread_mutex_lock or pthread_mutex_unlock from a signal handler may deadlock the calling thread. I take this to mean that sem_post shouldn't be implemented with a mutex. (Of course maybe that's not true in the context of valgrind's threading model, I'm not really sure.) Here's a little more background on what this code is doing: when a garbage collection is performed, the collector needs to suspend all other threads (so that it can safely walk all global data, thread stacks, and heap blocks to perform the mark phase of garbage collection). To do this, it sends a signal to each thread which is handled by GC_suspend_handler. This calls sem_post to inform the signaling thread that it has suspended itself, and then calls sigsuspend(2) to wait for another signal to wake it back up. After the garbage collection is done, the threads are woken back up with another pthread_kill. I'd be willing to help fix this problem, but I've only just started reading the valgrind source, and I haven't quite figured out how to attack it yet. I've tried this on both a RedHat 9 machine and a Debian stable machine (with valgrind 2.0.0 and 2.1.0 locally built from source).
Created attachment 4051 [details] Proposed patch This patch saves and clears a thread's associated_mx and associated_cv before delivering a signal, and restores them when returning from the signal handler. So far this seems to fix my problem.
CVS commit by thughes: If a thread is waiting on a mutex or condition variable when a signal is delivered that the thread state is temporarily changed from WaitMX or WaitCV to Running while the signal handler is running. The original state is then restored when the handler returns. This patch forces the associated_mx and associated_cv values to be cleared at the same time and the original values restored afterwards. Without this the scheduler state will not be considered sane while the handler is running. This is based on a patch from Kenneth Schalk and fixes a problem he had with posting to a semaphore in a signal handler. It also allows a couple of assertions in the scheduler sanity check to be uncommented. BUG: 72082 M +2 -4 vg_scheduler.c 1.188 M +6 -0 vg_signals.c 1.95 M +9 -0 x86/signal.c 1.3 --- valgrind/coregrind/vg_scheduler.c #1.187:1.188 @@ -3237,8 +3237,6 @@ void scheduler_sanity ( void ) vg_assert(mx != NULL); } else { - /* Unfortunately these don't hold true when a sighandler is - running. To be fixed. */ - /* vg_assert(cv == NULL); */ - /* vg_assert(mx == NULL); */ + vg_assert(cv == NULL); + vg_assert(mx == NULL); } --- valgrind/coregrind/vg_signals.c #1.94:1.95 @@ -1596,4 +1596,10 @@ void VG_(deliver_signal) ( ThreadId tid, } + /* Clear the associated mx/cv information as we are no longer + waiting on anything. The original details will be restored + when the signal frame is popped. */ + tst->associated_mx = NULL; + tst->associated_cv = NULL; + /* handler gets the union of the signal's mask and the thread's mask */ --- valgrind/coregrind/x86/signal.c #1.2:1.3 @@ -110,4 +110,7 @@ typedef delivering this signal? */ ThreadStatus status; + void* /*pthread_mutex_t* */ associated_mx; + void* /*pthread_cond_t* */ associated_cv; + /* Sanity check word. Is the highest-addressed word; do not move!*/ @@ -270,4 +273,7 @@ void VGA_(push_signal_frame)(ThreadId ti frame->status = tst->status; + frame->associated_mx = tst->associated_mx; + frame->associated_cv = tst->associated_cv; + frame->magicE = 0x27182818; @@ -343,4 +349,7 @@ Int VGA_(pop_signal_frame)(ThreadId tid) tst->status = frame->status; + tst->associated_mx = frame->associated_mx; + tst->associated_cv = frame->associated_cv; + tst->sig_mask = frame->mask;