Bug 72082 - sem_post in thread signal handler => assertion failure
Summary: sem_post in thread signal handler => assertion failure
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: general (show other bugs)
Version: unspecified
Platform: Compiled Sources Linux
: NOR crash
Target Milestone: ---
Assignee: Julian Seward
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2004-01-07 20:45 UTC by Kenneth C. Schalk
Modified: 2004-10-16 18:17 UTC (History)
0 users

See Also:
Latest Commit:
Version Fixed In:


Attachments
Proposed patch (1.41 KB, patch)
2004-01-08 17:32 UTC, Kenneth C. Schalk
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Kenneth C. Schalk 2004-01-07 20:45:15 UTC
Version:           2.1.0 (using KDE Devel)
Installed from:    Compiled sources
Compiler:          gcc 3.2.2 
OS:          Linux

I'm trying to use valgrind to debug a program that uses an alternative
memory allocator (a garbage collector):

http://www.hpl.hp.com/personal/Hans_Boehm/gc/

I've added appropriate uses of VALGRIND_MALLOCLIKE_BLOCK and
VALGRIND_FREELIKE_BLOCK and written suppressions for the code inside
the memory allocator itself.  However now I've run into a problem
relating to the valgrind threading model and how this library uses
threads.

The symptom is this assertion failure:

valgrind: vg_scheduler.c:3236 (scheduler_sanity): Assertion `cv == ((void *)0)' failed.

That's when a thread is in state WaitMX.  There's usually just one
thread that this could refer to, and it looks like this:

Thread 3: status = WaitMX, associated_mx = 0x40235540, associated_cv = 0x817B7C4
==11440==    at 0x40222BDB: __pthread_mutex_lock (vg_libpthread.c:944)
==11440==    by 0x40224DEB: sem_post (vg_libpthread.c:2343)
==11440==    by 0x8127FA3: GC_suspend_handler (pthread_stop_world.c:121)
==11440==    by 0x401670CF: (within /usr/local/lib/valgrind/valgrind.so)

This is in a signal handler.  I believe that what's happened is that
this thread was waiting on a condition variable, then got signaled
with pthread_kill.  In the signal handler, it still has an
associated_mx and associated_cv.  The signal handler calls sem_post
(which is supposed to be async-signal safe).  In valgrind, this is
implemented using a mutex, which causes the thread to go into state
WaitMX.  I imagine that this may overwrite the associated_mx, which
would be a secondary problem.

The sem_post man page says:

  the sem_post function is async-signal safe and can therefore be
  called from signal handlers. This is the only thread synchronization
  function provided by POSIX threads that is async-signal safe.

The pthread_mutex_lock man page says:

  The mutex functions are not async-signal safe. What this means is
  that they should not be called from a signal handler. In particular,
  calling pthread_mutex_lock or pthread_mutex_unlock from a signal
  handler may deadlock the calling thread.

I take this to mean that sem_post shouldn't be implemented with a
mutex.  (Of course maybe that's not true in the context of valgrind's
threading model, I'm not really sure.)

Here's a little more background on what this code is doing: when a
garbage collection is performed, the collector needs to suspend all
other threads (so that it can safely walk all global data, thread
stacks, and heap blocks to perform the mark phase of garbage
collection).  To do this, it sends a signal to each thread which is
handled by GC_suspend_handler.  This calls sem_post to inform the
signaling thread that it has suspended itself, and then calls
sigsuspend(2) to wait for another signal to wake it back up.  After
the garbage collection is done, the threads are woken back up with
another pthread_kill.

I'd be willing to help fix this problem, but I've only just started
reading the valgrind source, and I haven't quite figured out how to
attack it yet.

I've tried this on both a RedHat 9 machine and a Debian stable machine
(with valgrind 2.0.0 and 2.1.0 locally built from source).
Comment 1 Kenneth C. Schalk 2004-01-08 17:32:09 UTC
Created attachment 4051 [details]
Proposed patch

This patch saves and clears a thread's associated_mx and associated_cv before
delivering a signal, and restores them when returning from the signal handler. 
So far this seems to fix my problem.
Comment 2 Tom Hughes 2004-10-16 18:17:21 UTC
CVS commit by thughes: 

If a thread is waiting on a mutex or condition variable when a signal is
delivered that the thread state is temporarily changed from WaitMX or WaitCV
to Running while the signal handler is running. The original state is then
restored when the handler returns.

This patch forces the associated_mx and associated_cv values to be cleared
at the same time and the original values restored afterwards. Without this
the scheduler state will not be considered sane while the handler is running.

This is based on a patch from Kenneth Schalk and fixes a problem he had
with posting to a semaphore in a signal handler. It also allows a couple
of assertions in the scheduler sanity check to be uncommented.

BUG: 72082


  M +2 -4      vg_scheduler.c   1.188
  M +6 -0      vg_signals.c   1.95
  M +9 -0      x86/signal.c   1.3


--- valgrind/coregrind/vg_scheduler.c  #1.187:1.188
@@ -3237,8 +3237,6 @@ void scheduler_sanity ( void )
          vg_assert(mx != NULL);
       } else {
-         /* Unfortunately these don't hold true when a sighandler is
-            running.  To be fixed. */
-         /* vg_assert(cv == NULL); */
-         /* vg_assert(mx == NULL); */
+         vg_assert(cv == NULL);
+         vg_assert(mx == NULL);
       }
 

--- valgrind/coregrind/vg_signals.c  #1.94:1.95
@@ -1596,4 +1596,10 @@ void VG_(deliver_signal) ( ThreadId tid,
       }
 
+      /* Clear the associated mx/cv information as we are no longer
+         waiting on anything. The original details will be restored
+         when the signal frame is popped. */
+      tst->associated_mx = NULL;
+      tst->associated_cv = NULL;
+
       /* handler gets the union of the signal's mask and the thread's
          mask */

--- valgrind/coregrind/x86/signal.c  #1.2:1.3
@@ -110,4 +110,7 @@ typedef
          delivering this signal? */
       ThreadStatus status;
+      void* /*pthread_mutex_t* */ associated_mx;
+      void* /*pthread_cond_t* */ associated_cv;
+
       /* Sanity check word.  Is the highest-addressed word; do not
          move!*/
@@ -270,4 +273,7 @@ void VGA_(push_signal_frame)(ThreadId ti
       frame->status = tst->status;
 
+   frame->associated_mx = tst->associated_mx;
+   frame->associated_cv = tst->associated_cv;
+
    frame->magicE     = 0x27182818;
 
@@ -343,4 +349,7 @@ Int VGA_(pop_signal_frame)(ThreadId tid)
    tst->status    = frame->status;
 
+   tst->associated_mx = frame->associated_mx;
+   tst->associated_cv = frame->associated_cv;
+
    tst->sig_mask  = frame->mask;