Bug 226116 - valgrind (--tool=none) hangs after main has exited if there are starting threads
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: general (other bugs)
Version First Reported In: 3.6 SVN
Platform: Unlisted Binaries Linux
Importance: HI normal
Target Milestone: ---
Assignee: Julian Seward
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-02-10 08:51 UTC by Konstantin Serebryany
Modified: 2011-03-05 16:52 UTC



Description Konstantin Serebryany 2010-02-10 08:51:02 UTC
[ A rather annoying bug, as it makes it impossible to set up a continuous build with valgrind ]

Discussion: 
http://sourceforge.net/mailarchive/forum.php?thread_name=201002092143.15581.jseward@acm.org&forum_name=valgrind-developers

I've minimized the problem to a small test (below).  
It spawns many threads and doesn't join them before exiting. 
It hangs (or loops forever) roughly once in every 40-100 runs:
% g++ -g hang.cc -lpthread
% for((i=10;i<=99;i++)); do date; time ~/valgrind/trunk/inst/bin/valgrind --tool=none --trace-syscalls=yes --trace-signals=yes  -q ./a.out 2> $i.log ; done

Even nulgrind is affected. 

Any suggestions? 

Thanks, 

--kcc 

--------------------------------------------------------------------------------------------------------
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>

void *run(void *p) {
  for (int i = 0; ; i++) {
    usleep(100);
    fprintf(stderr, "T=%d i=%d\n", (int)pthread_self(), i);
  }
  return NULL;
}

int main(int argc, char** argv) {
  for (int i = 0; i < 200; i++) {
    pthread_t t;
    pthread_create(&t, NULL, run, NULL);
  }
  fprintf(stderr, "exiting main\n");
  return 0;
}
Comment 1 Dan Kegel 2010-02-10 19:04:06 UTC
Confirming.  The script
i=1
while true
do
    /usr/local/valgrind-11038/bin/valgrind  --tool=none --trace-syscalls=yes --trace-signals=yes  -q  ./a.out > hang$i.log 2>&1
    i=`expr $i + 1`
done

and valgrind r11038 with no patches, on an 8-core monster (HP Z600) running
32-bit Karmic, hung at i=27, i=7, i=34 on three of five tries, and crashed with
valgrind: m_libcprint.c:398 (add_to__vmessage_buf): Assertion 'b->buf_used >= 0 && b->buf_used < sizeof(b->buf)-128' failed.
on my third and fourth tries at i=4 and i=5.
Comment 2 Julian Seward 2010-02-15 12:02:59 UTC
Definitely need to fix this.
Comment 3 Julian Seward 2010-02-22 00:05:41 UTC
Kostya, Dan, can you try the following patch?  It appears to cause
thread exiting to work reliably for me, even with the delay loop in
place.  (Note, this is for amd64-linux only; Dan, if you want to try
on 32-bit, you'll need to make the equivalent change in
syswrap-x86-linux instead.)

If this works for you, I'll put up a proper patch for more extensive
testing -- I noticed something else w.r.t. thread creation that needs
to be fixed really, but the patch below doesn't include that fix.
This is all a bit hairy so your multicore/multiprocess bashing on it
is appreciated.

===================================================================
--- coregrind/m_syswrap/syswrap-amd64-linux.c   (revision 11050)
+++ coregrind/m_syswrap/syswrap-amd64-linux.c   (working copy)
@@ -251,6 +251,13 @@
    ctst->sig_mask = ptst->sig_mask;
    ctst->tmp_sig_mask = ptst->sig_mask;

+   // PROVISIONAL FIX: start with my threadgroup being the same
+   // as my parents, so that any exit_group calls that happen before
+   // this thread actually sets its threadgroup for real (which
+   // happens in thread_wrapper in syswrap-linux.c) will kill
+   // the new thread.
+   ctst->os_state.threadgroup = ptst->os_state.threadgroup;
+
    /* We don't really know where the client stack is, because its
       allocated by the client.  The best we can do is look at the
       memory mappings and try to derive some useful information.  We
Comment 4 Konstantin Serebryany 2010-02-22 10:21:43 UTC
With this patch the test did not hang in 10k runs (without the patch it hangs after 20-40 runs).
Comment 5 Dan Kegel 2010-02-22 10:23:00 UTC
I don't have access to the 8-core machine anymore, but thestig does; ask him
if you want it tortured.
Comment 6 Konstantin Serebryany 2010-02-22 10:36:52 UTC
I've run this patch on 4-way and 8-way machines (64-bit). Works.
Comment 7 Julian Seward 2010-02-22 12:03:46 UTC
Committed (r11053).