[ A rather annoying bug, as it makes it impossible to set up a continuous build with valgrind ]

Discussion:
http://sourceforge.net/mailarchive/forum.php?thread_name=201002092143.15581.jseward@acm.org&forum_name=valgrind-developers

I've minimized the problem to a small test (below).
It spawns many threads and doesn't join them before exiting.
It will hang (or loop forever) in about one out of 40-100 runs:

% g++ -g -lpthread hang.cc
% for((i=10;i<=99;i++)); do date; time ~/valgrind/trunk/inst/bin/valgrind --tool=none --trace-syscalls=yes --trace-signals=yes -q ./a.out 2> $i.log ; done

Even nulgrind is affected.
Any suggestions?

Thanks,
--kcc
--------------------------------------------------------------------------------------------------------
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>

void *run(void *p) {
  for (int i = 0; ; i++) {
    usleep(100);
    fprintf(stderr, "T=%d i=%d\n", (int)pthread_self(), i);
  }
  return NULL;
}

int main(int argc, char** argv) {
  for (int i = 0; i < 200; i++) {
    pthread_t t;
    pthread_create(&t, NULL, run, NULL);
  }
  fprintf(stderr, "exiting main\n");
  return 0;
}
Confirming. The script

i=1
while true
do
   /usr/local/valgrind-11038/bin/valgrind --tool=none --trace-syscalls=yes --trace-signals=yes -q ./a.out > hang$i.log 2>&1
   i=`expr $i + 1`
done

and valgrind r11038 with no patches, on an 8 core monster (hp Z600) running 32 bit karmic, hung at i=27, i=7, i=34 on three of five tries, and crashed with

valgrind: m_libcprint.c:398 (add_to__vmessage_buf): Assertion 'b->buf_used >= 0 && b->buf_used < sizeof(b->buf)-128' failed.

on my third and fourth tries at i=4 and i=5.
Definitely need to fix this.
Kostya, Dan, can you try the following patch? It appears to cause thread exiting to work reliably for me, even with the delay loop in place. (Note, this is for amd64-linux only. Dan, if you want to try on 32-bit, you'll need to make the equivalent change in syswrap-x86-linux instead.)

If this works for you, I'll put up a proper patch for more extensive testing -- I noticed something else w.r.t. thread creation that needs to be fixed really, but the patch below doesn't include that fix.

This is all a bit hairy so your multicore/multiprocess bashing on it is appreciated.

===================================================================
--- coregrind/m_syswrap/syswrap-amd64-linux.c (revision 11050)
+++ coregrind/m_syswrap/syswrap-amd64-linux.c (working copy)
@@ -251,6 +251,13 @@
    ctst->sig_mask = ptst->sig_mask;
    ctst->tmp_sig_mask = ptst->sig_mask;
 
+   // PROVISIONAL FIX: start with my threadgroup being the same
+   // as my parents, so that any exit_group calls that happen before
+   // this thread actually sets its threadgroup for real (which
+   // happens in thread_wrapper in syswrap-linux.c) will kill
+   // the new thread.
+   ctst->os_state.threadgroup = ptst->os_state.threadgroup;
+
    /* We don't really know where the client stack is, because its
       allocated by the client. The best we can do is look at the
       memory mappings and try to derive some useful information. We
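
For readers following along, the race described in the patch comment can be illustrated with a toy model. This is only a sketch, not Valgrind's code: `ModelThread`, `model_clone`, `model_exit_group` and `survivors_after_early_exit` are hypothetical names, and the convention that a threadgroup of 0 means "not set yet" is an assumption of the model. Only the field name `threadgroup` is taken from the patch. The model replays one interleaving: the parent creates a thread and calls exit_group before the child has set its own threadgroup.

```c
#include <assert.h>

/* Toy model only: a hypothetical, simplified per-thread record.
   It is NOT Valgrind's ThreadState; only the field name threadgroup
   is borrowed from the patch. */
enum { MAX_THREADS = 4 };

typedef struct {
    int in_use;       /* slot is allocated to a thread */
    int threadgroup;  /* assumption of this model: 0 means "not set yet" */
    int must_exit;    /* set once exit_group has told this thread to die */
} ModelThread;

/* Model of thread creation: allocate a slot for the child. If inherit is
   nonzero, copy the parent's threadgroup immediately (the idea of the
   provisional fix); otherwise leave it unset until the child runs. */
static int model_clone(ModelThread *t, int parent, int inherit)
{
    for (int i = 0; i < MAX_THREADS; i++) {
        if (!t[i].in_use) {
            t[i].in_use = 1;
            t[i].must_exit = 0;
            t[i].threadgroup = inherit ? t[parent].threadgroup : 0;
            return i;
        }
    }
    return -1; /* no free slot */
}

/* Model of exit_group: mark every thread in the caller's group. */
static void model_exit_group(ModelThread *t, int caller)
{
    for (int i = 0; i < MAX_THREADS; i++) {
        if (t[i].in_use && t[i].threadgroup == t[caller].threadgroup)
            t[i].must_exit = 1;
    }
}

/* Count threads that exist but were never told to exit. */
static int model_survivors(const ModelThread *t)
{
    int n = 0;
    for (int i = 0; i < MAX_THREADS; i++) {
        if (t[i].in_use && !t[i].must_exit)
            n++;
    }
    return n;
}

/* Replay the racy interleaving: the parent creates a child, then calls
   exit_group before the child has set its own threadgroup. Returns the
   number of threads left running afterwards. */
int survivors_after_early_exit(int inherit)
{
    ModelThread t[MAX_THREADS] = {0};
    t[0].in_use = 1;
    t[0].threadgroup = 42; /* arbitrary nonzero group id for the parent */

    int child = model_clone(t, 0, inherit);
    assert(child >= 0);

    model_exit_group(t, 0);
    return model_survivors(t);
}
```

In this model, `survivors_after_early_exit(0)` returns 1, because the child is missed by exit_group and keeps running, which corresponds to the hang. `survivors_after_early_exit(1)` returns 0, because the child already carries its parent's threadgroup when exit_group scans the threads.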
With this patch the test did not hang after 10k runs (while w/o this patch it hangs after 20-40 runs)
I don't have access to the 8 core machine anymore, but thestig does, ask him if you want it tortured.
I've run this patch on 4-way and 8-way machines (64-bit). Works.
Committed (r11053).