Version: unspecified
OS: Linux

Certain regression tests run for a very long time and possibly loop forever. This happens on a z900 machine. (I usually kill the process after I lose patience.) The test cases are:
helgrind/tests/annotate_hbefore
helgrind/tests/pth_barrier3

Reproducible: Always
assigned to myself
I was able to reproduce the endless loop sometimes on a z800. It looks like this code

[...]
void* thread_fn1 ( void* arg )
{
   UWord* w = (UWord*)arg;
   delay100ms();    // ensure t2 gets to its wait first
[...]

does not reliably ensure that t2 gets to its wait first on a slow machine. Changing delay100ms() to wait for 500 ms helps. Still looking for a better solution.
(In reply to comment #2)
> I was able to reproduce the endless loop sometimes on a z800.
> Looks like this code

This problem might not be specific to s390x. I also encountered an infinite loop in annotate_hbefore on linux/amd64.

I get an infinite loop on a 4-core machine:
model name : Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz
It works OK on a 24-core machine:
model name : Intel(R) Xeon(R) CPU X7460 @ 2.66GHz
The kernel version on these two systems is the same (rhel 2.6.18-92.el5).

Julian gave the following hypothesis:
"I wonder if this is some interaction of V's unfair scheduling and the lack of a load fence in the spin loop in do_wait()."

So adding a load fence (I have no idea how to do that) might solve the problem on linux/amd64 and maybe also on s390?
Florian,

regarding pth_barrier3: This testcase requires ~500MB. Your CDS system only had 256MB. Can you retest with the bigger system (and swap) or try to apply this patch:

--- drd/tests/pth_barrier.c     (revision 11793)
+++ drd/tests/pth_barrier.c     (working copy)
@@ -81,7 +81,10 @@
     t[i].b = &b;
     t[i].array = array;
     t[i].iterations = iterations;
-    pthread_create(&t[i].tid, 0, (void*(*)(void*))threadfunc, &t[i]);
+    if (pthread_create(&t[i].tid, 0, (void*(*)(void*))threadfunc, &t[i])) {
+      printf("Error creating all threads!\n");
+      abort();
+    }
   }
 
   for (i = 0; i < nthread; i++)
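For illustration, here is a minimal standalone sketch of the same idea: check the return value of pthread_create instead of assuming every thread was started. The names worker and start_all are hypothetical and do not come from drd/tests/pth_barrier.c. Unlike the patch above, which aborts, this variant reports the error code and joins the threads that did start.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical worker; stands in for the test's threadfunc. */
static void *worker(void *arg)
{
    int *slot = arg;
    *slot = 1;                  /* record that this thread ran */
    return NULL;
}

/* Start n threads. Returns 0 on success, otherwise the error code
   returned by pthread_create. On failure, the threads that were
   already started are joined before returning. */
static int start_all(pthread_t *tid, int *slots, int n)
{
    assert(tid != NULL && slots != NULL);
    for (int i = 0; i < n; i++) {
        int rc = pthread_create(&tid[i], NULL, worker, &slots[i]);
        if (rc != 0) {
            fprintf(stderr, "pthread_create #%d failed: %s\n",
                    i, strerror(rc));
            for (int j = 0; j < i; j++)
                pthread_join(tid[j], NULL);
            return rc;
        }
    }
    return 0;
}
```

pthread_create returns an error number directly rather than setting errno, which is why the result is passed to strerror. In a barrier test, joining the started threads would only be safe if they cannot block waiting for the missing ones, so aborting as in the patch is the simpler choice there.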
(In reply to comment #4)
> Florian,
>
> regarding pth_barrier3: This testcase requires ~500MB. Your CDS system only had
> 256MB. Can you retest with the bigger system (and swap) or try to apply this
> patch:
>

This testcase now runs through quickly with the same diffs as your nightly runs (plus an additional size 1 vs size 4 difference due to MVC).

Philippe, thanks for sharing that insight about annotate_hbefore. I used to kill this testcase (it runs for hours if you let it...). I added a serialization op to the loop in do_wait:

Index: helgrind/tests/annotate_hbefore.c
===================================================================
--- helgrind/tests/annotate_hbefore.c   (revision 11964)
+++ helgrind/tests/annotate_hbefore.c   (working copy)
@@ -245,8 +245,10 @@
 {
    UWord w0 = *w;
    UWord volatile * wV = w;
-   while (*wV == w0)
+   while (*wV == w0) {
+      asm volatile (".hword 0x07f0\n\t");
       ;
+   }
    ANNOTATE_HAPPENS_AFTER(w);
 }

which fixes the problem. The testcase now runs in 2 seconds or so.
>> regarding pth_barrier3: This testcase requires ~500MB. Your CDS system only had
>> 256MB. Can you retest with the bigger system (and swap) or try to apply this
>> patch:
>>
>
> This testcase now runs through quickly with the same diffs as your nightly runs
> (plus an additional size 1 vs size 4 difference due to MVC).

Good. We should consider applying my patch anyway; otherwise any user of these old CDS systems will face the same endless loop.

Christian
(In reply to comment #6)
>
> Good. We should consider to apply my patch anyway, otherwise any user of these
> old CDS systems will face the same endless loop.
>

Done in r11967
(In reply to comment #3)
> Julian gave the following hypothesis:
>
> "I wonder if this is some interaction of V's unfair scheduling
> and the lack of a load fence in the spin loop in do_wait()."
>
> So, adding a load fence (I have no idea how to to that) might solve the problem
> on linux/amd64 and maybe also on s390 ?

As I mentioned in another comment, adding a serialization insn worked on s390x. If you want to try on x86, use one of the insns that is used to implement an Xin_MFence. See host_x86_defs.c around line 2533.
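As a sketch of what a fenced spin loop could look like without hand-coded opcodes, the GCC/Clang builtin __atomic_thread_fence can request a full barrier on each iteration. This is an illustration under stated assumptions, not the change made to annotate_hbefore.c: spin_until_changed and bump are hypothetical names, the old value is passed in by the caller rather than sampled inside the function, and whether such a fence cures the hang under Valgrind is not established here.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

typedef unsigned long UWord;

/* Spin until *w no longer equals `old`, issuing a full
   (sequentially consistent) fence on every iteration. */
static void spin_until_changed(UWord *w, UWord old)
{
    assert(w != NULL);
    while (__atomic_load_n(w, __ATOMIC_RELAXED) == old)
        __atomic_thread_fence(__ATOMIC_SEQ_CST);
}

/* Hypothetical signaller: atomically increments the word. */
static void *bump(void *arg)
{
    __atomic_add_fetch((UWord *)arg, 1, __ATOMIC_SEQ_CST);
    return NULL;
}
```

A caller would initialise the word to 0, start a thread running bump on it, and call spin_until_changed(&word, 0). Because the expected old value is fixed by the caller, the loop terminates even if the increment happens before the waiter starts spinning.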
helgrind/tests/annotate_hbefore:
In r12008 I added a load fence as described in comment #5. That helped at the time, I'm positive. But today it's hanging again, so I'm disabling this test on s390x for now so that I can get a regtest through on z900 (r12019).
Regtest now runs through in just over an hour elapsed time. Not too bad for that old piece of iron (z900).
(In reply to comment #10)
> Regtest now runs through in just over an hour elapsed time. Not too bad for
> that old piece of iron (z900).

I tried to fix it on amd64 by adding an "sfence" instruction, but it did not help.

A question about the state of the Valgrind tests on s390: is it OK to run all tests on the CDS system now, or is it still preferable not to run some of them?
As of r12019 you should be able to do "make regtest" and not have it hang.
See my comment #2. The code is obviously broken. If for some reason (loaded system, lots of processes) the other thread does not pass a certain point within 100 ms, the code will livelock, since we wake up before the other thread waits.

Christian
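To make the ordering problem concrete: the waiter samples the word and then spins until it changes, so a signal that arrives before the sample is never seen. One way to remove the dependence on timing is an explicit handshake in place of the fixed delay. The following is a minimal C11 sketch under that assumption; struct handshake, waiter, signaller and PAYLOAD are hypothetical names, none of this is taken from annotate_hbefore.c, and it is not a description of any committed fix.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

enum { PAYLOAD = 41 };          /* arbitrary illustrative value */

/* Hypothetical shared state. */
struct handshake {
    atomic_ulong word;          /* the word the waiter spins on */
    atomic_int   ready;         /* set once the waiter has sampled word */
    int          payload;       /* data handed from signaller to waiter */
};

/* Waiter: sample first, announce second, spin third. */
static void *waiter(void *arg)
{
    struct handshake *h = arg;
    unsigned long w0 = atomic_load(&h->word);
    atomic_store(&h->ready, 1);
    while (atomic_load(&h->word) == w0)
        ;                       /* spin until the signaller bumps word */
    assert(h->payload == PAYLOAD);  /* signaller's write is visible here */
    h->payload += 1;
    return NULL;
}

/* Signaller: do not signal until the waiter has taken its sample. */
static void *signaller(void *arg)
{
    struct handshake *h = arg;
    while (atomic_load(&h->ready) == 0)
        ;                       /* replaces the fixed 100 ms delay */
    h->payload = PAYLOAD;
    atomic_fetch_add(&h->word, 1);
    return NULL;
}
```

The order inside waiter matters: if ready were set before the sample was taken, the signaller could still slip its increment in between and the waiter would spin forever. Note also that a handshake built from atomics adds synchronisation of its own, which may not be acceptable in a test whose purpose is to exercise the happens-before annotations.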
(In reply to comment #13)
> see my comment #2. The code is obviously broken. If for some reason (loaded
> system, lots of processes) the other thead does not pass a certain point within
> 100ms the code will life lock, since we wake up before the other thread waits.
>

Thanks for the reminder. I missed it. Changed in r12031. Let's see how reliable that is.