Summary: | Crash in helgrind with assertion: Helgrind: libhb_core.c:3762 (msm_write): Assertion 'ordxx == POrd_EQ || ordxx == POrd_LT' failed. | ||
---|---|---|---|
Product: | [Developer tools] valgrind | Reporter: | Martin <martin> |
Component: | helgrind | Assignee: | Julian Seward <jseward> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | aacid, lu_zero, tony |
Priority: | NOR | ||
Version: | unspecified | ||
Target Milestone: | --- | ||
Platform: | Compiled Sources | ||
OS: | Linux | ||
Latest Commit: | Version Fixed In: |
Description
Martin
2009-01-20 14:32:47 UTC
> Helgrind: libhb_core.c:3762 (msm_write):
> Assertion 'ordxx == POrd_EQ || ordxx == POrd_LT' failed.
Urr, that _really_ shouldn't happen. I can't begin to guess why
it did. Are you able to make a small test case that I can use to
reproduce and fix the problem? Without being able to reproduce
it there's little hope of fixing it.
Potentially a serious problem. I have a possible hypothesis about what this might be, but without a test case it's going to be essentially impossible to develop + test a fix. I'll try and put a test case together - as you can imagine, it will be very difficult to isolate the exact case! > I'll try and put a test case together
Thanks! Just as a comment, my working hypothesis (now that I
actually have one -- for the first couple of weeks I had no idea
how this could even in theory happen) -- is that this is a race
inside Helgrind itself. If that's correct, then I expect you
may find this difficult to reproduce in that identical back-to
back runs do and don't fail. So that's one possible behaviour
to look out for.
No, it fails every single time reliably. I am starting to reduce the test case - it actually occurs, as you might guess from the stack trace, inside a boost unit test which starts a number of threads. I can reproduce the same problem here. No short reproduce case unfortunately, as it involves quite a big project. *** Bug 190391 has been marked as a duplicate of this bug. *** #190391 is claimed to have a repeatable test case; if so that will be useful in diagnosing the problem. I managed to get the same result while debugging feng http://cgit.lscube.org/cgit.cgi/feng/commit/?id=89d1d0493fca13a4370e3ceed60430e899f128e3 valgrind version 3.4.1 This appears to be to do with incorrect handling of memory following a race, in code that uses reader-writer locks. I believe the following patch should fix it. It would be good if you (all :-) could test it. It should apply against a standard 3.4.1 tarball, possibly with some fuzz. You may also want to consider applying the patch that fixes #188248. That fixes a different problem in Helgrind's handling rwlocks. It is the case that some programs trigger both bugs, so both fixes are necessary. Index: helgrind/libhb_core.c =================================================================== --- helgrind/libhb_core.c (revision 9900) +++ helgrind/libhb_core.c (working copy) @@ -3670,18 +3670,21 @@ /* assert on sanity of constraints. */ POrd ordxx = VtsID__getOrdering(rmini,wmini); tl_assert(ordxx == POrd_EQ || ordxx == POrd_LT); - svNew = MSM_RACE2ERR - ? SVal__mkE() - /* see comments on corresponding fragment in - msm_write for explanation. */ - /* aggressive setting: */ - /* - : SVal__mkC( VtsID__join2(wmini,tviR), - VtsID__join2(wmini,tviW) ); - */ - /* "consistent" setting: */ - : SVal__mkC( VtsID__join2(rmini,tviR), - VtsID__join2(wmini,tviW) ); + /* Compute svNew following the race. This isn't so + simple. */ + /* see comments on corresponding fragment in + msm_write for explanation. */ + if (MSM_RACE2ERR) { + /* XXX UNUSED; this should be deleted */ + tl_assert(0); + svNew = SVal__mkE(); + } else { + VtsID r_joined = VtsID__join2(rmini,tviR); + VtsID w_joined = VtsID__join2(wmini,tviW); + /* ensure that r_joined <= w_joined */ + w_joined = VtsID__join2( w_joined, r_joined ); + svNew = SVal__mkC( r_joined, w_joined ); + } record_race_info( acc_thr, acc_addr, szB, False/*!isWrite*/ ); goto out; } @@ -3747,22 +3750,40 @@ /* assert on sanity of constraints. */ POrd ordxx = VtsID__getOrdering(rmini,wmini); tl_assert(ordxx == POrd_EQ || ordxx == POrd_LT); - svNew = MSM_RACE2ERR - ? SVal__mkE() - /* One possibility is, after a race is seen, to - set the location's constraints as aggressively - (as far ahead) as possible. However, that just - causes lots more races to be reported, which is - very confusing. Hence don't do this. */ - /* - : SVal__mkC( VtsID__join2(wmini,tviR), - VtsID__join2(wmini,tviW) ); - */ - /* instead, re-set the constraints in a way which - is consistent with (ie, as they would have been - computed anyway) had no race been detected. */ - : SVal__mkC( VtsID__join2(rmini,tviR), - VtsID__join2(wmini,tviW) ); + /* Compute svNew following the race. This isn't so + simple. */ + if (MSM_RACE2ERR) { + /* XXX UNUSED; this should be deleted */ + tl_assert(0); + svNew = SVal__mkE(); + /* One possibility is, after a race is seen, to + set the location's constraints as aggressively + (as far ahead) as possible. However, that just + causes lots more races to be reported, which is + very confusing. Hence don't do this. */ + /* + = SVal__mkC( VtsID__join2(wmini,tviR), + VtsID__join2(wmini,tviW) ); + */ + } else { + /* instead, re-set the constraints in a way which is + consistent with (ie, as they would have been computed + anyway) the case where no race was detected. */ + VtsID r_joined = VtsID__join2(rmini,tviR); + VtsID w_joined = VtsID__join2(wmini,tviW); + /* Because of the race, the "normal" ordering constraint + wmini(constraint) <= tviW(actual access) no longer + holds. Hence it can be that the required constraint + (on SVal_Cs) r_joined <= w_joined does not hold either. + To fix this and guarantee we're not generating invalid + SVal_Cs, do w_joined = w_joined `join` r_joined, so as + to force r_joined <= w_joined in the arguments to + SVal__mkC. I think this is only important when we're + dealing with reader-writer locks. + */ + w_joined = VtsID__join2( w_joined, r_joined ); + svNew = SVal__mkC( r_joined, w_joined ); + } record_race_info( acc_thr, acc_addr, szB, True/*isWrite*/ ); goto out; } This patch and the patch for #188248 both work for me, and my program now runs cleanly. Applied to valgrind svn and seems to work correctly Crash fixed here too. Thanks for testing. Looks good. Ok to close this bug now? Fixed (trunk r10074). Fix will also be shipped in 3.4.2. |