Summary: store conditional of guest applications always fails - observed on Octeon3 (MIPS)
Product: [Developer tools] valgrind
Component: vex
Status: RESOLVED FIXED
Severity: normal
Priority: NOR
Version: 3.10 SVN
Target Milestone: ---
Platform: Other
OS: Linux
Reporter: Maran Pakkirisamy <maran.pakkirisamy>
Assignee: Julian Seward <jseward>
CC: maran.pakkirisamy, mips32r2, tom
Attachments:
patch to emulate LL/SC
testcase to show ABA problem
fix atomicity - valgrind
fix atomicity - VEX
Description
Maran Pakkirisamy
2015-02-24 12:08:23 UTC
Created attachment 91266 [details]
patch to emulate LL/SC
The patch implements the proposal made in the description. It is enabled only for Octeon3 as it is a rudimentary attempt.
This is a clever idea to solve the problem. It is a shame, however, that the ABA semantics are not the same as for LL/SC natively. That concerns me because it means that there could be programs which, with this patch, execute incorrectly. Can you think of any way to avoid this problem? Actually, as a first step, can you write a test case to show the ABA problem? It might be possible to fix the problem. But we need a test case first.

The idea basically is that, when a thread switch occurs (which is something the Valgrind core can observe), it adjusts the LLdata and LLaddr fields of all threads so as to cause any subsequent SCs to fail. I think this might fix the ABA problem, because IIUC the ABA problem requires some other thread to run in addition to the thread doing the LL/SC. And the V core serialises thread execution. So it observes the thread doing LL/SC being de-scheduled in between the LL and SC, and adjusts its guest state so as to cause the SC to fail when that thread is later resumed.

In fact it would probably be enough to invalidate any in-progress LLs only for the thread being de-scheduled. I can help find the place to put the invalidation if that's useful.
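In concrete terms, a minimal sketch of that invalidation, assuming the emulation keeps its LL state in guest_LLaddr/guest_LLdata fields of the guest state as described above (the helper name and field layout are illustrative, not the actual patch):

    /* Illustrative only, not the actual patch: force any SC executed after
       this thread is resumed to fail, by clearing the (assumed)
       guest_LLaddr/guest_LLdata fields of its guest state. */
    static void invalidate_in_progress_LL ( ThreadState* tst )
    {
       tst->arch.vex.guest_LLaddr = 0;   /* 0 taken to mean "no LL pending" */
       tst->arch.vex.guest_LLdata = 0;
    }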
Created attachment 91595 [details]
testcase to show ABA problem
case 1:
./aba x
Thread 1 (main): execute LL on X
Thread 2 (updater_X): update the mem location X
Thread 1 (main): execute SC on X
case 2:
./aba
Thread 1 (main): execute LL on X (X is the memory location)
Thread 2 (updater_nonX): update a mem location unrelated to X
Thread 1 (main): execute SC on X
On Octeon2, when executed natively, case 1 fails, in line with the semantics of LL/SC. But case 2 passes - the MIPS ISA leaves this case unpredictable (updating an arbitrary memory location between LL and SC) and Octeon2 chooses to let the SC succeed in this case.
On Octeon3, when executed natively, case 1 fails, in line with the semantics of LL/SC. Case 2 fails as well, as Octeon3 chooses to fail the SC in this case.
With the LL/SC emulation patch, both cases pass (the SC succeeds) when the testcase is executed under the patched V on Octeon3.
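For orientation, a minimal sketch of what such a test might look like (illustrative only, not attachment 91595; it assumes MIPS32 GCC inline assembly, the operand constraints are untested guesses, and updater_X here writes a new value and then restores the old one - the classic A-B-A pattern - whereas the real attachment may simply overwrite it):

    #include <pthread.h>
    #include <stdio.h>

    static volatile int X = 1;     /* location targeted by LL/SC */
    static volatile int Y = 1;     /* unrelated location (case 2) */

    static void* updater_X(void* p)    { (void)p; X = 2; X = 1; return NULL; }  /* A -> B -> A */
    static void* updater_nonX(void* p) { (void)p; Y = 2;        return NULL; }

    int main(int argc, char** argv)
    {
       pthread_t t;
       int oldv, ok;
       (void)argv;

       /* LL: load-linked from X */
       __asm__ __volatile__("ll %0, %1" : "=r"(oldv) : "m"(X));

       /* let the updater thread run between the LL and the SC */
       pthread_create(&t, NULL, argc > 1 ? updater_X : updater_nonX, NULL);
       pthread_join(t, NULL);

       /* SC: try to store oldv back to X; ok ends up 1 on success, 0 on failure */
       ok = oldv;
       __asm__ __volatile__("sc %0, %1" : "+r"(ok), "+m"(X));

       printf("SC %s\n", ok ? "succeeded" : "failed");
       return 0;
    }

Running it with an argument exercises case 1 (./aba x) and without one exercises case 2 (./aba), matching the invocations above.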
"it would probably be enough to invalidate any in-progress LLs only for the thread being de-scheduled." - I think this should solve the problem. Can you help me to identify the places?
> "it would probably be enough to invalidate any in-progress LLs only for the
> thread being de-scheduled." - I think this should solve the problem. Can you
> help me to identify the places?
coregrind/m_scheduler/scheduler.c run_thread_for_a_while()
The SCHEDSETJMP macro call causes code to actually run. So,
any point after the SCHEDSETJMP call and before the end of the
function.
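Concretely, a hedged sketch of where such a call might sit, with the scheduler code heavily paraphrased rather than quoted (it reuses the illustrative invalidate_in_progress_LL helper from earlier; VG_(get_ThreadState) is the usual accessor for a thread's state):

    /* In coregrind/m_scheduler/scheduler.c, run_thread_for_a_while(),
       paraphrased: SCHEDSETJMP is where guest code actually runs, so any
       point between it and the end of the function sees the thread being
       de-scheduled. */
    SCHEDSETJMP(tid, trc, /* ... run the guest thread ... */);
    /* ... */
    /* thread 'tid' is now being de-scheduled: kill any in-progress LL so
       that a later SC by this thread fails. */
    invalidate_in_progress_LL( VG_(get_ThreadState)(tid) );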
> coregrind/m_scheduler/scheduler.c run_thread_for_a_while()
> The SCHEDSETJMP macro call causes code to actually run. So,
> any point after the SCHEDSETJMP call and before the end of the
> function.
With your suggested fix, the patch passes the aba testcase. However, memcheck/tests/atomic_incs fails. This testcase forks another process. If I alter the testcase so that it spawns 2 threads instead of 2 processes, the testcase passes.
How does valgrind treat applications that fork? Is it different from a multi-threaded application? Any pointers to the part of the code I should look into will help.
(In reply to Maran Pakkirisamy from comment #6)

> With your suggested fix, the patch passes the aba testcase.

Good.

> However, memcheck/tests/atomic_incs fails.

Hmm, that's serious. Because Valgrind serialises thread execution (that is, only one thread runs at once), the change you made -- to use threads rather than processes -- has the effect of not testing the atomicity properties at all, because there is only ever one thread running at any one time. That's why the test uses 2 processes -- they then genuinely run in parallel and genuinely test the atomicity properties for these insns. The same applies to any test program you write that involves multiple threads but only one process.

I suppose what you can infer from that is that your implementation is somehow not providing the atomicity that you expect it to. So, I'm not really sure where that points, except to re-check the implementation and its assumptions, somehow. I hope you can figure this out, because fixing it would solve an important problem on all targets that use LL/SC.

(In reply to Julian Seward from comment #7)
> I suppose what you can infer from that is that your implementation is
> somehow not providing the atomicity that you expect it to. So, I'm
> not really sure where that points, except to re-check the
> implementation and its assumptions, somehow.

Right. While I made the initial proposal, I did not give thought to system-wide atomicity that involves other processes. I have updated the patches so that the whole LL/RMW/SC sequence is executed atomically. (ABA is still a problem :( )

With the update, only step 3 is changed. That is, when SC is encountered:

a) if LLaddr == 0, set rt to 0 to indicate the SC has failed and move on to the next instruction. (LLaddr == 0 indicates the thread was yielded before the SC. So, to ensure that we don't get into ABA (among the threads of the instrumented application), LLaddr is invalidated (set to 0) whenever another thread is scheduled.)

b) if LLaddr != SCaddr, set rt to 0 to indicate the SC has failed and move on to the next instruction. (LLaddr != SCaddr indicates the mem of the SC does not correspond to the mem of the last LL, which is a failure according to the specification.)

c) (a and b failed). Execute the SC under CAS. That is, the store will be performed if LLdata (saved when the LL was executed) is the same as SCdata (the current value at mem). And this step is performed atomically using CAS. This is the significant part of the change from the initial proposal.

a) and b) help to handle the ABA scenario, but only among the threads of the multithreaded application profiled. c) helps to ensure atomicity among all processes/threads. However, it does not capture the ABA case in this scenario, the reason being that the RMW region is effectively covered by a CAS, and hence the ABA problem persists.

However, with the patch, memcheck/tests/atomic_incs succeeds and no regressions are observed. I am not able to come up with a testcase that has an ABA race among processes and that fails on V. (I took the aba testcase for threads (attachment #91595 [details]) and modified it to spawn processes. The resultant testcase gives the same result (SC fails) with and without valgrind. So, I do not yet have a testcase to prove that the ABA problem among processes still remains.)
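A minimal sketch of step (c), assuming GCC's __sync_bool_compare_and_swap builtin stands in for the CAS (illustrative only, not the actual patch; the helper name and word-sized types are assumptions):

    /* Illustrative sketch of step (c): the SC's store happens only if the
       word still holds the value saved at the LL, done with a real atomic
       compare-and-swap so that other processes sharing the memory are also
       excluded. */
    static int emulate_sc_word ( unsigned int* addr,
                                 unsigned int  ll_data,   /* saved at the LL */
                                 unsigned int  new_val )  /* value the SC stores */
    {
       /* GCC builtin: returns nonzero iff *addr equalled ll_data and was
          atomically replaced by new_val. */
       return __sync_bool_compare_and_swap(addr, ll_data, new_val) ? 1 : 0;
    }

Whether the result of that CAS should feed back into rt is exactly the question raised further down in this report.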
Created attachment 91744 [details]
fix atomicity - valgrind
Created attachment 91745 [details]
fix atomicity - VEX
I doubt, with the proposed approach of LL/SC emulation, the ABA problem can be addressed at all. Here is an alternative proposal:

1) On encountering LL, convert it into a normal load and start saving the instructions in a buffer.
2) RMW - usual instrumentation. In addition, the instructions will be appended to the buffer.
3) On encountering SC, the code buffer (LL/RMW/SC) will be executed as a (dirty) helper and the result of the SC in the helper will be stored as the result in the IR of the SC. Then clear the code buffer.

I am not sure whether this is feasible and, if so, whether it can be done without making many changes to the existing framework.

Maran, the same problem has been reported for ARM64/OCTEON3 at https://bugs.kde.org/show_bug.cgi?id=369459. So let me ask: how well does your proposed solution in comments 9 and 10 work? Did you deploy it?

Our internal team continuously regtests V with the patch (solution in comments 9 and 10) and also ships the package. No issues related to this bug have been reported so far.

The changes have been committed as:
Valgrind r16269
VEX r3316
Thanks Maran for the patches, and sorry for the very long delay on this issue. We should close this issue now.

(In reply to Maran Pakkirisamy from comment #8)
Maran, I was studying this bug and your fix, so as to see how to apply it to ARM. I have a question:

> With the update, only step 3 is changed.
> [..]
> c) (a and b failed). Execute SC under CAS. That is, store will be performed
> if LLdata (saved when LL was executed) is same as SCdata (the current value
> at mem). And this step is performed atomically using CAS. This is the
> significant part of the change from the initial proposal.

Your implementation ignores whether the CAS was successful or not. Should it instead cause the SC to fail (set rt to zero) if the CAS fails?

Also, your implementation uses guest_state.LLaddr == 0 to mean "there is no transaction in progress". So guest_state.LLaddr == 0 has a special meaning that is different from all other values. I'd prefer to remove that special meaning and instead use a third boolean field.

Also, it seems safer to invalidate any in-progress transaction when the thread is scheduled, not de-scheduled. So my pseudocode summary so far is:

    LoadLinked(addr)
       gs.LLvalid = 1
       gs.LLaddr  = addr
       gs.LLdata  = *addr

    StoreCond(addr, data)
       if gs.LLvalid != 1    -> fail
       if addr != gs.LLaddr  -> fail
       if *addr != gs.LLdata -> fail
       cas_ok = CAS(addr, gs.LLdata -> data)
       if !cas_ok            -> fail
       gs.LLvalid = 0
       succeed

    When thread scheduled
       gs.LLvalid = 0

Actually, I'd prefer to have it so that any attempt to do StoreCond will cause all following attempts to fail, up until a new LoadLinked is done. How does that sound?

    LoadLinked(addr)
       gs.LLvalid = 1
       gs.LLaddr  = addr
       gs.LLdata  = *addr

    StoreCond(addr, data)
       tmp_LLvalid = gs.LLvalid
       gs.LLvalid  = 0
       if tmp_LLvalid != 1   -> fail
       if addr != gs.LLaddr  -> fail
       if *addr != gs.LLdata -> fail
       cas_ok = CAS(addr, gs.LLdata -> data)
       if !cas_ok            -> fail
       succeed

    When thread scheduled
       gs.LLvalid = 0

Hmm, even that is too relaxed, because it doesn't reject mismatched load vs store sizes. Here's a variant that does check sizes, and uses size == 0 to mean "no transaction pending".
    LoadLinked(addr)
       gs.LLsize = load_size      // 1, 2, 4 or 8
       gs.LLaddr = addr
       gs.LLdata = zeroExtend(*addr)

    StoreCond(addr, data)
       tmp_LLsize = gs.LLsize
       gs.LLsize  = 0             // "no transaction"
       if tmp_LLsize != store_size        -> fail
       if addr != gs.LLaddr               -> fail
       if zeroExtend(*addr) != gs.LLdata  -> fail
       cas_ok = CAS(store_size, addr, gs.LLdata -> data)
       if !cas_ok                         -> fail
       succeed

    When thread scheduled
       gs.LLsize = 0              // "no transaction"

Sorry, just now I noticed this update. And then found in bug #369459 that the issues are taken care of and fixed. Thanks.