Bug 344524

Summary: store conditional of guest applications always fail - observed on Octeon3(MIPS)
Product: [Developer tools] valgrind Reporter: Maran Pakkirisamy <maran.pakkirisamy>
Component: vex    Assignee: Julian Seward <jseward>
Status: RESOLVED FIXED    
Severity: normal CC: maran.pakkirisamy, mips32r2, tom
Priority: NOR    
Version: 3.10 SVN   
Target Milestone: ---   
Platform: Other   
OS: Linux   
Attachments: patch to emulate LL/SC
testcase to show ABA problem
fix atomicity - valgrind
fix atomicity - VEX

Description Maran Pakkirisamy 2015-02-24 12:08:23 UTC
It is observed that on Octeon3 (MIPS), when Valgrind encounters a load linked/store conditional (LL/SC) instruction sequence in the guest code, the store conditional (SC) always fails, and hence the guest application fails to make progress. The situation is the same with the LLD/SCD (64-bit) instruction pair.

Note that code containing the LL/SC pair executes successfully when run natively (without Valgrind).

Octeon3 appears to be strict in its implementation of the LL/SC pair: any store instruction between LL and SC (even a store to a location unrelated to the one guarded by the LL/SC pair) causes the SC to fail. This is not the case on Octeon2, where loads/stores to unrelated memory locations between LL and SC do not cause the SC to fail.

Here is a testcase that demonstrates the above observation. It executes successfully on Octeon2 but hangs on Octeon3 when executed natively.

/* llsc.c */
#include <stdio.h>

int main() {
  unsigned long mem = 5;
  unsigned long *p = &mem;
  int n = 4;
  unsigned long block[3] = {
    (unsigned long)p,
    (unsigned long)n,
    0x0 };

  do {
    __asm__ __volatile__ (
                          "move  $t0, %0"       "\n\t"
                          "ld    $t1, 0($t0)"   "\n\t" //p
                          "ld    $t2, 8($t0)"   "\n\t" //n
                          "lld   $t3, 0($t1)"   "\n\t" // t3 = *p
                          "sd    $t3, 16($t0)"  "\n\t" // block[2] = t3 deliberate store that cases SC to fail on Octeon3 but not on Octeon2
                          "daddu $t3, $t3, $t2" "\n\t" // t3 = t3 + n
                          "scd   $t3, 0($t1)"   "\n\t" // *p = t3
                          "sd    $t3, 16($t0)"  "\n\t" // block[2] = t3
                          : /* out */
                          : /* in */ "r" (&block[0])
                          : /* trash */ "memory", "t0", "t1", "t2", "t3"
                          );
        printf("retry\n");
   }while (block[2] != 1);

  printf("%ld\n", *p);
  return 0;
}


As Valgrind instruments guest code, the arbitrary load/store instructions added between the LL and SC cause the instrumented guest code to fail to progress beyond the SC.


Original Semantics of LL/SC:
LL rt, offset(base) // load Mem[base + offset] into rt and mark the start of the RMW region

... // Read Modify Write(RMW) region

SC rt, offset(base) // if Mem[base + offset] has not been accessed by another thread
                    // (i.e. Mem[base + offset] still holds the old value),
                    // store the value in rt to Mem[base + offset] and
                    // set the success code (1) in rt.
                    // This marks the end of the RMW region. If the required
                    // conditions are not met, set the failure code (0) in rt and skip the store.
For a precise description of the LL/SC semantics, refer to the MIPS64 ISA document.

One of the conditions under which the success of SC is unpredictable, as given in the MIPS64 ISA:
"A load or store executed on the processor executing the LL and SC that is not to the block of synchronizable
physical memory containing the word. (The load or store may cause a cache eviction between the LL and SC that
results in SC failure. The load or store does not necessarily have to occur between the LL and SC.)"

A similar statement is mentioned in the SCD section (SCD's success is unpredictable when):
"A memory access instruction (load, store, or prefetch) is executed on the processor executing the LLD/SCD."

(A similar restriction on the success of store conditional instruction is also seen in aarch64 architecture manual.)

In line with the uncertainty described in the above specification, and as illustrated by the testcase, a store (SD) instruction between LL and SC causes the SC to fail on Octeon3, while the SC succeeds on Octeon2.

Since the instrumentation logic is determined by the Valgrind tool (plugin) and not by the Valgrind core, it may not be possible to prevent Valgrind from inserting memory operations between the LL(D)/SC(D) pair. Correct me if I am wrong.

The initial proposal is to emulate the LL/SC pair: the LL instruction will be replaced by a normal load, and SC will be replaced by a sequence of instructions that emulates the effect of SC.

Here is a rough overview of what can be done on seeing the LL/SC pair (a compact C-style sketch follows the steps below):

LL rt, offset(base)
-> Step 1: valgrind will save the address base + offset (let's call it LLaddr) and the current value at that address (let's call it LLdata) in the VEX guest state of the current thread
-> Step 2: LL will be replaced by normal load

... (RMW region)
-> valgrind will instrument the instructions as usual.

SC rt, offset(base)
-> valgrind will do the following
let's say the location is SCaddr (base + offset) and the current value at SCaddr is SCdata
Step 1: if LLaddr != SCaddr, set rt to 0 (SC failure) and go to next instruction
/*
LLaddr is the location referred to by the LL instruction; it was saved in the thread's guest state when the most recent LL instruction was encountered.
According to the specification, the location must be the same in the LL/SC pair; otherwise, SC will fail.
*/

Step 2: if LLdata != SCdata, set rt to 0 (SC failure) and go to next instruction
/* Similar to LLaddr, LLdata was saved in the guest state when the most recent LL instruction was encountered.
Comparing LLdata and SCdata helps ensure that the memory location has not been updated with a different value by another thread (as required by the LL/SC semantics).
*/

Step 3:
Otherwise, using a normal store instruction, store the value of rt in SCaddr,
set rt to 1 (to indicate SC has succeeded),
and jump to the next instruction.
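
For concreteness, here is a compact C-style sketch of those three steps, purely for discussion; the names gs_LLaddr/gs_LLdata stand for the values saved in the per-thread guest state at the last LL and are not the actual VEX field names. The return value is what would be written back into rt.

/* Illustrative sketch only -- not the actual VEX implementation. */
unsigned long emulate_scd(unsigned long *SCaddr,    /* base + offset of the SC     */
                          unsigned long rt_value,   /* value the SC wants to store */
                          unsigned long gs_LLaddr,  /* saved at the last LL        */
                          unsigned long gs_LLdata)  /* saved at the last LL        */
{
  if ((unsigned long)SCaddr != gs_LLaddr) /* Step 1: different location        */
    return 0;                             /*         -> SC fails               */
  if (*SCaddr != gs_LLdata)               /* Step 2: value changed since LL    */
    return 0;                             /*         -> SC fails               */
  *SCaddr = rt_value;                     /* Step 3: plain store ...           */
  return 1;                               /*         ... and report success    */
}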

As far as I have observed, all of the steps to emulate SC are part of one Valgrind basic block. (I am not sure whether it is possible for the above sequence to be split into two basic blocks.) Since Valgrind serialises the guest threads, it is assumed that the emulation of SC is not interrupted by another thread and hence executes "atomically". (I am not familiar with Valgrind's signal handling, so I am not sure whether signals could disrupt this "atomicity".)

The advantage of this proposal is that Valgrind is free to insert memory access instructions between the LL/SC pair without affecting the progress of the guest application.

One disadvantage is that this emulation does not guarantee that the ABA problem will not occur. (The ABA problem cannot arise with LL/SC's original semantics.)
ABA problem: http://en.wikipedia.org/wiki/ABA_problem
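
For illustration, a minimal single-threaded sketch of the hazard (hypothetical, not the attached testcase): the emulated SC only compares values, so an intervening write that restores the old value goes undetected, whereas a real SC would fail because the location was written at all.

/* aba_sketch.c -- illustrative only */
#include <stdio.h>

int main(void) {
  unsigned long mem    = 0xA;   /* shared location                    */
  unsigned long LLdata = mem;   /* emulated LL: snapshot of the value */

  /* Imagine another thread runs here while this one is descheduled.  */
  mem = 0xB;                    /* the value changes ...              */
  mem = 0xA;                    /* ... and changes back (A -> B -> A) */

  /* Emulated SC: a value comparison cannot see the A->B->A history.  */
  if (mem == LLdata) {
    mem = LLdata + 1;
    printf("emulated SC succeeded (ABA undetected)\n");
  } else {
    printf("emulated SC failed\n");
  }
  return 0;
}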

Please consider this proposal as a base for discussion.

PS: As far as I have searched, I do not see an existing Valgrind bug that discusses this issue. If I have overlooked one, please direct me to it.

Reproducible: Always

Steps to Reproduce:
Execute a guest application that uses LL/SC under Valgrind on Octeon3.
Since libc uses LL/SC in many places, it is likely that virtually all applications fail to make progress on Octeon3.
Comment 1 Maran Pakkirisamy 2015-02-24 12:15:26 UTC
Created attachment 91266 [details]
patch to emulate LL/SC

The patch implements the proposal made in the description. It is enabled only for Octeon3 as it is a rudimentary attempt.
Comment 2 Julian Seward 2015-03-05 15:13:41 UTC
This is a clever idea to solve the problem.  It is a shame however that
the ABA semantics are not the same as for LL/SC natively.  That concerns
me because it means that there could be programs which, with this patch,
execute incorrectly.  Can you think of any way to avoid this problem?

Actually, as a first step, can you write a test case to show the ABA problem?
It might be possible to fix the problem.  But we need a test case first.
Comment 3 Julian Seward 2015-03-05 15:23:40 UTC
The idea basically is that, when a thread switch occurs (which is something
the Valgrind core can observe) it adjusts the LLdata and LLaddr fields of
all threads so as to cause any subsequent SCs to fail.  I think this might
fix the ABA problem, because IIUC the ABA problem requires some other
thread to run in addition to the thread doing the LL/SC.  And the V core
serialises thread execution.  So it observes the thread doing LL/SC being
de-scheduled in between the LL and SC, and adjusts its guest state so as to
cause the SC to fail when that thread is later resumed.

In fact it would probably be enough to invalidate any in-progress LLs only
for the thread being de-scheduled.  I can help find the place to put the
invalidation if that's useful.
Comment 4 Maran Pakkirisamy 2015-03-17 12:23:02 UTC
Created attachment 91595 [details]
testcase to show ABA problem

case 1:
./aba x
Thread 1(main): execute LL on X
Thread 2 (updater_X) : update the mem location X
Thread 1(main): execute SC on X

case 2:
./aba
Thread 1(main): execute LL on X (X is the memory location)
Thread 2 (updater_nonX) : update a mem location unrelated to X
Thread 1(main): execute SC on X

On Octeon2, when executed natively, case 1 fails, in line with the semantics of LL/SC. Case 2 passes: the MIPS ISA leaves this case (updating an arbitrary memory location between LL and SC) unpredictable, and Octeon2 chooses to let the SC succeed.

On Octeon3, when executed natively, case 1 fails, in line with the semantics of LL/SC. Case 2 fails as well, because Octeon3 chooses to fail the SC in this case.

With the LL/SC emulation patch, both cases pass (SC succeeds) when the testcase is executed under the patched Valgrind on Octeon3.

"it would probably be enough to invalidate any in-progress LLs only for the thread being de-scheduled." - I think this should solve the problem. Can you help me to identify the places?
Comment 5 Julian Seward 2015-03-17 14:17:25 UTC
> "it would probably be enough to invalidate any in-progress LLs only for the
> thread being de-scheduled." - I think this should solve the problem. Can you
> help me to identify the places?

coregrind/m_scheduler/scheduler.c  run_thread_for_a_while()
The SCHEDSETJMP macro call causes code to actually run.  So,
any point after the SCHEDSETJMP call and before the end of the
function.
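
To make the suggestion concrete, here is a hedged sketch of the invalidation; the guest-state field name (guest_LLaddr) follows the attached patch, and both it and the exact call site should be treated as assumptions rather than the committed code.

/* Sketch only: invalidate any in-progress LL for the thread that has just
   been de-scheduled, so that a later SC in that thread takes the failure
   path.  Intended to be called in run_thread_for_a_while(), after the
   SCHEDSETJMP call and before the function returns. */
static void invalidate_in_progress_LL ( ThreadId tid )
{
   ThreadState* tst = VG_(get_ThreadState)(tid);
   tst->arch.vex.guest_LLaddr = 0;   /* 0 == "no LL/SC transaction pending" */
}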
Comment 6 Maran Pakkirisamy 2015-03-19 12:17:21 UTC
> coregrind/m_scheduler/scheduler.c  run_thread_for_a_while()
> The SCHEDSETJMP macro call causes code to actually run.  So,
> any point after the SCHEDSETJMP call and before the end of the
> function.

With your suggested fix, the patch passes the aba testcase. However, memcheck/tests/atomic_incs fails. That testcase forks another process. If I alter the testcase so that it spawns 2 threads instead of 2 processes, the testcase passes.

How does Valgrind treat applications that fork? Is it different from a multi-threaded application? Any pointers to the part of the code I should look into would help.
Comment 7 Julian Seward 2015-03-19 13:21:18 UTC
(In reply to Maran Pakkirisamy from comment #6)
> With your suggested fix, the patch passes the aba testcase.

Good.

> However, memcheck/tests/atomic_incs fail.

Hmm, that's serious.  Because Valgrind serialises thread execution
(that is, only one thread runs at once), the change you made -- to use
threads rather than processes -- has the effect of not testing the
atomicity properties at all, because there is only ever one thread running
at any one time.  That's why the test uses 2 processes -- they then
genuinely run in parallel and genuinely test the atomicity properties
for these insns.

The same applies to any test program you write that involves
multiple threads but only one process.
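
A minimal sketch of that style of two-process test (modelled loosely on the idea behind memcheck/tests/atomic_incs, not copied from it; the use of GCC's __sync builtin for the atomic increment is an assumption):

/* two_proc_incs.c -- sketch only */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define ITERS 100000

int main(void)
{
  unsigned long *p = mmap(NULL, sizeof *p, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) { perror("mmap"); return 1; }
  *p = 0;

  pid_t pid = fork();
  if (pid < 0) { perror("fork"); return 1; }

  /* Parent and child genuinely run in parallel and race on *p.
     On MIPS this builtin compiles to an LL/SC retry loop. */
  for (int i = 0; i < ITERS; i++)
    __sync_fetch_and_add(p, 1);

  if (pid == 0) return 0;          /* child is done */
  waitpid(pid, NULL, 0);
  printf("%lu (expected %d)\n", *p, 2 * ITERS);  /* lost increments => not atomic */
  return 0;
}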

I suppose what you can infer from that is that your implementation is
somehow not providing the atomicity that you expect it to.  So, I'm
not really sure where that points, except to re-check the
implementation and its assumptions, somehow.

I hope you can figure this out, because fixing it would solve an
important problem on all targets that use LL/SC.
Comment 8 Maran Pakkirisamy 2015-03-26 07:18:47 UTC
(In reply to Julian Seward from comment #7)
> I suppose what you can infer from that is that your implementation is
> somehow not providing the atomicity that you expect it to.  So, I'm
> not really sure where that points, except to re-check the
> implementation and its assumptions, somehow.

Right. When I made the initial proposal, I did not give thought to system-wide atomicity involving other processes. I have updated the patches so that the whole LL/RMW/SC sequence is executed atomically. (ABA is still a problem :( )

With the update, only step 3 is changed.
That is when SC is encountered,
a) if LLaddr == 0, set rt to 0 to indicate SC has failed and move on to the next instruction.
(LLaddr == 0 indicates the thread was de-scheduled before the SC. To ensure that we don't run into ABA among the threads of the instrumented application, LLaddr is invalidated (set to 0) whenever another thread is scheduled.)

b) if LLaddr != SCaddr, set rt to 0 to indicate SC has failed and move on to the next instruction.
(LLaddr != SCaddr indicates that the memory location of the SC does not correspond to that of the last LL, which is a failure according to the specification.)

c) (neither a nor b applies) Execute the SC under a CAS. That is, the store will be performed only if LLdata (saved when the LL was executed) is the same as SCdata (the current value in memory), and this check-and-store is performed atomically using CAS. This is the significant change from the initial proposal.

a) and b) help to handle the ABA scenario, but only among the threads of the profiled multithreaded application.
c) helps to ensure atomicity among all processes/threads; however, it does not catch the ABA case in that scenario.

The reason is that the RMW region is effectively reduced to a CAS, and hence the ABA problem persists.
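
A hedged sketch of the revised step (c), with the conditional store carried out by a real hardware CAS so that the value check and the store are atomic with respect to other threads and processes. The names are illustrative, and for clarity this sketch treats a failed CAS as SC failure.

/* Sketch only -- not the attached patch. */
unsigned long emulate_scd_with_cas(unsigned long *SCaddr, unsigned long rt_value,
                                   unsigned long gs_LLaddr, unsigned long gs_LLdata)
{
  if (gs_LLaddr == 0)                       /* (a) thread was rescheduled since LL */
    return 0;
  if ((unsigned long)SCaddr != gs_LLaddr)   /* (b) SC location differs from LL's   */
    return 0;
  /* (c) store only if the location still holds the value seen at the LL;
     the comparison and the store happen atomically in the CAS.           */
  return __sync_bool_compare_and_swap(SCaddr, gs_LLdata, rt_value) ? 1 : 0;
}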

However, with the patch, memcheck/tests/atomic_incs succeeds and no regressions are observed.
I am not able to come up with a testcase that has an ABA race among processes and that fails on V.  (I took the aba testcase for threads (attachment #91595 [details]) and modified it to spawn processes. The resulting testcase gives the same result (SC fails) with and without Valgrind, so I do not yet have a testcase to prove that the ABA problem among processes remains.)
Comment 9 Maran Pakkirisamy 2015-03-26 07:21:16 UTC
Created attachment 91744 [details]
fix atomicity - valgrind
Comment 10 Maran Pakkirisamy 2015-03-26 07:22:04 UTC
Created attachment 91745 [details]
fix atomicity - VEX
Comment 11 Maran Pakkirisamy 2015-03-26 08:45:39 UTC
I doubt that, with the proposed approach of LL/SC emulation, the ABA problem can be addressed at all.

Here is an alternative proposal.
1) On encountering LL, convert it into a normal load and start saving the instructions in a buffer.
2) RMW: usual instrumentation; in addition, the instructions are appended to the buffer.
3) On encountering SC, execute the code buffer (LL/RMW/SC) as a (dirty) helper and store the result of the SC in the helper as the result of the SC in the IR. Then clear the code buffer.

I am not sure whether this is feasible and, if so, whether it can be done without making many changes to the existing framework.
Comment 12 Julian Seward 2016-10-24 05:48:01 UTC
Maran, the same problem has been reported for ARM64/OCTEON3 at
https://bugs.kde.org/show_bug.cgi?id=369459.   So let me ask: how well
does your proposed solution in comments 9 and 10 work?  Did you deploy
it?
Comment 13 Maran Pakkirisamy 2017-01-05 06:52:46 UTC
Our internal team continuously regtests Valgrind with the patch (the solution in comments 9 and 10) and also ships the package. No issues related to this bug have been reported so far.
Comment 14 Petar Jovanovic 2017-03-13 18:54:21 UTC
The changes have been committed as:

Valgrind r16269
VEX r3316

Thanks Maran for the patches and sorry for the very long delay on this issue.
Comment 15 Petar Jovanovic 2017-03-14 17:15:16 UTC
We should close this issue now.
Comment 16 Julian Seward 2017-04-20 08:53:28 UTC
(In reply to Maran Pakkirisamy from comment #8)

Maran,

I was studying this bug and your fix, so as to see how to apply it to ARM.
I have a question:

> With the update, only step 3 is changed.
> [..]
> c) (a and b failed). Execute SC under CAS. That is, store will be performed
> if LLdata (saved when LL was executed) is same as SCdata (the current value
> at mem). And this step is performed atomically using CAS. This is the
> significant part of the change from the initial proposal. 

Your implementation ignores whether the CAS was successful or not.

Should it instead cause the SC to fail (set rt to zero) if the CAS fails?
Comment 17 Julian Seward 2017-04-20 09:00:51 UTC
Also, your implementation uses guest_state.LLaddr == 0 to mean "there
is no transaction in progress".  So guest_state.LLaddr == 0 has a special
meaning that is different from all other values.

I'd prefer to remove that special meaning and instead use a third
boolean field.  Also, it seems safer to invalidate any in-progress
transaction when the thread is scheduled, not de-scheduled.  So my
pseudocode summary so far is

  LoadLinked(addr)

    gs.LLvalid = 1
    gs.LLaddr  = addr
    gs.LLdata  = *addr

  StoreCond(addr, data)

    if gs.LLvalid != 1    -> fail
    if addr != gs.LLaddr  -> fail
    if *addr != gs.LLdata -> fail
    cas_ok = CAS(addr, gs.LLdata -> data)
    if !cas_ok            -> fail
    gs.LLvalid = 0
    succeed

  When thread scheduled

    gs.LLvalid = 0
Comment 18 Julian Seward 2017-04-20 09:14:51 UTC
Actually, I'd prefer to have it so that any attempt to do StoreCond
will cause all following attempts to fail, up until a new LoadLinked
is done.  How does that sound?

  LoadLinked(addr)

    gs.LLvalid = 1
    gs.LLaddr  = addr
    gs.LLdata  = *addr

  StoreCond(addr, data)

    tmp_LLvalid = gs.LLvalid
    gs.LLvalid = 0
    if tmp_LLvalid != 1   -> fail
    if addr != gs.LLaddr  -> fail
    if *addr != gs.LLdata -> fail
    cas_ok = CAS(addr, gs.LLdata -> data)
    if !cas_ok            -> fail
    succeed

  When thread scheduled

    gs.LLvalid = 0
Comment 19 Julian Seward 2017-04-20 16:05:45 UTC
Hmm, even that is too relaxed, because it doesn't reject mismatched
load vs store sizes.  Here's a variant that does check sizes, and 
uses size == 0 to mean "no transaction pending".

LoadLinked(addr)

  gs.LLsize = load_size // 1, 2, 4 or 8
  gs.LLaddr = addr
  gs.LLdata = zeroExtend(*addr)

StoreCond(addr, data)

  tmp_LLsize = gs.LLsize
  gs.LLsize = 0 // "no transaction"
  if tmp_LLsize != store_size        -> fail
  if addr != gs.LLaddr               -> fail
  if zeroExtend(*addr) != gs.LLdata  -> fail
  cas_ok = CAS(store_size, addr, gs.LLdata -> data)
  if !cas_ok                         -> fail
  succeed

When thread scheduled

  gs.LLsize = 0 // "no transaction"
Comment 20 Maran Pakkirisamy 2017-05-22 07:03:37 UTC
Sorry, I only just noticed this update.
I then found in bug #369459 that the issues have been taken care of and fixed. Thanks.