Bug 369459 - valgrind on arm64 violates the ARMv8 spec (ldxr/stxr)
Summary: valgrind on arm64 violates the ARMv8 spec (ldxr/stxr)
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: vex
Version: 3.12 SVN
Platform: Compiled Sources Linux
Importance: NOR critical
Target Milestone: ---
Assignee: Julian Seward
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-09-28 08:19 UTC by Andrew Pinski
Modified: 2017-05-22 08:32 UTC
CC List: 3 users

See Also:
Latest Commit:
Version Fixed In:


Attachments

Description Andrew Pinski 2016-09-28 08:19:07 UTC
So on ARMv8, valgrind invokes undefined behavior with respect to ldxr/stxr.  This causes valgrind to hang on ThunderX machines.
Here is the disassembly when I attach gdb to the valgrind process:
   0x00000008034001a4:  ldxr    w2, [x4]
   0x00000008034001a8:  mov     x22, x2
   0x00000008034001ac:  and     x24, x23, #0xffffffff
   0x00000008034001b0:  and     x23, x22, #0xffffffff
   0x00000008034001b4:  str     x24, [x21,#928]
   0x00000008034001b8:  str     x23, [x21,#32]
   0x00000008034001bc:  dsb     sy
=> 0x00000008034001c0:  dmb     sy
   0x00000008034001c4:  isb
….
   0x00000008034000e4: stxr    w0, w2, [x4]


As you can see, there are plenty of stores and even DSB/DMB/ISB instructions in between the ldxr and the stxr.  All of this violates the ARM spec's constraints on exclusive sequences, which means the stxr is allowed to fail every time, and on ThunderX it does.
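
For contrast, here is a minimal sketch (purely illustrative, not taken from this report) of a spec-conformant exclusive sequence, assuming GCC-style inline asm on AArch64: nothing but register-to-register work sits between the ldxr and the stxr, and the pair is retried on failure.

#include <stdint.h>

/* Atomically add n to *p and return the new value.  Illustrative only. */
static inline uint32_t atomic_add_u32(uint32_t *p, uint32_t n)
{
   uint32_t newval, status;
   do {
      __asm__ __volatile__(
         "ldxr  %w0, [%2]      \n\t"   /* load-exclusive the current value     */
         "add   %w0, %w0, %w3  \n\t"   /* compute the new value in a register  */
         "stxr  %w1, %w0, [%2] \n\t"   /* store-exclusive; %w1 == 0 on success */
         : "=&r"(newval), "=&r"(status)
         : "r"(p), "r"(n)
         : "memory");
   } while (status != 0);              /* retry until the exclusive pair succeeds */
   return newval;
}
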
Comment 1 Andrew Pinski 2016-09-28 08:22:36 UTC
This is very much related to https://bugs.kde.org/show_bug.cgi?id=344524 (which was reported for MIPS; it just happens that both are Cavium cores too).
Comment 2 Peter Maydell 2016-09-28 14:46:15 UTC
I don't think this is strictly speaking invocation of undefined behaviour / UNPREDICTABLE -- it just isn't guaranteed that the code sequence will make forward progress.
Comment 3 Julian Seward 2016-09-29 04:15:52 UTC
Andrew, do you know which implementation this is?  eg is it a Cortex A-something,
or something else?
Comment 4 Andrew Pinski 2016-09-29 04:48:32 UTC
(In reply to Julian Seward from comment #3)
> Andrew, do you know which implementation this is?  eg is it a Cortex
> A-something,
> or something else?

It is a custom ARMv8-A core produced by Cavium.  It is not Cortex-A based at all.  We had a similar issue with the Oracle JDK, but we worked with Oracle to fix it; that was a year or so ago.

As mentioned, we have a similar issue with the MIPS64 implementation of valgrind on our (Cavium's) MIPS64r3 core (Octeon 3).  I suspect this is going to need a fundamental change in valgrind to fix both of these issues.

One idea I have is to pattern match on the ldxr/stxr sequence and produce a single instruction in the IR
Comment 5 Andrew Pinski 2016-09-29 04:49:33 UTC
Oops, pressed save too soon.

One idea I have is to pattern match on the ldxr/stxr sequence and produce a single instruction in the IR and then decode them after the fact.
Comment 6 Julian Seward 2016-10-23 17:31:54 UTC
(In reply to Andrew Pinski from comment #5)
> One idea I have is to pattern match on the ldxr/stxr sequence and produce a
> single instruction in the IR and then decode them after the fact.

On consideration, I think this is the least-worst option, but I would propose
to do it probably differently to how you envisage.  I would leave the front
end alone (guest_arm64_toIR.c) and instead do pattern matching on the IR
after initial optimisation and before instrumentation.  That keeps the IR
target independent, and it has the benefit of solving the problem for all
targets, not just ARM64.  The result of the pattern matching would be
encoded into a new IRStmt kind which encapsulates a complete LL-SC
pair.

This does have the restriction that the LL and SC would have to be within
the same (extended) basic block, since otherwise they won't be in the same
IRSB, but I don't think that's much of a restriction.

We would then have to mess with all tools and all compilation pipeline
stages to handle this new construction, but there's no way around that;
and we'd need to take care, in register allocation, that there will be no
spilling in between the emitted LL and SC, since that would potentially
scupper the entire enterprise.
Comment 7 Julian Seward 2016-10-23 17:49:22 UTC
Hmm, I see stuff like this:

    9858:       885ffe62        ldaxr   w2, [x19]
    985c:       6b1f005f        cmp     w2, wzr
    9860:       54fff7e1        b.ne    975c <__pthread_mutex_lock+0x44>
    9864:       88017e60        stxr    w1, w0, [x19]

so they are not in the same block.  Hmm.  Back to the drawing board.
Comment 8 Peter Maydell 2016-10-23 19:32:43 UTC
FWIW QEMU is switching to emulating atomics (including ll/sc) via a common "cmpxchg" IR operation (which is then implemented in the backend via an ll/sc loop or whatever the host CPU has). This results in not-strictly-correct behaviour (it has the ABA problem) but realworld code doesn't really care. (I don't know much of the detail as I haven't been involved in the implementation.)
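
A minimal sketch of that scheme, using C11 atomics (the names are illustrative, not QEMU's actual IR): the LL records the value it observed, and the SC is modelled as a compare-and-swap against that recorded value.  If another thread changes the location from A to B and back to A in between, the CAS still succeeds, which is exactly the ABA hole mentioned above.

#include <stdatomic.h>
#include <stdint.h>

typedef struct { uintptr_t addr; uint64_t val; } LLSCState;

static uint64_t emu_ll(LLSCState *s, _Atomic uint64_t *p)
{
   s->addr = (uintptr_t)p;
   s->val  = atomic_load(p);           /* remember what we saw */
   return s->val;
}

static int emu_sc(LLSCState *s, _Atomic uint64_t *p, uint64_t newval)
{
   uint64_t expected = s->val;
   if ((uintptr_t)p != s->addr)
      return 0;                        /* SC to a different address fails */
   /* Succeeds iff *p still holds the recorded value -- ABA-blind. */
   return atomic_compare_exchange_strong(p, &expected, newval);
}
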
Comment 9 Julian Seward 2016-10-24 05:35:28 UTC
Maybe we could use Maran's proposal for fixing the same problem on
MIPS OCTEON3.  https://bugs.kde.org/show_bug.cgi?id=344524#c8 (and 9
and 10).

This provides a correct implementation, including coverage of ABA
cases, for threads within an application, and (IIUC) CAS-level
atomicity between a Valgrinded process and a non-Valgrinded process
that communicate via shared memory.  So it's not perfect, but it might
be good enough.  Also, it's simple and non-intrusive and so could be
done quickly.

I would be inclined to preserve the current implementation and use the
above only for cores where the current implementation fails, since the
current implementation is ABA-correct both for threads within a
process and across processes.  IOW the proposal is a best-effort
fallback.

Andrew, is there a way to automatically detect the relevant cores at
startup time, so that V can transparently switch to using the fallback
implementation without intervention from the user?

Peter, do you have any further information about the assertion that
"realworld code doesn't really care"?  I'd be interested to know more
about that.
Comment 10 Andrew Pinski 2016-10-24 05:51:43 UTC
> Andrew, is there a way to automatically detect the relevant cores at
> startup time

You could scan /proc/cpuinfo for the "CPU implementer" value of 0x43 ('C'), but that only catches the current cores which show this micro-architectural behaviour.  Someone else could implement the same behaviour in the future, since doing so still follows the architecture to the letter.
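
A minimal sketch of such a check, assuming the usual Linux /proc/cpuinfo layout ("CPU implementer : 0x43" for Cavium); illustrative only, and as noted it says nothing about future cores, nor about heterogeneous systems that list several implementer values.

#include <stdio.h>

static int cpuinfo_reports_cavium(void)
{
   char line[256];
   int found = 0;
   FILE *f = fopen("/proc/cpuinfo", "r");
   if (!f)
      return 0;
   while (fgets(line, sizeof line, f)) {
      unsigned int imp;
      /* Matches lines such as "CPU implementer\t: 0x43". */
      if (sscanf(line, "CPU implementer : %x", &imp) == 1 && imp == 0x43) {
         found = 1;                     /* 0x43 == 'C' == Cavium */
         break;
      }
   }
   fclose(f);
   return found;
}
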
Comment 11 Julian Seward 2016-10-24 06:07:11 UTC
(In reply to Andrew Pinski from comment #10)
Another possibility is to run a test sequence on the host CPU at startup,
whilst we are still single threaded, containing LL, SC and some stores in
between, and see if it fails.
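
A sketch of that probe, assuming GCC-style inline asm on AArch64 (the function name and retry count are invented for illustration): it does an exclusive load, an intervening plain store, then the exclusive store, and reports whether the pair ever succeeds.  As comment 14 below points out, a passing probe would not guarantee that later-generated code also makes progress.

#include <stdint.h>

static int llsc_with_intervening_store_works(void)
{
   uint32_t cell = 0, scratch = 0;
   for (int i = 0; i < 100; i++) {
      uint32_t tmp, status;
      __asm__ __volatile__(
         "ldxr  %w0, [%2]      \n\t"    /* load-exclusive               */
         "str   %w0, [%3]      \n\t"    /* intervening ordinary store   */
         "stxr  %w1, %w0, [%2] \n\t"    /* store-exclusive; 0 = success */
         : "=&r"(tmp), "=&r"(status)
         : "r"(&cell), "r"(&scratch)
         : "memory");
      if (status == 0)
         return 1;                      /* the sequence can succeed here */
   }
   return 0;                            /* persistently fails: use the fallback */
}
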
Comment 12 Julian Seward 2016-10-24 06:10:19 UTC
Andrew, do you know how well Maran's proposal
https://bugs.kde.org/show_bug.cgi?id=344524#c8 worked on MIPS64r3 (Octeon 3) ?
IOW is it worth taking and generalising?
Comment 13 Andrew Pinski 2016-10-24 06:25:03 UTC
(In reply to Julian Seward from comment #12)
> Andrew, do you know how well Maran's proposal
> https://bugs.kde.org/show_bug.cgi?id=344524#c8 worked on MIPS64r3 (Octeon 3)
> ?
> IOW is it worth taking and generalising?

We did not test it much: we only tested it for a few weeks and were going to let Imagination take it over, to see whether it worked more widely and to fix it for MIPS, but they never did.
Comment 14 Peter Maydell 2016-10-24 09:38:11 UTC
I think the assertion about "real world code not caring" is based on some popular CPUs not having an ll/sc combination (like x86!), and so portable atomicity primitives can't assume you don't get ABA because they have to work with a lowest-common-denominator cmpxchg CPU.

I think your suggestion in comment 11 about trying to autodetect this kind of core sounds extremely fragile. The architecture says "don't do this", and implementations can take advantage, including perhaps being able to make forward progress in some cases even if not in others, so you might find your autodetect test code passed but later generated code didn't. Plus on big.LITTLE you might later be running on a CPU with a different microarchitecture from the one you ran your autodetection on...
Comment 15 Julian Seward 2016-10-24 09:55:19 UTC
(In reply to Peter Maydell from comment #14)

> [..] so you might find your
> autodetect test code passed but later generated code didn't.

True.

> Plus on big.LITTLE you might later be running on a CPU with a
> different microarchitecture from the one you ran your autodetection
> on...

Hmm, yes, good point.  I hadn't thought of that.
Comment 16 Julian Seward 2016-11-24 10:37:24 UTC
I would be happy to make a generalised version of Maran's fixes,
so as to get arm{32,64} working on these cores.  I don't really
want it to be the default implementation though, since we lose 
correctness w.r.t. the ABA problem against other processes in
shared memory, relative to using the existing implementation.

Ideally I'd like to have a command line argument 

  --fallback-llsc=[no|yes|auto]

to enable the alternative implementation when necessary, with
it defaulting to |auto| so as not to have to bother the user
with selecting flags for what is an arcane detail.  So the real
question here is how to reliably auto-detect when the fallback
implementation should be used.
Comment 17 Julian Seward 2017-04-24 09:27:32 UTC
Fixed.  It appears to work, to the extent that I can actually test it
without access to the hardware in question.  r3352, r16309.  Use
--sim-hints=fallback-llsc to use the alternative implementation.
Comment 18 Julian Seward 2017-04-24 09:36:40 UTC
(In reply to Julian Seward from comment #17)
> Fixed.  [..]

Andrew, can you test the fix, please?
Comment 19 Paul Osmialowski 2017-04-27 16:30:08 UTC
Hi Guys,

I tested the fix on our dual-CPU (2x48 ThunderX cores) Linux machine and I can confirm that it worked.

To make my life easier, I've tested it with the following change:

diff --git a/coregrind/m_translate.c b/coregrind/m_translate.c
index f763a947b..cec8af4aa 100644
--- a/coregrind/m_translate.c
+++ b/coregrind/m_translate.c
@@ -1706,8 +1706,7 @@ Bool VG_(translate) ( ThreadId tid,
 #  endif

 #  if defined(VGP_arm64_linux)
-   vex_abiinfo.guest__use_fallback_LLSC
-      = SimHintiS(SimHint_fallback_llsc, VG_(clo_sim_hints));
+   vex_abiinfo.guest__use_fallback_LLSC = True;
 #  endif

    /* Set up closure args. */
Comment 20 Ivo Raisr 2017-05-05 15:19:21 UTC
Do we consider this bug fixed now that --sim-hints=fallback-llsc is available?
Comment 21 Julian Seward 2017-05-11 14:14:22 UTC
(In reply to Paul Osmialowski from comment #19)
> I tested the fix on our dual-CPU (2x48 ThunderX cores) Linux machine and I
> can confirm that it worked.

Thanks for testing it!
Comment 22 Julian Seward 2017-05-16 05:36:10 UTC
Autodetection of Cavium cores, and hence auto-enablement of the
fallback implementation, was added in r16380.
Comment 23 Paul Osmialowski 2017-05-16 06:18:59 UTC
Are you sure you applied this patch to the VEX part too?

m_machine.c: In function 'vgPlain_parse_cpuinfo':
m_machine.c:810:10: error: 'VexArchInfo {aka struct <anonymous>}' has no member named 'arm64_requires_fallback_LLSC'
       vai.arm64_requires_fallback_LLSC = True;
          ^
m_machine.c: In function 'vgPlain_machine_get_hwcaps':
m_machine.c:1662:23: error: 'VexArchInfo {aka struct <anonymous>}' has no member named 'arm64_requires_fallback_LLSC'
                    vai.arm64_requires_fallback_LLSC ? "yes" : "no");
                       ^
Makefile:4274: recipe for target 'libcoregrind_arm64_linux_a-m_machine.o' failed
make[3]: *** [libcoregrind_arm64_linux_a-m_machine.o] Error 1
Comment 24 Julian Seward 2017-05-16 06:27:30 UTC
(In reply to Paul Osmialowski from comment #23)
> Are you sure you applied this patch on VEX part too?

Duh.  Not enough coffee (etc).  Committed, r3371.
Comment 25 Paul Osmialowski 2017-05-16 06:51:13 UTC
OK, now it builds and works.
Comment 26 Julian Seward 2017-05-22 08:30:22 UTC
Documentation updated in r16404.  Closing.