So on ARMv8, valgrind invokes undefined behavior with respect to ldxr/stxr. This causes valgrind to hang on ThunderX machines. Here is the disassembly when I attach gdb to the valgrind process:

   0x00000008034001a4:  ldxr  w2, [x4]
   0x00000008034001a8:  mov   x22, x2
   0x00000008034001ac:  and   x24, x23, #0xffffffff
   0x00000008034001b0:  and   x23, x22, #0xffffffff
   0x00000008034001b4:  str   x24, [x21,#928]
   0x00000008034001b8:  str   x23, [x21,#32]
   0x00000008034001bc:  dsb   sy
=> 0x00000008034001c0:  dmb   sy
   0x00000008034001c4:  isb
   ...
   0x00000008034000e4:  stxr  w0, w2, [x4]

As you can see, there are plenty of stores and even DSB/DMB/ISB barriers in between the ldxr and stxr. All of this violates the ARM spec, which permits the stxr to always fail in this situation -- and it does on ThunderX.
This is very much related to https://bugs.kde.org/show_bug.cgi?id=344524 (which was reported for MIPS; it just happens that both are Cavium cores too).
I don't think this is, strictly speaking, an invocation of undefined behaviour / UNPREDICTABLE -- it just isn't guaranteed that the code sequence will make forward progress.
Andrew, do you know which implementation this is? eg is it a Cortex A-something, or something else?
(In reply to Julian Seward from comment #3)
> Andrew, do you know which implementation this is? eg is it a Cortex
> A-something, or something else?

It is a custom ARMv8-A core produced by Cavium; it is not Cortex-A based at all. We had a similar issue with the Oracle JDK, but we worked with Oracle to fix it; that was a year or so ago. As mentioned, we have a similar issue with the MIPS64 implementation of valgrind on our (Cavium's) MIPS64r3 core (Octeon 3). I suspect this is going to need a fundamental change in valgrind to fix both of these issues. One idea I have is to pattern match on the ldxr/stxr sequence and produce a single instruction in the IR
Oops, pressed save too soon. One idea I have is to pattern match on the ldxr/stxr sequence, produce a single instruction in the IR, and then decode it after the fact.
(In reply to Andrew Pinski from comment #5)
> One idea I have is to pattern match on the ldxr/stxr sequence, produce a
> single instruction in the IR, and then decode it after the fact.

On consideration, I think this is the least-worst option, but I would propose to do it differently from how you envisage. I would leave the front end alone (guest_arm64_toIR.c) and instead do pattern matching on the IR after initial optimisation and before instrumentation. That keeps the IR target independent, and it has the benefit of solving the problem for all targets, not just ARM64.

The result of the pattern matching would be encoded into a new IRStmt kind which encapsulates a complete LL-SC pair. This does have the restriction that the LL and SC would have to be within the same (extended) basic block, since otherwise they won't be in the same IRSB, but I don't think that's much of a restriction.

We would then have to mess with all tools and pipeline compilation stages to handle this new construction, but there's no way around that; and we'd need to take care, in register allocation, that there is no spilling in between the emitted LL and SC, since that would potentially scupper the entire enterprise.
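For concreteness, a hypothetical sketch of what such a fused statement might carry -- the type and field names below are invented purely for illustration and are not the real VEX IR API:

/* Hypothetical sketch only: these names are made up for illustration
   and are not part of the real VEX IR.  One statement captures the
   whole LL-SC pair, so no instrumentation or spill code can end up
   between the two memory accesses the backend eventually emits. */

typedef struct IRExpr IRExpr;   /* stand-ins for the real IR types */
typedef unsigned      IRTemp;

typedef struct {
   IRExpr* addr;      /* the location the pair operates on            */
   IRTemp  oldVal;    /* temp receiving the value loaded by the LL    */
   IRExpr* storeVal;  /* value the SC attempts to store               */
   IRTemp  success;   /* temp receiving 1 if the SC succeeded, else 0 */
} IRLLSCPair;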
Hmm, I see stuff like this:

    9858:  885ffe62  ldaxr  w2, [x19]
    985c:  6b1f005f  cmp    w2, wzr
    9860:  54fff7e1  b.ne   975c <__pthread_mutex_lock+0x44>
    9864:  88017e60  stxr   w1, w0, [x19]

so they are not in the same block. Hmm. Back to the drawing board.
FWIW QEMU is switching to emulating atomics (including ll/sc) via a common "cmpxchg" IR operation (which is then implemented in the backend via an ll/sc loop or whatever the host CPU has). This results in not-strictly-correct behaviour (it has the ABA problem), but real-world code doesn't really care. (I don't know much of the detail as I haven't been involved in the implementation.)
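To make that concrete, here is a minimal sketch of mine (not QEMU's actual code) of LL/SC emulated with a plain compare-and-swap, using the GCC/Clang __atomic builtins; the comments mark where the ABA hole appears:

#include <stdint.h>
#include <stdbool.h>

/* Minimal sketch, not QEMU's implementation: the emulated LL just
   records the address and the loaded value; the emulated SC succeeds
   iff the location still holds that value. */

static uint64_t llsc_addr;   /* per-vCPU: address marked exclusive */
static uint64_t llsc_val;    /* per-vCPU: value seen by the LL     */

static uint64_t emulated_ll(uint64_t* p)
{
   llsc_addr = (uint64_t)p;
   llsc_val  = __atomic_load_n(p, __ATOMIC_ACQUIRE);
   return llsc_val;
}

static bool emulated_sc(uint64_t* p, uint64_t newval)
{
   uint64_t expected = llsc_val;
   if ((uint64_t)p != llsc_addr)
      return false;
   /* ABA hole: if another thread changed *p from A to B and back to
      A since the LL, a real SC would fail, but this CAS succeeds. */
   return __atomic_compare_exchange_n(p, &expected, newval,
                                      false /* strong CAS */,
                                      __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
}

The attraction is that the state is per-thread and the fast path is one host CAS, which every host has; the cost is exactly the ABA window noted in the comment.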
Maybe we could use Maran's proposal for fixing the same problem on MIPS OCTEON3: https://bugs.kde.org/show_bug.cgi?id=344524#c8 (and 9 and 10).

This provides a correct implementation, including coverage of ABA cases, for threads within an application, and (IIUC) CAS-level atomicity between a Valgrinded process and a non-Valgrinded process that communicate via shared memory. So it's not perfect, but it might be good enough. Also, it's simple and non-intrusive, and so could be done quickly.

I would be inclined to preserve the current implementation and use the above only for cores where the current implementation fails, since the current implementation is ABA-correct both for threads within a process and across processes. IOW, the proposal is a best-effort fallback.

Andrew, is there a way to automatically detect the relevant cores at startup time, so that V can transparently switch to using the fallback implementation without intervention from the user?

Peter, do you have any further information about the assertion that "real-world code doesn't really care"? I'd be interested to know more about that.
> Andrew, is there a way to automatically detect the relevant cores at
> startup time

You could scan /proc/cpuinfo for a "CPU implementer" value of 0x43 ('C'), but that only catches the current cores which have this micro-architectural behaviour. Someone else could produce a core with the same behaviour in the future, since it follows the architecture to the letter.
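A minimal sketch of that check (my own illustration, not Valgrind's actual code -- the real parsing would live alongside the existing /proc/cpuinfo handling in coregrind/m_machine.c):

#include <stdio.h>
#include <string.h>
#include <stdbool.h>

/* Sketch only: return true if /proc/cpuinfo reports any core with
   "CPU implementer : 0x43", i.e. Cavium.  On arm64 Linux each core
   contributes one such line. */
static bool is_cavium_core(void)
{
   char  line[256];
   bool  found = false;
   FILE* f = fopen("/proc/cpuinfo", "r");
   if (f == NULL)
      return false;
   while (fgets(line, sizeof line, f) != NULL) {
      if (strstr(line, "CPU implementer") != NULL
          && strstr(line, "0x43") != NULL) {
         found = true;
         break;
      }
   }
   fclose(f);
   return found;
}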
(In reply to Andrew Pinski from comment #10)

Another possibility is to run a test sequence on the host CPU at startup, whilst we are still single-threaded, containing an LL, an SC, and some stores in between, and see if the SC fails.
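Something like this, perhaps -- a sketch assuming an AArch64 Linux host, not actual Valgrind code:

#include <stdint.h>
#include <stdbool.h>

/* Sketch only: probe whether an ldxr/stxr pair still succeeds when a
   plain store and a barrier intervene -- the shape of code Valgrind's
   JIT emits.  Returns true if the stxr succeeded. */
static bool llsc_survives_intervening_stores(void)
{
#if defined(__aarch64__)
   /* Keep the plain store well away from the exclusively-monitored
      word: the exclusives reservation granule can architecturally be
      up to 2KB, and a store inside it may legitimately clear the
      monitor even on conforming cores. */
   static uint64_t buf[513];
   uint64_t loaded;
   uint32_t failed;
   __asm__ __volatile__(
      "ldxr  %[ld], [%[cellp]]          \n\t"  /* load-exclusive      */
      "str   %[ld], [%[scrp]]           \n\t"  /* intervening store   */
      "dmb   sy                         \n\t"  /* intervening barrier */
      "stxr  %w[st], %[ld], [%[cellp]]  \n\t"  /* 0 means success     */
      : [ld] "=&r" (loaded), [st] "=&r" (failed)
      : [cellp] "r" (&buf[0]), [scrp] "r" (&buf[512])
      : "memory");
   return failed == 0;
#else
   return true;
#endif
}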
Andrew, do you know how well Maran's proposal https://bugs.kde.org/show_bug.cgi?id=344524#c8 worked on MIPS64r3 (Octeon 3)? IOW, is it worth taking and generalising?
(In reply to Julian Seward from comment #12)
> Andrew, do you know how well Maran's proposal
> https://bugs.kde.org/show_bug.cgi?id=344524#c8 worked on MIPS64r3 (Octeon 3)?
> IOW, is it worth taking and generalising?

We did not test it much: we tested it for only a few weeks, and were going to let Imagination take over, to see if it worked further and to fix it for MIPS, but they never did.
I think the assertion about "real-world code not caring" is based on some popular CPUs not having an ll/sc combination (like x86!), so portable atomicity primitives can't assume you don't get ABA: they have to work with a lowest-common-denominator cmpxchg CPU.

I think your suggestion in comment 11 about trying to autodetect this kind of core sounds extremely fragile. The architecture says "don't do this", and implementations can take advantage, including perhaps being able to make forward progress in some cases even if not in others, so you might find your autodetect test code passed but later generated code didn't. Plus on big.LITTLE you might later be running on a CPU with a different microarchitecture from the one you ran your autodetection on...
(In reply to Peter Maydell from comment #14)
> [..] so you might find your
> autodetect test code passed but later generated code didn't.

True.

> Plus on big.LITTLE you might later be running on a CPU with a
> different microarchitecture from the one you ran your autodetection
> on...

Hmm, yes, good point. I hadn't thought of that.
I would be happy to make a generalised version of Maran's fixes, so as to get arm{32,64} working on these cores. I don't really want it to be the default implementation, though, since relative to the existing implementation we lose ABA-correctness against other processes in shared memory.

Ideally I'd like to have a command line argument --fallback-llsc=[no|yes|auto] to enable the alternative implementation when necessary, with it defaulting to |auto| so as not to have to bother the user with selecting flags for what is an arcane detail. So the real question here is how to reliably auto-detect when the fallback implementation should be used.
Fixed. It appears to work, to the extent that I can actually test it without access to the hardware in question. r3352, r16309. Use --sim-hints=fallback-llsc to use the alternative implementation.
(In reply to Julian Seward from comment #17)
> Fixed. [..]

Andrew, can you test the fix, please?
Hi Guys,

I tested the fix on our dual-CPU (2x48 ThunderX cores) Linux machine and I can confirm that it worked. To make my life easier, I've tested it with the following change:

diff --git a/coregrind/m_translate.c b/coregrind/m_translate.c
index f763a947b..cec8af4aa 100644
--- a/coregrind/m_translate.c
+++ b/coregrind/m_translate.c
@@ -1706,8 +1706,7 @@ Bool VG_(translate) ( ThreadId tid,
 #  endif
 
 #  if defined(VGP_arm64_linux)
-   vex_abiinfo.guest__use_fallback_LLSC
-      = SimHintiS(SimHint_fallback_llsc, VG_(clo_sim_hints));
+   vex_abiinfo.guest__use_fallback_LLSC = True;
 #  endif
 
    /* Set up closure args. */
Do we consider this bug fixed now that --sim-hints=fallback-llsc is available?
(In reply to Paul Osmialowski from comment #19)
> I tested the fix on our dual-CPU (2x48 ThunderX cores) Linux machine and I
> can confirm that it worked.

Thanks for testing it!
Autodetection of Cavium cores, and hence auto-enablement of the fallback implementation, was added in r16380.
Are you sure you applied this patch to the VEX part too?

m_machine.c: In function 'vgPlain_parse_cpuinfo':
m_machine.c:810:10: error: 'VexArchInfo {aka struct <anonymous>}' has no member named 'arm64_requires_fallback_LLSC'
     vai.arm64_requires_fallback_LLSC = True;
          ^
m_machine.c: In function 'vgPlain_machine_get_hwcaps':
m_machine.c:1662:23: error: 'VexArchInfo {aka struct <anonymous>}' has no member named 'arm64_requires_fallback_LLSC'
                       vai.arm64_requires_fallback_LLSC ? "yes" : "no");
                       ^
Makefile:4274: recipe for target 'libcoregrind_arm64_linux_a-m_machine.o' failed
make[3]: *** [libcoregrind_arm64_linux_a-m_machine.o] Error 1
(In reply to Paul Osmialowski from comment #23)
> Are you sure you applied this patch to the VEX part too?

Duh. Not enough coffee (etc). Committed, r3371.
OK, now it builds and works.
Documentation updated in r16404. Closing.