Sometimes Valgrind cannot unwind the stack correctly on s390x. For instance, consider this program: int fun_b(int *p) { return *p; } int fun_a(int *p) { return fun_b(p); } int main() { return fun_a(0); } when compiling with gcc 8.0.1 and running under Valgrind, the backtrace looks like this: ==100497== at 0x1000516: fun_b (in /home/arnez/tmp/backtrace-test) ==100497== by 0x1000551: fun_a (in /home/arnez/tmp/backtrace-test) ==100497== by 0x4866107: ??? (in /usr/lib64/libc-2.27.so) Note that "main" is missing from the backtrace. This problem also affects various test cases in Valgrind's test suite.
The assembly of fun_b from the sample program above starts like this: ldgr %f2,%r11 .cfi_register 11, 17 ldgr %f0,%r15 .cfi_register 15, 16 Which means that the entry values of r11 (frame pointer) and r15 (stack pointer) must be retrieved from floating-point registers f2 and f0, respectively. In leaf functions like this it is pretty common for newer GCCs to spill general-purpose registers into floating-point registers.
This is ungood; but that said: if the number of FP registers involved is small and fixed (eg, it's only ever f0/f1/f2/f3 usw) then we might be able to fix it within the existing unwind framework, by adding the relevant FP registers to the set of registers that are unwound for s390. Basically that would mean adding the relevant registers to struct D3UnwindRegs, and then adding code to update the new values.
To summarize some of the discussion on IRC about this topic today (please correct if I misunderstood anything): * Valgrind's unwinder needs to be extra fast. For instance, Memcheck takes a stack trace for each block allocated, and then again when it's freed. * In order to be fast, the unwinder limits the number of registers to unwind. Currently no platform unwinds more than 5 registers. * Initial register values are copied into an UnwindStartRegs in VG_(get_UnwindStartRegs) from the registers of the specified simulated threads. * In order to use FPRs for unwinding, they have to be added to the UnwindStartRegs and D3UnwindRegs structures, and summary rule fields for the s390 variant of DiCfSI must be added as well. Preferably we only add a few FPRs, to limit the performance impact. I'll probably look into that soon.
Some background info: Spilling into FPRs has been around for a while. In upstream GCC it was introduced for s390x with this commit: https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=203303 Which is described by this post on the gcc-patches mailing list: https://gcc.gnu.org/ml/gcc-patches/2013-10/msg00534.html
Created attachment 119229 [details] Proposed fix For s390x, adds unwinding of f0..f7, so as to be able the handle the case where GPRs are saved in caller-saved fp registers.
(In reply to Julian Seward from comment #5) > Created attachment 119229 [details] > Proposed fix > > For s390x, adds unwinding of f0..f7, so as to be able the handle the case > where GPRs are saved in caller-saved fp registers. Thanks! This certainly helps a lot. In my limited testing the unwinding seems to have worked consistently with this patch. Note that I've only tried GCC 8.0.1. A few comments: > [...] > --- a/coregrind/m_debuginfo/debuginfo.c > +++ b/coregrind/m_debuginfo/debuginfo.c > @@ -3199,6 +3199,15 @@ Addr ML_(get_CFA) ( Addr ip, Addr sp, Addr fp, > [...] > uregs.sp = sp; > uregs.fp = fp; > + /* JRS FIXME 3 Apr 2019: surely we can do better for f0..f7 */ > + uregs.f0 = 0; > + uregs.f1 = 0; I think we can skip the FPR initialization here. IIUC, we'd only need it if the CFA depended on an FPR. But this would be a very unusual compiler decision, because an FPR can't be used for addressing. I'd say we can rule that out. > [...] > @@ -3259,6 +3268,8 @@ void VG_(ppUnwindInfo) (Addr from, Addr to) > For arm, the unwound registers are: R7 R11 R12 R13 R14 R15. > > For arm64, the unwound registers are: X29(FP) X30(LR) SP PC. > + > + For s390, the unwound registers are: R11(FP) R14(LR) R15(SP) F0 F2 F4 … This doesn't match the current implementation, right? In fact it seems that F0-F7 are unwound, although F0, F2, and F4 are probably sufficient (I hope). Theoretically we could even avoid unwinding the FPRs at all; we just need them for the initial frame. Since all of F0-F7 are volatile, the caller can't really have saved anything there. By the way, the PC doesn't need to be unwound either, since unwound frames have PC == R14. > [...] > --- a/coregrind/m_debuginfo/storage.c > +++ b/coregrind/m_debuginfo/storage.c > @@ -138,7 +138,31 @@ void ML_(ppDiCfSI) ( const XArray* /* of CfiExpr */ e… > [...] > + VG_(printf)("oldF6"); \ > + } else \ > + if (_how == CFIR_S390X_F7) { \ > + VG_(printf)("oldF7"); \ > + } else{ \ Missing space. > [...] > --- a/coregrind/m_libcassert.c > +++ b/coregrind/m_libcassert.c > @@ -176,6 +176,15 @@ > [...] > (srP)->misc.S390X.r_fp = fp; \ > (srP)->misc.S390X.r_lr = lr; \ > + /* ANDREAS 3 Apr 2019 FIXME r_f0..r_f7 */ \ > + (srP)->misc.S390X.r_f0 = 0;/*FIXME*/ \ > + (srP)->misc.S390X.r_f1 = 0;/*FIXME*/ \ Let me look into this, I'll provide a suitable inline assembly. > [...] > --- a/coregrind/m_machine.c > +++ b/coregrind/m_machine.c > @@ -118,6 +118,23 @@ void VG_(get_UnwindStartRegs) ( /*OUT*/UnwindStartReg… > [...] > regs->misc.S390X.r_lr > = VG_(threads)[tid].arch.vex.guest_LR; > + /* ANDREAS 3 Apr 2019 FIXME r_f0..r_f7: is this correct? */ > + regs->misc.S390X.r_f0 > + = VG_(threads)[tid].arch.vex.guest_v0.w64[0]; Looks correct to me. The FPRs overlap with the lower-addressed halves of the respective vector registers. > [...] > --- a/coregrind/m_signals.c > +++ b/coregrind/m_signals.c > @@ -526,6 +526,15 @@ typedef struct SigQueue { > (srP)->r_sp = (ULong)((uc)->uc_mcontext.regs.gprs[15]); \ > (srP)->misc.S390X.r_fp = (uc)->uc_mcontext.regs.gprs[11]; \ > (srP)->misc.S390X.r_lr = (uc)->uc_mcontext.regs.gprs[14]; \ > + /* ANDREAS 3 Apr 2019 FIXME r_f0..r_f7: is this correct? */ \ > + (srP)->misc.S390X.r_f0 = (uc)->uc_mcontext.fpregs.fprs[0]; \ Looks OK.
Thanks for the testing and review. > + For s390, the unwound registers are: R11(FP) R14(LR) R15(SP) F0 F2 F4 … > This doesn't match the current implementation, right? That's correct. The comment is out of date. I'll fix it. > In fact it seems that F0-F7 are unwound, although F0, F2, and F4 are > probably sufficient (I hope). I at first implemented only F0, F2 and F4, but then I thought it would be worth checking for "summarisation failures" (failures to represent the unwind info in V's internal structures) when reading all the unwind data for libc/libpthread/whatever when starting a hello-world program. From that it was obvious that the compiler really does use all of F0-F7 at least once. So I decided to do them all.
Created attachment 119257 [details] New GET_STARTREGS implementation for s390x (In reply to Julian Seward from comment #7) > [...] From that > it was obvious that the compiler really does use all of F0-F7 at least > once. So I decided to do them all. Oh, my. OK, here's a suggested new implementation of GET_STARTREGS for s390x in m_libcassert.c.
Fixed, d36ea889d8d8a1646be85c30ab5771af6912b7a1.