Summary: | s390x: /bin/true segfaults with "grail" enabled | ||
---|---|---|---|
Product: | [Developer tools] valgrind | Reporter: | Andreas Arnez <arnez> |
Component: | memcheck | Assignee: | Julian Seward <jseward> |
Status: | RESOLVED FIXED | ||
Severity: | normal | ||
Priority: | NOR | ||
Version First Reported In: | unspecified | ||
Target Milestone: | --- | ||
Platform: | Other | ||
OS: | Linux | ||
Latest Commit: | Version Fixed In: | ||
Sentry Crash Report: | |||
Attachments: |
Patch to enable "grail" on s390x
Excerpt of Valgrind output with --trace-flags=10000000 Longer disassembly for comment 5 s390x: Fix register usage of conditional moves |
Description
Andreas Arnez
2020-02-07 19:01:06 UTC
The crash happens in glibc's elf/dl-lookup.c in do_lookup_x, when trying to access the first element of the array 'list'. It seems that the register containing the address is corrupted; it was just copied from %r4 into %r6, but the values differ. Some observations: • When using `--vex-guest-max-insns=40' or lower, the problem disappears. • The option `--vex-iropt-register-updates=allregs-at-each-insn' does not help. • When attaching GDB, it seems that the corruption happens on the `lgr' instruction that is supposed to copy the value from %r4. Instead of copying, %r6 is overwritten with some random (?) value: ┌──── │ 1: x/i $pc │ => 0x4009e36 <do_lookup_x+182>: lgr %r6,%r4 │ 2: /x $r4 = 0x4829f90 │ 3: /x $r6 = 0x1fff000260 │ (gdb) si │ 1: x/i $pc │ => 0x4009e3a <do_lookup_x+186>: cgije %r2,0,0x4009f34 <do_lookup_x+436> │ 2: /x $r4 = 0x4829f90 │ 3: /x $r6 = 0x10 └──── • The `lgr' instruction is affected by grail's transformation. After the transformation, the register move is performed conditionally. Created attachment 125751 [details]
Patch to enable "grail" on s390x
Created attachment 125752 [details]
Excerpt of Valgrind output with --trace-flags=10000000
I studied this more today. One theory I had was that the insn selector was generating wrong code for converting a value in the lowest bit of a register into a condition code (function s390_isel_cc(), case "/* Variable: values are 1 or 0 */"), due to testing the whole register without first masking off all but the least significant bit. However, I added code to first copy the register, do the masking, and only then the test. This had no effect :-( I then looked in detail at the front end actions (the &&-recovery) for the basic block(s) where the failure happens. This all seems to be correct to me. Plus, all that front end logic is target-independent, and if it was broken I would have expected it to have failed on some other target by now. I looked some s390 insn selector bits and pieces relating to converting integer values to/from condition codes, but didn't see any error there. So I'm still not sure what is going on. Given that the front end stuff works OK on other targets, it might be that this is an s390 insn selector bug which has was always there but has only shown up because the &&-recovery transformation causes the back end to be given a wider variety of IR. Even that sounds pretty unlikely, though. As a next step I am inclined to add printf lines for all cases (rules) in the insn selector. Then run the test case with and without &&-recovery enabled, so as to find the sets of rules used in both cases, and diff them, to see what is only used when &&-recovery is enabled. Then look more closely at those cases. Right now I cannot think of anything else to do. I will try to find time to do that in the next two weeks or so. (In reply to Julian Seward from comment #4) > As a next step I am inclined to add printf lines for all cases (rules) > in the insn selector. Then run the test case with and without &&-recovery > enabled, so as to find the sets of rules used in both cases, and diff > them, to see what is only used when &&-recovery is enabled. Then look > more closely at those cases. I did that yesterday. The only rules I found like that were for Iop_And1 and Iop_Or1, as expected. And these still look correct to me. So I really don't think this is an instruction selector problem, and for reasons explained in comment #4, I don't think this is a front end problem either. That leaves as possibilities either the register allocator (core logic), the s390-specific support for the register allocator (getRegUsage_S390, basically), or something to do with the stack layout. I experimented with register allocation a bit. getRRegUniverse_S390() tells the register allocator which registers are available. If register allocation is correct, removing some registers from the set of available ones should not change the behaviour of the code, affecting only performance. But it does make a difference. getRRegUniverse_S390() provides GPRs in two separate groups: 1..5 and 6..11. Removing one group or the other does not change the place (in the guest code) where V crashes, but it does change the error messages at the crash point. This seems like a big red flag to me. With the GPR group 6..11 removed from allocation, what I notice is that the crash occurs soon after a call to a helper function. And it crashes due to an invalid memory reference, using as address, a value which is spilled before the call and reloaded afterwards. I'll put a longer disassembly in an attached file, but here are the essentials: // r13 is the guest state pointer. It points to an area which // is: 3 copies of struct VexGuestS390XState, followed by a spill area. 0x0000001003c5be5a: llilf %r0,67149062 ; 0x4009d06 0x0000001003c5be60: stg %r0,720(%r13) ; set guest PC to 0x4009d06 // Do stuff (I don't know what; probably not important) 0x0000001003c5be66: lg %r5,656(%r13) 0x0000001003c5be6c: sllg %r5,%r5,3 0x0000001003c5be72: ag %r5,624(%r13) 0x0000001003c5be78: lg %r4,1472(%r13) 0x0000001003c5be7e: sllg %r4,%r4,3 0x0000001003c5be84: og %r4,1440(%r13) 0x0000001003c5be8a: ltgr %r4,%r4 0x0000001003c5be8e: stg %r5,2464(%r13) ; spill <--- 0x0000001003c5be94: je 0x1003c5beae // helper call; sequence created by s390_insn_helper_call_emit() 0x0000001003c5be98: iihf %r1,8 0x0000001003c5be9e: iilf %r1,98288 0x0000001003c5bea4: stfpc 232(%r15) 0x0000001003c5bea8: basr %r14,%r1 0x0000001003c5beaa: lfpc 232(%r15) // after the call 0x0000001003c5beae: lg %r5,2464(%r13) ; reload <--- 0x0000001003c5beb4: lgr %r2,%r5 // another helper call 0x0000001003c5beb8: iihf %r1,8 0x0000001003c5bebe: iilf %r1,95000 0x0000001003c5bec4: stfpc 232(%r15) 0x0000001003c5bec8: basr %r14,%r1 0x0000001003c5beca: lfpc 232(%r15) // after the call 0x0000001003c5bece: lgr %r5,%r2 0x0000001003c5bed2: lg %r3,2464(%r13) ; reload <--- => 0x0000001003c5bed8: lg %r4,0(%r3) // crash (r3 is near zero) <--- So it might be that 2464(%r13) got corrupted across one of the helper calls. From comments on VexGuestS390XState, it appears that this has size 816 bytes. If that's correct, then the 3 copies of it starting at 0(%r13) have total size 2448 bytes, so offset 2464 is 16 bytes into the spill area. So that's my analysis so far. I think it is suspicious that changing the available register set changes the messages shown at the failure site. But the analysis might be wrong. For example, it may be the value that is computed into %r5 and then spilled into 2464(%r13) is already wrong, and that the spilling is OK. Created attachment 126754 [details] Longer disassembly for comment 5 Created attachment 126871 [details]
s390x: Fix register usage of conditional moves
(In reply to Andreas Arnez from comment #7) > Created attachment 126871 [details] > s390x: Fix register usage of conditional moves Looks good to me. Two minor points: For the NEVER case, it might be more "symmetrical" in relation to the ALWAYS case, to say that it modifies dst, rather than saying nothing. (I know it modifies it with the same value). Then in effect the code says: NEVER= read dst, write dst; ALWAYS = read src, write dst although I think the absolute safest thing would be to put an assertion in the construction for this insn to disallow NEVER and ALWAYS, since they are degenerate cases, and I assume will never occur in practice. Second point is .. are there any other cond-move insn variants? FP, Vector etc? +1 to land with any of the above changes you think are good (including none). (In reply to Julian Seward from comment #8) > For the NEVER case, it might be more "symmetrical" in relation to > the ALWAYS case, to say that it modifies dst, rather than saying > nothing. (I know it modifies it with the same value). Then > in effect the code says: > > NEVER= read dst, write dst; > ALWAYS = read src, write dst The rationale here was that NEVER results in no s390x instruction being emitted, so nothing is actually read or written. > although I think the absolute safest thing would be to put an assertion > in the construction for this insn to disallow NEVER and ALWAYS, since they > are degenerate cases, and I assume will never occur in practice. It seems that s390_isel_cc can at least theoretically yield NEVER or ALWAYS (in case of an Int1 constant, Ico_U1), so I guess I feel more comfortable with leaving support for that in right now. Unless it's guaranteed that Ico_U1 can't survive until isel? Of course, we could also handle these cases specially instead of always creating a conditional move. Perhaps I'll look into that later, but I prefer to do that independently from this bug. > Second point is .. are there any other cond-move insn variants? > FP, Vector etc? Not yet, but I guess I'll have to add them in order to fix the remaining "grail" fails on s390x. > +1 to land with any of the above changes you think are good (including none). OK, thanks for looking into this! Pushed as 942a48c1db83ffbcbf1f5781d5607f3b42849b67. This fixes the segfaults in my testing. ("Grail" still shows problems on s390x, due to missing support for Iex_ITE expressions for anything else but integer types. Opened Bug 418997 for this.) |