test_isa_2_06_part1.stdout.diff-LE shows: --- test_isa_2_06_part1.stdout.exp-LE 2018-12-06 07:26:05.719282264 -0500 +++ test_isa_2_06_part1.stdout.out 2018-12-06 10:12:42.028654431 -0500 @@ -79,33 +79,33 @@ lfiwzx: 1122867 => 1122867.00 Test P7 floating point convert instructions -fcfids 0010000000000001 => (raw sp) 59800000) -fcfids 00100094e0000359 => (raw sp) 598004a7) -fcfids 3fe0000000000001 => (raw sp) 5e7f8000) -fcfids 3fe00094e0000359 => (raw sp) 5e7f8002) -fcfids 8010000000000001 => (raw sp) deffe000) -fcfids 80100094e0000359 => (raw sp) deffdfff) -fcfids bfe0000000000001 => (raw sp) de804000) -fcfids bfe00094e0000359 => (raw sp) de803fff) -fcfids 0020000000000b01 => (raw sp) 5a000000) -fcfids 00000000203f0b3d => (raw sp) 4e00fc2d) -fcfids 00000000005a203d => (raw sp) 4ab4407a) -fcfids 8020000000000b01 => (raw sp) deffc000) -fcfids 80000000203f0b3d => (raw sp) df000000) +fcfids 0010000000000001 => (raw sp) 00000000) +fcfids 00100094e0000359 => (raw sp) 00000000) +fcfids 3fe0000000000001 => (raw sp) 00000000) +fcfids 3fe00094e0000359 => (raw sp) 00000000) +fcfids 8010000000000001 => (raw sp) 00000000) +fcfids 80100094e0000359 => (raw sp) 00000000) +fcfids bfe0000000000001 => (raw sp) 00000000) +fcfids bfe00094e0000359 => (raw sp) 00000000) +fcfids 0020000000000b01 => (raw sp) 00000000) +fcfids 00000000203f0b3d => (raw sp) 00000000) +fcfids 00000000005a203d => (raw sp) 00000000) +fcfids 8020000000000b01 => (raw sp) 00000000) +fcfids 80000000203f0b3d => (raw sp) 00000000) -fcfidus 0010000000000001 => (raw sp) 59800000) -fcfidus 00100094e0000359 => (raw sp) 598004a7) -fcfidus 3fe0000000000001 => (raw sp) 5e7f8000) -fcfidus 3fe00094e0000359 => (raw sp) 5e7f8002) -fcfidus 8010000000000001 => (raw sp) 5f001000) -fcfidus 80100094e0000359 => (raw sp) 5f001001) -fcfidus bfe0000000000001 => (raw sp) 5f3fe000) -fcfidus bfe00094e0000359 => (raw sp) 5f3fe001) -fcfidus 0020000000000b01 => (raw sp) 5a000000) -fcfidus 00000000203f0b3d => (raw sp) 4e00fc2d) -fcfidus 00000000005a203d => (raw sp) 4ab4407a) -fcfidus 8020000000000b01 => (raw sp) 5f002000) -fcfidus 80000000203f0b3d => (raw sp) 5f000000) +fcfidus 0010000000000001 => (raw sp) 00000000) +fcfidus 00100094e0000359 => (raw sp) 00000000) +fcfidus 3fe0000000000001 => (raw sp) 00000000) +fcfidus 3fe00094e0000359 => (raw sp) 00000000) +fcfidus 8010000000000001 => (raw sp) 00000000) +fcfidus 80100094e0000359 => (raw sp) 00000000) +fcfidus bfe0000000000001 => (raw sp) 00000000) +fcfidus bfe00094e0000359 => (raw sp) 00000000) +fcfidus 0020000000000b01 => (raw sp) 00000000) +fcfidus 00000000203f0b3d => (raw sp) 00000000) +fcfidus 00000000005a203d => (raw sp) 00000000) +fcfidus 8020000000000b01 => (raw sp) 00000000) +fcfidus 80000000203f0b3d => (raw sp) 00000000) fcfidu 0010000000000001 => (raw sp) 4330000000000001) fcfidu 00100094e0000359 => (raw sp) 43300094e0000359) This is with gcc-8.2.1-3.4.el8.ppc64le, glibc-2.28-39.el8.ppc64le and binutils-2.30-49.el8.ppc64le.
Started to investigate this one. Still more investigation to do, but I wanted to do a snapshot braindump at this point. It looks like the behavior is affected by the optimization level used to build the testcase itself. Using a significantly stripped down test that exercises the fcfids instruction similar to how the valgrind test does, I see # $> gcc -v gcc version 9.0.0 20181130 (experimental) (GCC) > gcc test_fcfids.c -O0 -o test_fcfids 11:37:31 0 p8 ppc64le willschm@pike:~/valgrind_builds/valgrind/none/tests/ppc64> valgrind ./test_fcfids ==104410== Memcheck, a memory error detector <...> 0.000000 0 0 testcase 0000000000000000 => (raw sp) 0 ) 1000.000000 1 4.65201e+18 testcase 408f400000000000 => (raw sp) 5e811e80 ) But with -O or -O1 (or higher), the 5e811e80 value disappears. 11:39:04 0 p8 ppc64le willschm@pike:~/valgrind_builds/valgrind/none/tests/ppc64> gcc test_fcfids.c -O1 -o test_fcfids 11:39:12 0 p8 ppc64le willschm@pike:~/valgrind_builds/valgrind/none/tests/ppc64> valgrind ./test_fcfids ==104524== Memcheck, a memory error detector <...> 0.000000 0 0 testcase 0000000000000000 => (raw sp) 0 ) 1000.000000 1 4.65201e+18 testcase 408f400000000000 => (raw sp) 0 ) # but is still present when run outside of valgrind. 11:39:14 0 p8 ppc64le willschm@pike:~/valgrind_builds/valgrind/none/tests/ppc64> ./test_fcfids 0.000000 0 0 testcase 0000000000000000 => (raw sp) 0 ) 1000.000000 1 4.65201e+18 testcase 408f400000000000 => (raw sp) 5e811e80 )
Created attachment 117371 [details] simpler standalone testcase simpler standalone testcase
The fact that it depends on the optimisation used to built the tests is usually a sign that there's something broken in the inline assembly, or maybe in this case, in the use of fixed registers. And so it just happens to work with one compiler and/or optimisation level, but not in general.
So, it looks like the correct value is in the register, we are losing it somewhere in the cast from float to (raw/hex) on the call to printf. I'm undecided whether this is simply an improper cast, or if there is something additionally wrong here. tentative patch (attached) while I think on this a bit more...
Created attachment 119038 [details] tentative / potential patch
(In reply to Will Schmidt from comment #4) > So, it looks like the correct value is in the register, we are losing it > somewhere in the cast from float to (raw/hex) on the call to printf. > I'm undecided whether this is simply an improper cast, or if there is > something additionally wrong here. > > tentative patch (attached) while I think on this a bit more... brain-dump update. I've bisected GCC to determine where the behavior changed. looks like the gcc commit referenced here exposed this issue. A sniff test suggests 401827 was also triggered by the same gcc change. commit 93dabbb66cf3e2495c382451f85a0c17c3c326eb (HEAD, refs/bisect/bad) Author: meissner <meissner@138bc75d-0d04-0410-961f-82ee72b054a4> Date: Tue Sep 26 18:04:37 2017 +0000 2017-09-26 Michael Meissner <meissner@linux.vnet.ibm.com> * config/rs6000/rs6000.md (movsi_from_sf): Adjust code to eliminate doing a 32-bit shift right or vector extract after doing XSCVDPSPN. Use zero_extendsidi2 instead of p8_mfvsrd_4_disf to move the value to the GPRs. (movdi_from_sf_zero_ext): Likewise. (reload_gpr_from_vsxsf): Likewise. (p8_mfvsrd_4_disf): Delete, no longer used.
Created attachment 119203 [details] proposed patch (v2) For both this and 401827, the problem was exposed by a gcc change to *not* duplicate the result of some convert operation into 'undefined' portions of the target register. (And valgrind used/set the value in an undefined portion of the result. Adding printfs or playing with casts was enough to hide the problem, so this took a bit to figure out. For this patch, i've updated the valgrind emulation to fill in the proper (?) portion of the target registers. I have left the existing target register updates in place. This works in my sniff-tests, could still use validation in a cleanly built environment.
I'm confused about the top level diagnosis here. I see two possibilities: (1) If the test program, when run directly (meaning, not on V) produces different results depending on compiler version and/or optimisation level, then the test program is buggy. (2) If (1) isn't the case, and instead, the test program produces different results when run directly vs when run on V, then V is buggy. Note that"run directly" really does mean run directly on the machine, and not merely "whatever is in the .stdout.exp file". Will: have you definitely excluded (1) as a possibility? My initial assumption here was that this problem is (1), but from the comments above it's hard to be sure either way.
(In reply to Julian Seward from comment #8) > I'm confused about the top level diagnosis here. I see two possibilities: > > (1) If the test program, when run directly (meaning, not on V) produces > different results depending on compiler version and/or optimisation > level, then the test program is buggy. > > (2) If (1) isn't the case, and instead, the test program produces different > results when run directly vs when run on V, then V is buggy. ^ this one (option 2). :-) The testcase only fails when the testcase is run under valgrind and the testcase is built with a newer gcc. > Note that"run directly" really does mean run directly on the machine, > and not merely "whatever is in the .stdout.exp file". Correct. As part of running this down I iterated on debug while running the test both standalone and under valgrind. > Will: have you definitely excluded (1) as a possibility? My initial > assumption here was that this problem is (1), but from the comments above > it's hard to be sure either way. Yup. Comment 1 should have include that detail, but lots of associated subsequent noise to help confuse. I have previously spoken to Michael Meissner to confirm the behavior of the GCC change, this explanation is based on his comments, and in hindsight fits and explains what we were seeing. The xscvdpsp,xscvdpspn,xscvdpuxws instructions each convert double precision values to single precision values, and write the results into bits 0-32 of the 128 bit target register. To get the value into the normal position for a scalar register it needed to be right-shifted 32 bits, so gcc always did that. For Power9 we (toolchain) requested hardware put the result into the second 32-bit section so we could avoid the shift. It was then that we realized hardware was already putting the result into *both* of those 32-bit sections, so the compiler removed the (redundant) shift. Valgrind emulation only wrote the result to the first 32-bit section, so the issue was exposed when GCC dropped doing that redundant shift. The proposed patch duplicates the result into the second 32-bit section of the target register.
(In reply to Will Schmidt from comment #9) > > (2) If (1) isn't the case, and instead, the test program produces different > > results when run directly vs when run on V, then V is buggy. > > ^ this one (option 2). :-) Ok, thanks for the clarification. In that case, the patch looks OK, except for the fact that it duplicates the entire subtree for calculation of the first 32 bits of the result. That will cause the tree to be evaluated twice, unless you get lucky and the IR optimiser removes the duplication. But that usually doesn't happen, since CSE is expensive and is only used in specific situations. Please can you fix this by binding the duplicated subtrees to a new IRTemp and instead use the IRTemps twice? Here's an example for case 0x212 (xscvdpsp); the other cases should be obvious: case 0x212: // xscvdpsp (VSX Scalar round Double-Precision to single-precision and // Convert to Single-Precision format // Apr 2019 update - write the result to both halves of the // target VSR. (see bug 401827,401828). DIP("xscvdpsp v%u,v%u\n", XT, XB); IRTemp resultI32 = newTemp(Ity_I32); assign(resultI32, unop( Iop_ReinterpF32asI32, unop( Iop_TruncF64asF32, binop( Iop_RoundF64toF32, get_IR_roundingmode(), mkexpr( xB ) ) ) ); putVSReg( XT, binop( Iop_64HLtoV128, binop( Iop_32HLto64, mkexpr(resultI32), mkexpr(resultI32) ), mkU64( 0ULL ) ) ); break;
Created attachment 119218 [details] proposed patch (v3) reworked per feedback. (thanks). possibly one more spin after this. see subsequent comment/attachment.
Created attachment 119219 [details] vex : sanityCheckFail: exiting due to bad IR I got this error when regression testing this patch on a P8 BE system. Probably obvious, but I don't immediately understand what this is telling me. (I'll look more tomorrow) +IN STATEMENT: + +t13 = ReinterpF64asI64(RoundF64toF32(Xor32(t11,And32(Shl32(t11,0x........:I8),0x........:I32)),DivF64(Xor32(t11,And32(Shl32(t11,0x........:I8),0x........:I32)),F64i{0x........},t9))) + +ERROR = IRStmt.Put.Tmp: tmp and expr do not match
(In reply to Will Schmidt from comment #12) > Created attachment 119219 [details] > vex : sanityCheckFail: exiting due to bad IR > > I got this error when regression testing this patch on a P8 BE system. > Probably obvious, but I don't immediately understand what this is telling > me. (I'll look more tomorrow) > > +IN STATEMENT: > + > +t13 = > ReinterpF64asI64(RoundF64toF32(Xor32(t11,And32(Shl32(t11,0x........:I8),0x... > .....:I32)),DivF64(Xor32(t11,And32(Shl32(t11,0x........:I8),0x........:I32)), > F64i{0x........},t9))) > + > +ERROR = IRStmt.Put.Tmp: tmp and expr do not match See just before: +IR SANITY CHECK FAILURE + +IRSB { + t0:I64 t1:I64 t2:I64 t3:I64 t4:I64 t5:I64 t6:I64 t7:V128 + t8:V128 t9:F64 t10:F64 t11:I32 t12:I32 t13:I32 t14:I64 t15:I64 + t16:I64 t17:V128 t18:I64 t19:I64 t20:I64 t21:I32 t22:I32 t23:I32 + t24:I64 t25:I32 t26:I32 t27:I32 So t13 has type I32, but ReinterpF64asI64 returns an I64. I think you had wanted to use Iop_ReinterpF32asI32.
(In reply to Mark Wielaard from comment #13) > > +t13 = > > ReinterpF64asI64(RoundF64toF32(Xor32(t11,And32(Shl32(t11,0x........:I8),0x... > > .....:I32)),DivF64(Xor32(t11,And32(Shl32(t11,0x........:I8),0x........:I32)), > > F64i{0x........},t9))) > + t8:V128 t9:F64 t10:F64 t11:I32 t12:I32 t13:I32 t14:I64 > So t13 has type I32, but ReinterpF64asI64 returns an I64. > > I think you had wanted to use Iop_ReinterpF32asI32. RoundF64toF32 has a bit of a misleading name. The result type is still F64, but the value itself is rounded to F32 range: typeOfPrimop(..) contains: case Iop_RoundF64toF32: BINARY(ity_RMode,Ity_F64, Ity_F64); So I think in this case, t13 (is that the two-be-used-twice value?) needs to be an I64. That is, be initialised with newTemp(Ity_I64).
Hmm, ignore my previous suggestion. What's evident from the failure message is that 1. t13 is declared to be an I32 2. however, it is assigned an I64 value, as created by ReinterpF64asI64, which is why the sanity check failed 3. it is then used as an I32 value: 32HLto64(t13,t13). So (1) and (3) are consistent but (2) isn't. + t8:V128 t9:F64 t10:F64 t11:I32 t12:I32 t13:I32 t14:I64 + t13 = ReinterpF64asI64(RoundF64toF32(Xor32(t11,And32(Shl32(t11,0x........:I8),0x........:I32)),DivF64(Xor32(t11,And32(Shl32(t11,0x........:I8),0x........:I32)),F64i{0x........},t9))) + PUT(784) = 64HLtoV128(ReinterpF64asI64(32HLto64(t13,t13)),0x........:I64)
Created attachment 119233 [details] proposed patch (v4) Updated per feedback received. tested OK on P7,P8LE,P8BE,P9 (LE).
Looks good to me. Land!
I reviewed and tested the attached patch to fix this issue. The patch fixes this bug and bugzilla 401827. The patch was tested on Power 7, Power 8 LE, Power 8 BE, Power 9. The patch commit: commit 82e94fff802aece376d5ca8458ef49d24afd7bdf Author: Carl Love <carll@us.ibm.com> Date: Thu Apr 4 12:31:05 2019 -0500 PPC64, patch to test case issues reported in bugzilla 401827 and 401828.