Bug 401828 - none/tests/ppc64/test_isa_2_06_part1 failure on ppc64le (fcfids and fcfidus)
Summary: none/tests/ppc64/test_isa_2_06_part1 failure on ppc64le (fcfids and fcfidus)
Status: CLOSED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: general (show other bugs)
Version: unspecified
Platform: Other Linux
: NOR normal
Target Milestone: ---
Assignee: Julian Seward
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-12-06 15:15 UTC by Mark Wielaard
Modified: 2019-04-04 17:36 UTC (History)
2 users (show)

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:


Attachments
simpler standalone testcase (793 bytes, text/x-csrc)
2019-01-09 17:49 UTC, Will Schmidt
Details
tentative / potential patch (984 bytes, patch)
2019-03-25 21:57 UTC, Will Schmidt
Details
proposed patch (v2) (7.05 KB, patch)
2019-04-01 21:17 UTC, Will Schmidt
Details
proposed patch (v3) (7.81 KB, patch)
2019-04-02 22:37 UTC, Will Schmidt
Details
vex : sanityCheckFail: exiting due to bad IR (4.65 KB, patch)
2019-04-02 22:41 UTC, Will Schmidt
Details
proposed patch (v4) (6.24 KB, patch)
2019-04-03 19:44 UTC, Will Schmidt
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Mark Wielaard 2018-12-06 15:15:57 UTC
test_isa_2_06_part1.stdout.diff-LE shows:

--- test_isa_2_06_part1.stdout.exp-LE   2018-12-06 07:26:05.719282264 -0500
+++ test_isa_2_06_part1.stdout.out      2018-12-06 10:12:42.028654431 -0500
@@ -79,33 +79,33 @@
 lfiwzx: 1122867 => 1122867.00
 
 Test P7 floating point convert instructions
-fcfids 0010000000000001 => (raw sp) 59800000)
-fcfids 00100094e0000359 => (raw sp) 598004a7)
-fcfids 3fe0000000000001 => (raw sp) 5e7f8000)
-fcfids 3fe00094e0000359 => (raw sp) 5e7f8002)
-fcfids 8010000000000001 => (raw sp) deffe000)
-fcfids 80100094e0000359 => (raw sp) deffdfff)
-fcfids bfe0000000000001 => (raw sp) de804000)
-fcfids bfe00094e0000359 => (raw sp) de803fff)
-fcfids 0020000000000b01 => (raw sp) 5a000000)
-fcfids 00000000203f0b3d => (raw sp) 4e00fc2d)
-fcfids 00000000005a203d => (raw sp) 4ab4407a)
-fcfids 8020000000000b01 => (raw sp) deffc000)
-fcfids 80000000203f0b3d => (raw sp) df000000)
+fcfids 0010000000000001 => (raw sp) 00000000)
+fcfids 00100094e0000359 => (raw sp) 00000000)
+fcfids 3fe0000000000001 => (raw sp) 00000000)
+fcfids 3fe00094e0000359 => (raw sp) 00000000)
+fcfids 8010000000000001 => (raw sp) 00000000)
+fcfids 80100094e0000359 => (raw sp) 00000000)
+fcfids bfe0000000000001 => (raw sp) 00000000)
+fcfids bfe00094e0000359 => (raw sp) 00000000)
+fcfids 0020000000000b01 => (raw sp) 00000000)
+fcfids 00000000203f0b3d => (raw sp) 00000000)
+fcfids 00000000005a203d => (raw sp) 00000000)
+fcfids 8020000000000b01 => (raw sp) 00000000)
+fcfids 80000000203f0b3d => (raw sp) 00000000)
 
-fcfidus 0010000000000001 => (raw sp) 59800000)
-fcfidus 00100094e0000359 => (raw sp) 598004a7)
-fcfidus 3fe0000000000001 => (raw sp) 5e7f8000)
-fcfidus 3fe00094e0000359 => (raw sp) 5e7f8002)
-fcfidus 8010000000000001 => (raw sp) 5f001000)
-fcfidus 80100094e0000359 => (raw sp) 5f001001)
-fcfidus bfe0000000000001 => (raw sp) 5f3fe000)
-fcfidus bfe00094e0000359 => (raw sp) 5f3fe001)
-fcfidus 0020000000000b01 => (raw sp) 5a000000)
-fcfidus 00000000203f0b3d => (raw sp) 4e00fc2d)
-fcfidus 00000000005a203d => (raw sp) 4ab4407a)
-fcfidus 8020000000000b01 => (raw sp) 5f002000)
-fcfidus 80000000203f0b3d => (raw sp) 5f000000)
+fcfidus 0010000000000001 => (raw sp) 00000000)
+fcfidus 00100094e0000359 => (raw sp) 00000000)
+fcfidus 3fe0000000000001 => (raw sp) 00000000)
+fcfidus 3fe00094e0000359 => (raw sp) 00000000)
+fcfidus 8010000000000001 => (raw sp) 00000000)
+fcfidus 80100094e0000359 => (raw sp) 00000000)
+fcfidus bfe0000000000001 => (raw sp) 00000000)
+fcfidus bfe00094e0000359 => (raw sp) 00000000)
+fcfidus 0020000000000b01 => (raw sp) 00000000)
+fcfidus 00000000203f0b3d => (raw sp) 00000000)
+fcfidus 00000000005a203d => (raw sp) 00000000)
+fcfidus 8020000000000b01 => (raw sp) 00000000)
+fcfidus 80000000203f0b3d => (raw sp) 00000000)
 
 fcfidu 0010000000000001 => (raw sp) 4330000000000001)
 fcfidu 00100094e0000359 => (raw sp) 43300094e0000359)

This is with gcc-8.2.1-3.4.el8.ppc64le, glibc-2.28-39.el8.ppc64le and binutils-2.30-49.el8.ppc64le.
Comment 1 Will Schmidt 2019-01-09 17:48:36 UTC
Started to investigate this one.  Still more investigation to do, but I wanted to do a snapshot braindump at this point.

It looks like the behavior is affected by the optimization level used to build the testcase itself.   Using a significantly stripped down test that exercises the fcfids instruction similar to how the valgrind test does, I see

# 
$> gcc -v 
gcc version 9.0.0 20181130 (experimental) (GCC) 

> gcc     test_fcfids.c -O0 -o test_fcfids
11:37:31 0 p8 ppc64le willschm@pike:~/valgrind_builds/valgrind/none/tests/ppc64> valgrind ./test_fcfids
==104410== Memcheck, a memory error detector
<...>
0.000000
0 0
testcase 0000000000000000 => (raw sp)                0 )
1000.000000
1 4.65201e+18
testcase 408f400000000000 => (raw sp)         5e811e80 )

But with -O or -O1 (or higher), the 5e811e80 value disappears.
11:39:04 0 p8 ppc64le willschm@pike:~/valgrind_builds/valgrind/none/tests/ppc64> gcc     test_fcfids.c -O1 -o test_fcfids

11:39:12 0 p8 ppc64le willschm@pike:~/valgrind_builds/valgrind/none/tests/ppc64> valgrind ./test_fcfids
==104524== Memcheck, a memory error detector
<...>
0.000000
0 0
testcase 0000000000000000 => (raw sp)                0 )
1000.000000
1 4.65201e+18
testcase 408f400000000000 => (raw sp)                0 )

# but is still present when run outside of valgrind.
11:39:14 0 p8 ppc64le willschm@pike:~/valgrind_builds/valgrind/none/tests/ppc64> ./test_fcfids
0.000000
0 0
testcase 0000000000000000 => (raw sp)                0 )
1000.000000
1 4.65201e+18
testcase 408f400000000000 => (raw sp)         5e811e80 )
Comment 2 Will Schmidt 2019-01-09 17:49:59 UTC
Created attachment 117371 [details]
simpler standalone testcase

simpler standalone testcase
Comment 3 Julian Seward 2019-03-10 08:56:58 UTC
The fact that it depends on the optimisation used to built the tests
is usually a sign that there's something broken in the inline assembly,
or maybe in this case, in the use of fixed registers.  And so it just
happens to work with one compiler and/or optimisation level, but not in
general.
Comment 4 Will Schmidt 2019-03-25 21:57:08 UTC
So, it looks like the correct value is in the register, we are losing it somewhere in the cast from float to (raw/hex) on the call to printf.
I'm undecided whether this is simply an improper cast, or if there is something additionally wrong here.

tentative patch (attached) while I think on this a bit more...
Comment 5 Will Schmidt 2019-03-25 21:57:59 UTC
Created attachment 119038 [details]
tentative / potential patch
Comment 6 Will Schmidt 2019-03-28 15:02:51 UTC
(In reply to Will Schmidt from comment #4)
> So, it looks like the correct value is in the register, we are losing it
> somewhere in the cast from float to (raw/hex) on the call to printf.
> I'm undecided whether this is simply an improper cast, or if there is
> something additionally wrong here.
> 
> tentative patch (attached) while I think on this a bit more...

brain-dump update.

I've bisected GCC to determine where the behavior changed.  looks like the gcc commit referenced here exposed this issue.    A sniff test suggests 401827 was also triggered by the same gcc change.

  commit 93dabbb66cf3e2495c382451f85a0c17c3c326eb (HEAD, refs/bisect/bad)
  Author: meissner <meissner@138bc75d-0d04-0410-961f-82ee72b054a4>
  Date:   Tue Sep 26 18:04:37 2017 +0000

    2017-09-26  Michael Meissner  <meissner@linux.vnet.ibm.com>
    
            * config/rs6000/rs6000.md (movsi_from_sf): Adjust code to
            eliminate doing a 32-bit shift right or vector extract after doing
            XSCVDPSPN.  Use zero_extendsidi2 instead of p8_mfvsrd_4_disf to
            move the value to the GPRs.
            (movdi_from_sf_zero_ext): Likewise.
            (reload_gpr_from_vsxsf): Likewise.
            (p8_mfvsrd_4_disf): Delete, no longer used.
Comment 7 Will Schmidt 2019-04-01 21:17:43 UTC
Created attachment 119203 [details]
proposed patch (v2)

For both this and 401827, the problem was exposed by a gcc change to *not* duplicate the result of some convert operation into 'undefined' portions of the target register.  (And valgrind used/set the value in an undefined portion of the result.  Adding printfs or playing with casts was enough to hide the problem, so this took a bit to figure out. For this patch, i've updated the valgrind emulation to fill in the proper (?) portion of the target registers.  I have left the existing target register updates in place. 
This works in my sniff-tests, could still use validation in a cleanly built environment.
Comment 8 Julian Seward 2019-04-02 08:25:18 UTC
I'm confused about the top level diagnosis here.  I see two possibilities:

(1) If the test program, when run directly (meaning, not on V) produces
    different results depending on compiler version and/or optimisation
    level, then the test program is buggy.

(2) If (1) isn't the case, and instead, the test program produces different
    results when run directly vs when run on V, then V is buggy.

Note that"run directly" really does mean run directly on the machine,
and not merely "whatever is in the .stdout.exp file".

Will: have you definitely excluded (1) as a possibility?  My initial 
assumption here was that this problem is (1), but from the comments above
it's hard to be sure either way.
Comment 9 Will Schmidt 2019-04-02 16:03:44 UTC
(In reply to Julian Seward from comment #8)
> I'm confused about the top level diagnosis here.  I see two possibilities:
> 
> (1) If the test program, when run directly (meaning, not on V) produces
>     different results depending on compiler version and/or optimisation
>     level, then the test program is buggy.
> 
> (2) If (1) isn't the case, and instead, the test program produces different
>     results when run directly vs when run on V, then V is buggy.

^ this one (option 2).  :-)

The testcase only fails when the testcase is run under valgrind and the testcase is built with a newer gcc. 

> Note that"run directly" really does mean run directly on the machine,
> and not merely "whatever is in the .stdout.exp file".

Correct.  As part of running this down I iterated on debug while running the test both standalone and under valgrind.  


> Will: have you definitely excluded (1) as a possibility?  My initial 
> assumption here was that this problem is (1), but from the comments above
> it's hard to be sure either way.

Yup.  Comment 1 should have include that detail, but lots of associated subsequent noise to help confuse.


I have previously spoken to Michael Meissner to confirm the behavior of the GCC change, this explanation is based on his comments, and in hindsight fits and explains what we were seeing.

The xscvdpsp,xscvdpspn,xscvdpuxws instructions each convert double precision values to single precision values, and write the results into bits 0-32 of the 128 bit target register.  To get the value into the normal position for a scalar register it needed to be right-shifted 32 bits, so gcc always did that.  For Power9 we (toolchain) requested hardware put the result into the second 32-bit section so we could avoid the shift.  It was then that we realized hardware was already putting the result into *both* of those 32-bit sections, so the compiler removed the (redundant) shift.

Valgrind emulation only wrote the result to the first 32-bit section, so the issue was exposed when GCC dropped doing that redundant shift.

The proposed patch duplicates the result into the second 32-bit section of the target register.
Comment 10 Julian Seward 2019-04-02 17:37:22 UTC
(In reply to Will Schmidt from comment #9)
> > (2) If (1) isn't the case, and instead, the test program produces different
> >     results when run directly vs when run on V, then V is buggy.
> 
> ^ this one (option 2).  :-)

Ok, thanks for the clarification.  In that case, the patch looks OK, except
for the fact that it duplicates the entire subtree for calculation of the 
first 32 bits of the result.  That will cause the tree to be evaluated twice,
unless you get lucky and the IR optimiser removes the duplication.  But 
that usually doesn't happen, since CSE is expensive and is only used in
specific situations.

Please can you fix this by binding the duplicated subtrees to a new IRTemp
and instead use the IRTemps twice?  Here's an example for case 0x212 (xscvdpsp);
the other cases should be obvious:

case 0x212: // xscvdpsp (VSX Scalar round Double-Precision to single-precision and
            //           Convert to Single-Precision format
            // Apr 2019 update - write the result to both halves of the
            // target VSR.  (see bug 401827,401828).
   DIP("xscvdpsp v%u,v%u\n", XT, XB);
   IRTemp resultI32 = newTemp(Ity_I32);
   assign(resultI32, unop( Iop_ReinterpF32asI32,
                                 unop( Iop_TruncF64asF32,
                                       binop( Iop_RoundF64toF32,
                                              get_IR_roundingmode(),
                                              mkexpr( xB ) ) ) );
   putVSReg( XT,
             binop( Iop_64HLtoV128,
                    binop( Iop_32HLto64, mkexpr(resultI32),
                                         mkexpr(resultI32) ),
                    mkU64( 0ULL ) ) );
   break;
Comment 11 Will Schmidt 2019-04-02 22:37:39 UTC
Created attachment 119218 [details]
proposed patch (v3)

reworked per feedback.  (thanks).
possibly one more spin after this. see subsequent comment/attachment.
Comment 12 Will Schmidt 2019-04-02 22:41:37 UTC
Created attachment 119219 [details]
vex : sanityCheckFail: exiting due to bad IR

I got this error when regression testing this patch on a P8 BE system.   Probably obvious, but I don't immediately understand what this is telling me.  (I'll look more tomorrow)

+IN STATEMENT:
+
+t13 = ReinterpF64asI64(RoundF64toF32(Xor32(t11,And32(Shl32(t11,0x........:I8),0x........:I32)),DivF64(Xor32(t11,And32(Shl32(t11,0x........:I8),0x........:I32)),F64i{0x........},t9)))
+
+ERROR = IRStmt.Put.Tmp: tmp and expr do not match
Comment 13 Mark Wielaard 2019-04-02 22:52:51 UTC
(In reply to Will Schmidt from comment #12)
> Created attachment 119219 [details]
> vex : sanityCheckFail: exiting due to bad IR
> 
> I got this error when regression testing this patch on a P8 BE system.  
> Probably obvious, but I don't immediately understand what this is telling
> me.  (I'll look more tomorrow)
> 
> +IN STATEMENT:
> +
> +t13 =
> ReinterpF64asI64(RoundF64toF32(Xor32(t11,And32(Shl32(t11,0x........:I8),0x...
> .....:I32)),DivF64(Xor32(t11,And32(Shl32(t11,0x........:I8),0x........:I32)),
> F64i{0x........},t9)))
> +
> +ERROR = IRStmt.Put.Tmp: tmp and expr do not match

See just before:

+IR SANITY CHECK FAILURE
+
+IRSB {
+   t0:I64   t1:I64   t2:I64   t3:I64   t4:I64   t5:I64   t6:I64   t7:V128
+   t8:V128   t9:F64   t10:F64   t11:I32   t12:I32   t13:I32   t14:I64   t15:I64
+   t16:I64   t17:V128   t18:I64   t19:I64   t20:I64   t21:I32   t22:I32   t23:I32
+   t24:I64   t25:I32   t26:I32   t27:I32   

So t13 has type I32, but ReinterpF64asI64 returns an I64.

I think you had wanted to use Iop_ReinterpF32asI32.
Comment 14 Julian Seward 2019-04-03 04:51:27 UTC
(In reply to Mark Wielaard from comment #13)

> > +t13 =
> > ReinterpF64asI64(RoundF64toF32(Xor32(t11,And32(Shl32(t11,0x........:I8),0x...
> > .....:I32)),DivF64(Xor32(t11,And32(Shl32(t11,0x........:I8),0x........:I32)),
> > F64i{0x........},t9)))

> +   t8:V128   t9:F64   t10:F64   t11:I32   t12:I32   t13:I32   t14:I64  

> So t13 has type I32, but ReinterpF64asI64 returns an I64.
> 
> I think you had wanted to use Iop_ReinterpF32asI32.

RoundF64toF32 has a bit of a misleading name.  The result type is still F64,
but the value itself is rounded to F32 range:

  typeOfPrimop(..) contains:
      case Iop_RoundF64toF32:
         BINARY(ity_RMode,Ity_F64, Ity_F64);

So I think in this case, t13 (is that the two-be-used-twice value?) needs
to be an I64.  That is, be initialised with newTemp(Ity_I64).
Comment 15 Julian Seward 2019-04-03 05:02:07 UTC
Hmm, ignore my previous suggestion.  What's evident from the
failure message is that

1. t13 is declared to be an I32

2. however, it is assigned an I64 value, as created by ReinterpF64asI64,
   which is why the sanity check failed

3. it is then used as an I32 value: 32HLto64(t13,t13).

So (1) and (3) are consistent but (2) isn't.

+   t8:V128   t9:F64   t10:F64   t11:I32   t12:I32   t13:I32   t14:I64  

+   t13 = ReinterpF64asI64(RoundF64toF32(Xor32(t11,And32(Shl32(t11,0x........:I8),0x........:I32)),DivF64(Xor32(t11,And32(Shl32(t11,0x........:I8),0x........:I32)),F64i{0x........},t9)))
+   PUT(784) = 64HLtoV128(ReinterpF64asI64(32HLto64(t13,t13)),0x........:I64)
Comment 16 Will Schmidt 2019-04-03 19:44:07 UTC
Created attachment 119233 [details]
proposed patch (v4)

Updated per feedback received.  tested OK on P7,P8LE,P8BE,P9 (LE).
Comment 17 Julian Seward 2019-04-03 21:02:56 UTC
Looks good to me.  Land!
Comment 18 Carl Love 2019-04-04 17:35:17 UTC
I reviewed and tested the attached patch to fix this issue.  The patch fixes this bug and bugzilla 401827.  The patch was tested on Power 7, Power 8 LE, Power 8 BE, Power 9.

The patch commit:

commit 82e94fff802aece376d5ca8458ef49d24afd7bdf
Author: Carl Love <carll@us.ibm.com>
Date:   Thu Apr 4 12:31:05 2019 -0500

    PPC64, patch to test case issues reported in bugzilla 401827 and 401828.