Bug 427404 - PPC ISA 3.1 support is missing, part 6
Summary: PPC ISA 3.1 support is missing, part 6
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: vex (other bugs)
Version First Reported In: unspecified
Platform: Other Linux
: NOR normal
Target Milestone: ---
Assignee: Julian Seward
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-10-06 22:16 UTC by Carl Love
Modified: 2020-11-19 23:13 UTC (History)
2 users (show)

See Also:
Latest Commit:
Version Fixed/Implemented In:
Sentry Crash Report:


Attachments
Functional support fix (4.35 KB, patch)
2020-10-06 22:17 UTC, Carl Love
Details
testsuite support for the reduced precision outer product operations (585.60 KB, patch)
2020-10-06 22:18 UTC, Carl Love
Details
Funtional support ISA 3.1 for reduced precision outer product operations (100.14 KB, patch)
2020-10-06 22:18 UTC, Carl Love
Details
functional support ISA 3.1 for reduced precision outer product operations (91.59 KB, patch)
2020-10-21 21:28 UTC, Carl Love
Details
functional support ISA 3.1 for reduced precision outer product operations (91.31 KB, patch)
2020-10-26 01:27 UTC, Carl Love
Details
functional support ISA 3.1 for reduced precision outer product operations (115.89 KB, patch)
2020-10-28 15:48 UTC, Carl Love
Details
function support ISA 3.1 for reduced precision outer product operations (91.01 KB, patch)
2020-11-05 18:06 UTC, Carl Love
Details
Changes to make the setup_fxstate_struct() more precise. (4.59 KB, patch)
2020-11-05 20:55 UTC, Carl Love
Details
function support ISA 3.1 for reduced precision outer product operations (92.38 KB, patch)
2020-11-05 21:01 UTC, Carl Love
Details
functional support ISA 3.1 for reduced precision outer product operations (92.45 KB, patch)
2020-11-10 18:29 UTC, Carl Love
Details
functional support fix (3.14 KB, patch)
2020-11-10 18:31 UTC, Carl Love
Details
Fix for the xscvhpdp assembler line issue (2.72 KB, patch)
2020-11-12 22:20 UTC, Carl Love
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Carl Love 2020-10-06 22:16:17 UTC
Part 6 of the ISA 3.1 support

Previous patch set is in https://bugs.kde.org/show_bug.cgi?id=427401
Comment 1 Carl Love 2020-10-06 22:17:35 UTC
Created attachment 132180 [details]
Functional support fix
Comment 2 Carl Love 2020-10-06 22:18:08 UTC
Created attachment 132181 [details]
testsuite support for the reduced precision outer product operations
Comment 3 Carl Love 2020-10-06 22:18:53 UTC
Created attachment 132182 [details]
Funtional support ISA 3.1 for reduced precision outer product operations
Comment 4 Carl Love 2020-10-06 22:36:50 UTC
The functional support for the inner product support uses a clean helper to implement the functionality.  Total there will be something like 30 of these outer product instructions that need to be supported.  They are similar to a matrix multiply operation which is rather complicated.  It seems very unlikely that another architecture will implement something similar so adding a bunch of ppc specific IOps is not appealing.  

The other issue is these instructions read and write the results to the new accumulator registers.  There are eight 512-bit accumulators that are read/written as a 128-bit word.  Each of the eight Accumulators are associated with VSR registers.  For example ACC[0][0] is associated with VSR[0], i.e. the first 128-bit word in ACC zero is associated with the first VSR0, ACC[0][1] is associated with VSR1, ACC[0][2] is associated with VSR2, ACC[0][3] is associated with VSR3, ACC[1][0] is associated with VSR4, etc.

There are instructions for moving values to/from ACC[i] and its corressponding VSRs.  If we were to try and implement these instructions as IOps, it would require a bit of overhead to setup the guest ACC[i] entry before issuing the instruction and then saving the results to the guest ACC[i] entry.  I think it would be possible but a little messy.

At this point, it appears these instructions will only show up in hand coded library routines.  These are not instructions that GCC will issue for normal user code.  So anyway, it seems like doing these as clean helpers is reasonable instead of creating a bunch of new IOps.

The one issue with the implementation is the clean helper interface can only return 64-bits.  So the helper is called twice, the first time it returns the upper 64-bits of the result, the second time the lower 64-bit of the result.  Not an ideal solution.  I did look at trying to extend the clean helper interface to return 128-bits but that did not appear to be very straight forward.

Anyway, the code review needs to decide if this is really the best implementation or not for these instructions given the issues outlined above.
Comment 5 Carl Love 2020-10-21 21:28:17 UTC
Created attachment 132620 [details]
functional support ISA 3.1 for reduced precision outer product operations

Updated the patch to use dirty helpers for the outer product instructions rather then the clean helper.  The advantage is the helper is only called once and the 128-bit result is written directly to the guest state.  Previous implementation used a clean helper that was called twice to calculate the 128-bit result but then only return the upper or lower 64-bits of the result.
Comment 6 Julian Seward 2020-10-23 16:13:16 UTC
Functional support fix (attachment 132180 [details]) -- OK to land.

--------------------------------------------------------------------

testsuite support for the reduced precision outer product operations
  (attachment 132181 [details]) -- OK to land.

--------------------------------------------------------------------

functional support ISA 3.1 for reduced precision outer product operations
  (attachment 132620 [details])
  I have some correctness concerns regarding setup_fxstate_struct and
  some other comments.

setup_fxstate_struct: this worries me.  It claims that 4 regs are written,
  but not read.  Is that really correct -- that none of the regs are read?
  How can that be useful?

write_ACC_entry
get_ACC_entry
  better to crash the system if `acc` or `reg` is out of range, rather than printing
  an error message and continuing incorrectly.  Maybe just convert the

  if ((acc < 0) || (acc > 7)) {

  etc into the relevant assert: vassert(acc >= 0 && acc <= 7) etc

putACC
getACC
  similarly, assert if index or reg are invalid.

setup_fxstate_struct
  similarly, assert if AT is out of range (+ couldn't you use a switch instead
  of a chain of ifs?

dis_vsx_accumulator_prefix
  use switch instead of long chain of ifs?
Comment 7 Carl Love 2020-10-26 01:27:57 UTC
Created attachment 132743 [details]
functional support ISA 3.1 for reduced precision outer product operations

Addressed the comments from Julian to use switch instead of if else if.  Use asserts to check for bad inputs in the functions
Comment 8 Carl Love 2020-10-28 15:48:00 UTC
Created attachment 132844 [details]
functional support ISA 3.1 for reduced precision outer product operations

Updated patch to be consistent with the prior patch change for the popcount helper.
Comment 9 Julian Seward 2020-11-05 17:52:47 UTC
(In reply to Carl Love from comment #8)
> Created attachment 132844 [details]
> functional support ISA 3.1 for reduced precision outer product operations
> 
> Updated patch to be consistent with the prior patch change for the popcount
> helper.

Two things:

* I tried to find the function setup_fxstate_struct in this, but failed to.
  What happened to it?

* There is really a lot of debug printing in fns vsx_matrix_8bit_ger_helper
  and many others.  It makes those functions hard to read.  I'd prefer if you
  make all the test cases pass, and then remove the debug printing.
Comment 10 Carl Love 2020-11-05 18:06:08 UTC
Created attachment 133058 [details]
function support ISA 3.1 for reduced precision outer product operations

Update the patch.  Re-loaded the old file the last time.
Comment 11 Julian Seward 2020-11-05 18:33:46 UTC
OK to land provided that setup_fxstate_struct Read/Write/Modify annotations are
set more precisely, so as to avoid false positives and false negatives from 
Memcheck.
Comment 12 Carl Love 2020-11-05 20:55:58 UTC
Created attachment 133068 [details]
Changes to make the setup_fxstate_struct() more precise.

The following changes were made to the Functional support ISA 3.1 for reduced precision outer product operations to make the setup_fxstate_struct() more precise.
Comment 13 Carl Love 2020-11-05 21:01:03 UTC
Created attachment 133070 [details]
function support ISA 3.1 for reduced precision outer product operations

Updated functional support for reduced precision outer product operations.  The patch has the patch "Changes to make the setup_fxstate_struct() more precise merged in to it.
Comment 14 Carl Love 2020-11-10 18:29:09 UTC
Created attachment 133206 [details]
functional support ISA 3.1 for reduced precision outer product operations

Update the patch.  Fix for getACC() to eliminate compiler warnings.
Comment 15 Carl Love 2020-11-10 18:31:00 UTC
Created attachment 133208 [details]
functional support fix

cleanup of the patch
Comment 16 Carl Love 2020-11-10 18:37:58 UTC
Patches committed

commit d4cfcf14a083dcb54e0fe64ff4ae9b0b635c8e69 (HEAD -> master, origin/master, origin/HEAD)
Author: Carl Love <cel@us.ibm.com>
Date:   Tue Oct 6 12:14:45 2020 -0500

    Reduced Precision Outer Product Operation tests

commit be7da5401783a32149bd7dc31b49f3d72d4a2b10
Author: Carl Love <cel@us.ibm.com>
Date:   Tue Sep 29 13:29:34 2020 -0500

    Fix, add ISA 3.1 check to set ISA 3.1 in Valgrind hwcaps value

commit 092e5620d40d54bc1ab6a77c895fc18b0c86c6a9
Author: Carl Love <cel@us.ibm.com>
Date:   Fri Sep 25 16:54:12 2020 -0500

    ISA 3.1 Reduced-Precision: Outer Product Operations
    
    Add support for:
    
     pmxvf16ger2 Prefixed Masked VSX Vector 16-bit Floating-Point GER (rank-2 update)
     pmxvf16ger2nn Prefixed Masked VSX Vector 16-bit Floating-Point GER (rank-2 update) (Negative multiply, Negative accumulate)
     pmxvf16ger2np Prefixed Masked VSX Vector 16-bit Floating-Point GER (rank-2 update) (Negative multiply, Positive accumulate)
     pmxvf16ger2pn Prefixed Masked VSX Vector 16-bit Floating-Point GER (rank-2 update) (Positive multiply, Negative accumulate)
     pmxvf16ger2pp Prefixed Masked VSX Vector 16-bit Floating-Point GER (rank-2 update) (Positive multiply, Positive accumulate)
     pmxvf32ger Prefixed Masked VSX Vector 32-bit Floating-Point GER (rank-1 update)
     pmxvf32gernn Prefixed Masked VSX Vector 32-bit Floating-Point GER (rank-1 update) (Negative multiply, Negative accumulate)
     pmxvf32gernp Prefixed Masked VSX Vector 32-bit Floating-Point GER (rank-1 update) (Negative multiply, Positive accumulate)
     pmxvf32gerpn Prefixed Masked VSX Vector 32-bit Floating-Point GER (rank-1 update) (Positive multiply, Negative accumulate)
     pmxvf32gerpp Prefixed Masked VSX Vector 32-bit Floating-Point GER (rank-1 update) (Positive multiply, Positive accumulate)
     pmxvf64ger Prefixed Masked VSX Vector 64-bit Floating-Point GER (rank-1 update)
     pmxvf64gernn Prefixed Masked VSX Vector 64-bit Floating-Point GER (rank-1 update) (Negative multiply, Negative accumulate)
     pmxvf64gernp Prefixed Masked VSX Vector 64-bit Floating-Point GER (rank-1 update) (Negative multiply, Positive accumulate)
     pmxvf64gerpn Prefixed Masked VSX Vector 64-bit Floating-Point GER (rank-1 update) (Positive multiply, Negative accumulate)
     pmxvf64gerpp Prefixed Masked VSX Vector 64-bit Floating-Point GER (rank-1 update) (Positive multiply, Positive accumulate)
     pmxvi16ger2s Prefixed Masked VSX Vector 16-bit Signed Integer GER (rank-2 update) with Saturation
     pmxvi16ger2spp Prefixed Masked VSX Vector 16-bit Signed Integer GER (rank-2 update) with Saturation (Positive multiply, Positive accumulate)
     pmxvi4ger8 Prefixed Masked VSX Vector 4-bit Signed Integer GER (rank-8 update)
     pmxvi4ger8pp Prefixed Masked VSX Vector 4-bit Signed Integer GER (rank-8 update) (Positive multiply, Positive accumulate)
     pmxvi8ger4 Prefixed Masked VSX Vector 8-bit Signed/Unsigned Integer GER (rank-4 update)
     pmxvi8ger4pp Prefixed Masked VSX Vector 8-bit Signed/Unsigned Integer GER (rank-4 update) (Positive multiply, Positive accumulate)
    
     xvf16ger2 VSX Vector 16-bit Floating-Point GER (rank-2 update)
     xvf16ger2nn VSX Vector 16-bit Floating-Point GER (rank-2 update) (Positive multiply, Positive accumulate)
     xvf16ger2np VSX Vector 16-bit Floating-Point GER (rank-2 update) (Negative multiply, Positive accumulate)
     xvf16ger2pn VSX Vector 16-bit Floating-Point GER (rank-2 update) (Positive multiply, Negative accumulate)
     xvf16ger2pp VSX Vector 16-bit Floating-Point GER (rank-2 update) (Positive multiply, Positive accumulate)
     xvf32ger VSX Vector 32-bit Floating-Point GER (rank-1 update)
     xvf32gernn VSX Vector 32-bit Floating-Point GER (rank-1 update) (Negative multiply, Negative accumulate)
     xvf32gernp VSX Vector 32-bit Floating-Point GER (rank-1 update) (Negative multiply, Positive accumulate)
     xvf32gerpn VSX Vector 32-bit Floating-Point GER (rank-1 update) (Positive multiply, Negative accumulate)
     xvf32gerpp VSX Vector 32-bit Floating-Point GER (rank-1 update) (Positive multiply, Positive accumulate)
     xvf64ger VSX Vector 64-bit Floating-Point GER (rank-1 update)
     xvf64gernn VSX Vector 64-bit Floating-Point GER (rank-1 update) (Negative multiply, Negative accumulate)
     xvf64gernp VSX Vector 64-bit Floating-Point GER (rank-1 update) (Negative multiply, Positive accumulate)
     xvf64gerpn VSX Vector 64-bit Floating-Point GER (rank-1 update) (Positive multiply, Negative accumulate)
     xvf64gerpp VSX Vector 64-bit Floating-Point GER (rank-1 update) (Positive multiply, Positive accumulate)
     xvi16ger2s VSX Vector 16-bit Signed Integer GER (rank-2 update) with Saturation
     xvi16ger2spp VSX Vector 16-bit Signed Integer GER (rank-2 update) with Saturation (Positive multiply, Positive accumulate)
     xvi4ger8 VSX Vector 4-bit Signed Integer GER (rank-8 update)
     xvi4ger8pp VSX Vector 4-bit Signed Integer GER (rank-8 update) (Positive multiply, Positive accumulate)
     xvi8ger4 VSX Vector 8-bit Signed/Unsigned Integer GER (rank-4 update)
     xvi8ger4pp VSX Vector 8-bit Signed/Unsigned Integer GER (rank-4 update) (Positive multiply, Positive accumulate)
    
     xxmfacc VSX Move From ACC
     xxmtacc VSX Move To ACC
     xxsetaccz VSX Set ACC to Zero
Comment 17 Carl Love 2020-11-10 20:27:20 UTC
The following commit was made to fix an issue with the functional support ISA 3.1 for reduced precision outer product operations support.

https://sourceware.org/git/gitweb.cgi?p=valgrind.git;h=eb82a294573d15c1be663673d55b559a82ca29d3

commit eb82a294573d15c1be663673d55b559a82ca29d3
Author: Julian Seward <jseward@acm.org>
Date:   Tue Nov 10 21:10:48 2020 +0100

    Add a missing ifdef, whose absence caused build breakage on non-POWER targets.

Diff:
---
 VEX/priv/guest_ppc_helpers.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/VEX/priv/guest_ppc_helpers.c b/VEX/priv/guest_ppc_helpers.c
index 45dce63512..ac4d7044a8 100644
--- a/VEX/priv/guest_ppc_helpers.c
+++ b/VEX/priv/guest_ppc_helpers.c
@@ -1112,12 +1112,16 @@ static ULong reinterpret_double_as_long( Double input )
 
 static Double conv_f16_to_double( ULong input )
 {
-  // This all seems to be very alignment sensitive??
-  __attribute__ ((aligned (64))) ULong src;
-  __attribute__ ((aligned (64))) Double result;
-  src = input;
-  __asm__ __volatile__ ("xscvhpdp %x0,%x1" : "=wa" (result) : "wa" (src));
-  return result;
+#  if defined(__powerpc__)
+   // This all seems to be very alignment sensitive??
+   __attribute__ ((aligned (64))) ULong src;
+   __attribute__ ((aligned (64))) Double result;
+   src = input;
+   __asm__ __volatile__ ("xscvhpdp %x0,%x1" : "=wa" (result) : "wa" (src));
+   return result;
+#  else
+   return 0.0;
+#  endif
 }


______________________
Comment 18 Mark Wielaard 2020-11-11 10:00:40 UTC
(In reply to Carl Love from comment #17)
> The following commit was made to fix an issue with the functional support
> ISA 3.1 for reduced precision outer product operations support.
> 
> https://sourceware.org/git/gitweb.cgi?p=valgrind.git;
> h=eb82a294573d15c1be663673d55b559a82ca29d3
> 
> commit eb82a294573d15c1be663673d55b559a82ca29d3
> Author: Julian Seward <jseward@acm.org>
> Date:   Tue Nov 10 21:10:48 2020 +0100
> 
>     Add a missing ifdef, whose absence caused build breakage on non-POWER
> targets.
> 
> Diff:
> ---
>  VEX/priv/guest_ppc_helpers.c | 16 ++++++++++------
>  1 file changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/VEX/priv/guest_ppc_helpers.c b/VEX/priv/guest_ppc_helpers.c
> index 45dce63512..ac4d7044a8 100644
> --- a/VEX/priv/guest_ppc_helpers.c
> +++ b/VEX/priv/guest_ppc_helpers.c
> @@ -1112,12 +1112,16 @@ static ULong reinterpret_double_as_long( Double
> input )
>  
>  static Double conv_f16_to_double( ULong input )
>  {
> -  // This all seems to be very alignment sensitive??
> -  __attribute__ ((aligned (64))) ULong src;
> -  __attribute__ ((aligned (64))) Double result;
> -  src = input;
> -  __asm__ __volatile__ ("xscvhpdp %x0,%x1" : "=wa" (result) : "wa" (src));
> -  return result;
> +#  if defined(__powerpc__)
> +   // This all seems to be very alignment sensitive??
> +   __attribute__ ((aligned (64))) ULong src;
> +   __attribute__ ((aligned (64))) Double result;
> +   src = input;
> +   __asm__ __volatile__ ("xscvhpdp %x0,%x1" : "=wa" (result) : "wa" (src));
> +   return result;
> +#  else
> +   return 0.0;
> +#  endif
>  }

This works for everything except ppc64:
https://builder.wildebeest.org/buildbot/#/changes/1519

Is ppc64 supposed to support xscvhpdp, or is that ppc64le only?
If it depends on the binutils version (the ppc64 setup uses binutils 2.29.1 while the ppc64le setup uses binutils 2.31.1) maybe we need a configure check?
Comment 19 Carl Love 2020-11-11 16:12:15 UTC
Function conv_f16_to_double() uses instruction xscvhpdp which is an ISA 3.1 instruction.  The function is used in the ISA 3.1 instruction support.

I will work on a fix.
Comment 20 Carl Love 2020-11-12 22:20:25 UTC
Created attachment 133275 [details]
Fix for the xscvhpdp assembler line issue

The fix in comment 17 took care of the issues for non-Power platforms.  Unfortunately, older Power platforms that don't support the xscvhpdp instruction still fail.  The attached patch addresses the issue for the Power platforms.
Comment 21 Mark Wielaard 2020-11-13 12:07:03 UTC
(In reply to Carl Love from comment #20)
> Created attachment 133275 [details]
> Fix for the xscvhpdp assembler line issue
> 
> The fix in comment 17 took care of the issues for non-Power platforms. 
> Unfortunately, older Power platforms that don't support the xscvhpdp
> instruction still fail.  The attached patch addresses the issue for the
> Power platforms.

Looks good to me, and should also work fine for non-Power platforms, since then HAS_XSCVHPDP will simply never be defined.
Comment 22 Carl Love 2020-11-19 16:01:14 UTC
Committed patch to fux the xscvhpdp assembly code issue.

commit 025bdca23b2fca0dd68b6077426500237aee85c5 (origin/master, origin/HEAD)
Author: Carl Love <cel@us.ibm.com>
Date:   Thu Nov 12 15:18:23 2020 -0600

    PowerPC, fix for conv_f16_to_double xscvhpdp assembler code
    
    The previous commit:
    
      commit eb82a294573d15c1be663673d55b559a82ca29d3
      Author: Julian Seward <jseward@acm.org>
      Date:   Tue Nov 10 21:10:48 2020 +0100
    
          Add a missing ifdef, whose absence caused build breakage on non-POWER targets.
    
    fixed the compile issue in conv_f16_to_double() where non-Power platforms
    do not support the power xscvhpdp assembly instructions.  The instruction
    is supported by ISA 3.0 platforms.  Older Power platforms still fail to
    compile with the assembly instruction.  This patch fixes the if def for
    power systems that do not support ISA 3.0.
Comment 23 Carl Love 2020-11-19 16:01:49 UTC
Next set of patches is in bugzilla 429352.
Comment 24 Carl Love 2020-11-19 23:13:33 UTC
The gdb support for debugging the user application on Valgrind appears to have been broken by the reduced precision outer product patch. 

It looks like the addition of the new ACC registers is the issue.  Need to update the gdbserver files to add the ACC register descriptions and size of data being transferred.