Part 6 of the ISA 3.1 support Previous patch set is in https://bugs.kde.org/show_bug.cgi?id=427401
Created attachment 132180 [details] Functional support fix
Created attachment 132181 [details] testsuite support for the reduced precision outer product operations
Created attachment 132182 [details] Funtional support ISA 3.1 for reduced precision outer product operations
The functional support for the inner product support uses a clean helper to implement the functionality. Total there will be something like 30 of these outer product instructions that need to be supported. They are similar to a matrix multiply operation which is rather complicated. It seems very unlikely that another architecture will implement something similar so adding a bunch of ppc specific IOps is not appealing. The other issue is these instructions read and write the results to the new accumulator registers. There are eight 512-bit accumulators that are read/written as a 128-bit word. Each of the eight Accumulators are associated with VSR registers. For example ACC[0][0] is associated with VSR[0], i.e. the first 128-bit word in ACC zero is associated with the first VSR0, ACC[0][1] is associated with VSR1, ACC[0][2] is associated with VSR2, ACC[0][3] is associated with VSR3, ACC[1][0] is associated with VSR4, etc. There are instructions for moving values to/from ACC[i] and its corressponding VSRs. If we were to try and implement these instructions as IOps, it would require a bit of overhead to setup the guest ACC[i] entry before issuing the instruction and then saving the results to the guest ACC[i] entry. I think it would be possible but a little messy. At this point, it appears these instructions will only show up in hand coded library routines. These are not instructions that GCC will issue for normal user code. So anyway, it seems like doing these as clean helpers is reasonable instead of creating a bunch of new IOps. The one issue with the implementation is the clean helper interface can only return 64-bits. So the helper is called twice, the first time it returns the upper 64-bits of the result, the second time the lower 64-bit of the result. Not an ideal solution. I did look at trying to extend the clean helper interface to return 128-bits but that did not appear to be very straight forward. Anyway, the code review needs to decide if this is really the best implementation or not for these instructions given the issues outlined above.
Created attachment 132620 [details] functional support ISA 3.1 for reduced precision outer product operations Updated the patch to use dirty helpers for the outer product instructions rather then the clean helper. The advantage is the helper is only called once and the 128-bit result is written directly to the guest state. Previous implementation used a clean helper that was called twice to calculate the 128-bit result but then only return the upper or lower 64-bits of the result.
Functional support fix (attachment 132180 [details]) -- OK to land. -------------------------------------------------------------------- testsuite support for the reduced precision outer product operations (attachment 132181 [details]) -- OK to land. -------------------------------------------------------------------- functional support ISA 3.1 for reduced precision outer product operations (attachment 132620 [details]) I have some correctness concerns regarding setup_fxstate_struct and some other comments. setup_fxstate_struct: this worries me. It claims that 4 regs are written, but not read. Is that really correct -- that none of the regs are read? How can that be useful? write_ACC_entry get_ACC_entry better to crash the system if `acc` or `reg` is out of range, rather than printing an error message and continuing incorrectly. Maybe just convert the if ((acc < 0) || (acc > 7)) { etc into the relevant assert: vassert(acc >= 0 && acc <= 7) etc putACC getACC similarly, assert if index or reg are invalid. setup_fxstate_struct similarly, assert if AT is out of range (+ couldn't you use a switch instead of a chain of ifs? dis_vsx_accumulator_prefix use switch instead of long chain of ifs?
Created attachment 132743 [details] functional support ISA 3.1 for reduced precision outer product operations Addressed the comments from Julian to use switch instead of if else if. Use asserts to check for bad inputs in the functions
Created attachment 132844 [details] functional support ISA 3.1 for reduced precision outer product operations Updated patch to be consistent with the prior patch change for the popcount helper.
(In reply to Carl Love from comment #8) > Created attachment 132844 [details] > functional support ISA 3.1 for reduced precision outer product operations > > Updated patch to be consistent with the prior patch change for the popcount > helper. Two things: * I tried to find the function setup_fxstate_struct in this, but failed to. What happened to it? * There is really a lot of debug printing in fns vsx_matrix_8bit_ger_helper and many others. It makes those functions hard to read. I'd prefer if you make all the test cases pass, and then remove the debug printing.
Created attachment 133058 [details] function support ISA 3.1 for reduced precision outer product operations Update the patch. Re-loaded the old file the last time.
OK to land provided that setup_fxstate_struct Read/Write/Modify annotations are set more precisely, so as to avoid false positives and false negatives from Memcheck.
Created attachment 133068 [details] Changes to make the setup_fxstate_struct() more precise. The following changes were made to the Functional support ISA 3.1 for reduced precision outer product operations to make the setup_fxstate_struct() more precise.
Created attachment 133070 [details] function support ISA 3.1 for reduced precision outer product operations Updated functional support for reduced precision outer product operations. The patch has the patch "Changes to make the setup_fxstate_struct() more precise merged in to it.
Created attachment 133206 [details] functional support ISA 3.1 for reduced precision outer product operations Update the patch. Fix for getACC() to eliminate compiler warnings.
Created attachment 133208 [details] functional support fix cleanup of the patch
Patches committed commit d4cfcf14a083dcb54e0fe64ff4ae9b0b635c8e69 (HEAD -> master, origin/master, origin/HEAD) Author: Carl Love <cel@us.ibm.com> Date: Tue Oct 6 12:14:45 2020 -0500 Reduced Precision Outer Product Operation tests commit be7da5401783a32149bd7dc31b49f3d72d4a2b10 Author: Carl Love <cel@us.ibm.com> Date: Tue Sep 29 13:29:34 2020 -0500 Fix, add ISA 3.1 check to set ISA 3.1 in Valgrind hwcaps value commit 092e5620d40d54bc1ab6a77c895fc18b0c86c6a9 Author: Carl Love <cel@us.ibm.com> Date: Fri Sep 25 16:54:12 2020 -0500 ISA 3.1 Reduced-Precision: Outer Product Operations Add support for: pmxvf16ger2 Prefixed Masked VSX Vector 16-bit Floating-Point GER (rank-2 update) pmxvf16ger2nn Prefixed Masked VSX Vector 16-bit Floating-Point GER (rank-2 update) (Negative multiply, Negative accumulate) pmxvf16ger2np Prefixed Masked VSX Vector 16-bit Floating-Point GER (rank-2 update) (Negative multiply, Positive accumulate) pmxvf16ger2pn Prefixed Masked VSX Vector 16-bit Floating-Point GER (rank-2 update) (Positive multiply, Negative accumulate) pmxvf16ger2pp Prefixed Masked VSX Vector 16-bit Floating-Point GER (rank-2 update) (Positive multiply, Positive accumulate) pmxvf32ger Prefixed Masked VSX Vector 32-bit Floating-Point GER (rank-1 update) pmxvf32gernn Prefixed Masked VSX Vector 32-bit Floating-Point GER (rank-1 update) (Negative multiply, Negative accumulate) pmxvf32gernp Prefixed Masked VSX Vector 32-bit Floating-Point GER (rank-1 update) (Negative multiply, Positive accumulate) pmxvf32gerpn Prefixed Masked VSX Vector 32-bit Floating-Point GER (rank-1 update) (Positive multiply, Negative accumulate) pmxvf32gerpp Prefixed Masked VSX Vector 32-bit Floating-Point GER (rank-1 update) (Positive multiply, Positive accumulate) pmxvf64ger Prefixed Masked VSX Vector 64-bit Floating-Point GER (rank-1 update) pmxvf64gernn Prefixed Masked VSX Vector 64-bit Floating-Point GER (rank-1 update) (Negative multiply, Negative accumulate) pmxvf64gernp Prefixed Masked VSX Vector 64-bit Floating-Point GER (rank-1 update) (Negative multiply, Positive accumulate) pmxvf64gerpn Prefixed Masked VSX Vector 64-bit Floating-Point GER (rank-1 update) (Positive multiply, Negative accumulate) pmxvf64gerpp Prefixed Masked VSX Vector 64-bit Floating-Point GER (rank-1 update) (Positive multiply, Positive accumulate) pmxvi16ger2s Prefixed Masked VSX Vector 16-bit Signed Integer GER (rank-2 update) with Saturation pmxvi16ger2spp Prefixed Masked VSX Vector 16-bit Signed Integer GER (rank-2 update) with Saturation (Positive multiply, Positive accumulate) pmxvi4ger8 Prefixed Masked VSX Vector 4-bit Signed Integer GER (rank-8 update) pmxvi4ger8pp Prefixed Masked VSX Vector 4-bit Signed Integer GER (rank-8 update) (Positive multiply, Positive accumulate) pmxvi8ger4 Prefixed Masked VSX Vector 8-bit Signed/Unsigned Integer GER (rank-4 update) pmxvi8ger4pp Prefixed Masked VSX Vector 8-bit Signed/Unsigned Integer GER (rank-4 update) (Positive multiply, Positive accumulate) xvf16ger2 VSX Vector 16-bit Floating-Point GER (rank-2 update) xvf16ger2nn VSX Vector 16-bit Floating-Point GER (rank-2 update) (Positive multiply, Positive accumulate) xvf16ger2np VSX Vector 16-bit Floating-Point GER (rank-2 update) (Negative multiply, Positive accumulate) xvf16ger2pn VSX Vector 16-bit Floating-Point GER (rank-2 update) (Positive multiply, Negative accumulate) xvf16ger2pp VSX Vector 16-bit Floating-Point GER (rank-2 update) (Positive multiply, Positive accumulate) xvf32ger VSX Vector 32-bit Floating-Point GER (rank-1 update) xvf32gernn VSX Vector 32-bit Floating-Point GER (rank-1 update) (Negative multiply, Negative accumulate) xvf32gernp VSX Vector 32-bit Floating-Point GER (rank-1 update) (Negative multiply, Positive accumulate) xvf32gerpn VSX Vector 32-bit Floating-Point GER (rank-1 update) (Positive multiply, Negative accumulate) xvf32gerpp VSX Vector 32-bit Floating-Point GER (rank-1 update) (Positive multiply, Positive accumulate) xvf64ger VSX Vector 64-bit Floating-Point GER (rank-1 update) xvf64gernn VSX Vector 64-bit Floating-Point GER (rank-1 update) (Negative multiply, Negative accumulate) xvf64gernp VSX Vector 64-bit Floating-Point GER (rank-1 update) (Negative multiply, Positive accumulate) xvf64gerpn VSX Vector 64-bit Floating-Point GER (rank-1 update) (Positive multiply, Negative accumulate) xvf64gerpp VSX Vector 64-bit Floating-Point GER (rank-1 update) (Positive multiply, Positive accumulate) xvi16ger2s VSX Vector 16-bit Signed Integer GER (rank-2 update) with Saturation xvi16ger2spp VSX Vector 16-bit Signed Integer GER (rank-2 update) with Saturation (Positive multiply, Positive accumulate) xvi4ger8 VSX Vector 4-bit Signed Integer GER (rank-8 update) xvi4ger8pp VSX Vector 4-bit Signed Integer GER (rank-8 update) (Positive multiply, Positive accumulate) xvi8ger4 VSX Vector 8-bit Signed/Unsigned Integer GER (rank-4 update) xvi8ger4pp VSX Vector 8-bit Signed/Unsigned Integer GER (rank-4 update) (Positive multiply, Positive accumulate) xxmfacc VSX Move From ACC xxmtacc VSX Move To ACC xxsetaccz VSX Set ACC to Zero
The following commit was made to fix an issue with the functional support ISA 3.1 for reduced precision outer product operations support. https://sourceware.org/git/gitweb.cgi?p=valgrind.git;h=eb82a294573d15c1be663673d55b559a82ca29d3 commit eb82a294573d15c1be663673d55b559a82ca29d3 Author: Julian Seward <jseward@acm.org> Date: Tue Nov 10 21:10:48 2020 +0100 Add a missing ifdef, whose absence caused build breakage on non-POWER targets. Diff: --- VEX/priv/guest_ppc_helpers.c | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/VEX/priv/guest_ppc_helpers.c b/VEX/priv/guest_ppc_helpers.c index 45dce63512..ac4d7044a8 100644 --- a/VEX/priv/guest_ppc_helpers.c +++ b/VEX/priv/guest_ppc_helpers.c @@ -1112,12 +1112,16 @@ static ULong reinterpret_double_as_long( Double input ) static Double conv_f16_to_double( ULong input ) { - // This all seems to be very alignment sensitive?? - __attribute__ ((aligned (64))) ULong src; - __attribute__ ((aligned (64))) Double result; - src = input; - __asm__ __volatile__ ("xscvhpdp %x0,%x1" : "=wa" (result) : "wa" (src)); - return result; +# if defined(__powerpc__) + // This all seems to be very alignment sensitive?? + __attribute__ ((aligned (64))) ULong src; + __attribute__ ((aligned (64))) Double result; + src = input; + __asm__ __volatile__ ("xscvhpdp %x0,%x1" : "=wa" (result) : "wa" (src)); + return result; +# else + return 0.0; +# endif } ______________________
(In reply to Carl Love from comment #17) > The following commit was made to fix an issue with the functional support > ISA 3.1 for reduced precision outer product operations support. > > https://sourceware.org/git/gitweb.cgi?p=valgrind.git; > h=eb82a294573d15c1be663673d55b559a82ca29d3 > > commit eb82a294573d15c1be663673d55b559a82ca29d3 > Author: Julian Seward <jseward@acm.org> > Date: Tue Nov 10 21:10:48 2020 +0100 > > Add a missing ifdef, whose absence caused build breakage on non-POWER > targets. > > Diff: > --- > VEX/priv/guest_ppc_helpers.c | 16 ++++++++++------ > 1 file changed, 10 insertions(+), 6 deletions(-) > > diff --git a/VEX/priv/guest_ppc_helpers.c b/VEX/priv/guest_ppc_helpers.c > index 45dce63512..ac4d7044a8 100644 > --- a/VEX/priv/guest_ppc_helpers.c > +++ b/VEX/priv/guest_ppc_helpers.c > @@ -1112,12 +1112,16 @@ static ULong reinterpret_double_as_long( Double > input ) > > static Double conv_f16_to_double( ULong input ) > { > - // This all seems to be very alignment sensitive?? > - __attribute__ ((aligned (64))) ULong src; > - __attribute__ ((aligned (64))) Double result; > - src = input; > - __asm__ __volatile__ ("xscvhpdp %x0,%x1" : "=wa" (result) : "wa" (src)); > - return result; > +# if defined(__powerpc__) > + // This all seems to be very alignment sensitive?? > + __attribute__ ((aligned (64))) ULong src; > + __attribute__ ((aligned (64))) Double result; > + src = input; > + __asm__ __volatile__ ("xscvhpdp %x0,%x1" : "=wa" (result) : "wa" (src)); > + return result; > +# else > + return 0.0; > +# endif > } This works for everything except ppc64: https://builder.wildebeest.org/buildbot/#/changes/1519 Is ppc64 supposed to support xscvhpdp, or is that ppc64le only? If it depends on the binutils version (the ppc64 setup uses binutils 2.29.1 while the ppc64le setup uses binutils 2.31.1) maybe we need a configure check?
Function conv_f16_to_double() uses instruction xscvhpdp which is an ISA 3.1 instruction. The function is used in the ISA 3.1 instruction support. I will work on a fix.
Created attachment 133275 [details] Fix for the xscvhpdp assembler line issue The fix in comment 17 took care of the issues for non-Power platforms. Unfortunately, older Power platforms that don't support the xscvhpdp instruction still fail. The attached patch addresses the issue for the Power platforms.
(In reply to Carl Love from comment #20) > Created attachment 133275 [details] > Fix for the xscvhpdp assembler line issue > > The fix in comment 17 took care of the issues for non-Power platforms. > Unfortunately, older Power platforms that don't support the xscvhpdp > instruction still fail. The attached patch addresses the issue for the > Power platforms. Looks good to me, and should also work fine for non-Power platforms, since then HAS_XSCVHPDP will simply never be defined.
Committed patch to fux the xscvhpdp assembly code issue. commit 025bdca23b2fca0dd68b6077426500237aee85c5 (origin/master, origin/HEAD) Author: Carl Love <cel@us.ibm.com> Date: Thu Nov 12 15:18:23 2020 -0600 PowerPC, fix for conv_f16_to_double xscvhpdp assembler code The previous commit: commit eb82a294573d15c1be663673d55b559a82ca29d3 Author: Julian Seward <jseward@acm.org> Date: Tue Nov 10 21:10:48 2020 +0100 Add a missing ifdef, whose absence caused build breakage on non-POWER targets. fixed the compile issue in conv_f16_to_double() where non-Power platforms do not support the power xscvhpdp assembly instructions. The instruction is supported by ISA 3.0 platforms. Older Power platforms still fail to compile with the assembly instruction. This patch fixes the if def for power systems that do not support ISA 3.0.
Next set of patches is in bugzilla 429352.
The gdb support for debugging the user application on Valgrind appears to have been broken by the reduced precision outer product patch. It looks like the addition of the new ACC registers is the issue. Need to update the gdbserver files to add the ACC register descriptions and size of data being transferred.