Bug 429354 - PPC ISA 3.1 support is missing, part 8
Summary: PPC ISA 3.1 support is missing, part 8
Status: CLOSED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: vex
Version: unspecified
Platform: Other Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Julian Seward
 
Reported: 2020-11-19 15:55 UTC by Carl Love
Modified: 2021-03-04 19:33 UTC

Attachments
Functional support for ISA 3.1, VSX Mask manipulation operations (18.42 KB, patch)
2020-11-19 15:55 UTC, Carl Love
Details
Test cases for VSX Mask Manipulation operations (27.27 KB, patch)
2020-11-19 15:56 UTC, Carl Love
Details
Functional support for ISA 3.1, VSX Mask manipulation operations (26.88 KB, patch)
2021-02-25 18:18 UTC, Carl Love
Details

Description Carl Love 2020-11-19 15:55:37 UTC
Created attachment 133461 [details]
Functional support for ISA 3.1, VSX Mask manipulation operations

Patches for ISA 3.1 PowerPC, part 8
Comment 1 Carl Love 2020-11-19 15:56:49 UTC
Created attachment 133462 [details]
Test cases for VSX Mask Manipulation operations

Add test case patch
Comment 2 Julian Seward 2021-02-13 09:34:06 UTC
==================

https://bugs.kde.org/attachment.cgi?id=133462
Test cases for VSX Mask Manipulation operations

OK to land

==================

https://bugs.kde.org/attachment.cgi?id=133461
Functional support for ISA 3.1, VSX Mask manipulation operations

I have some concerns about the verbosity/inefficiency of the generated IR.

+static IRExpr * copy_MSB_bit_fields (  IRExpr *src, UInt size )
+{
+   /* Input src is a V128 value.  Input size is the number of bits in each
+    * vector field.  The function copies the MSB of each vector field into
+    * the low order bits of the 64-bit result.
+    */

You can do it like this, but it looks very expensive.  Note that there is
an IROp which (if I understand this correctly) does what you want:

      /* MISC CONVERSION -- get high bits of each byte lane, a la
         x86/amd64 pmovmskb */
      Iop_GetMSBs8x16, /* V128 -> I16 */

You could use that instead (and I'd encourage you to do so), although
it does mean you'd have to handle this in the POWER backend, obviously.

Although, now that I think about it, I am not sure what the deal is with
endianness -- Iop_GetMSBs8x16 has been used so far only on little-endian
targets.
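
For reference, the front-end side might look something like this (a
hypothetical fragment, not from the patch; it assumes the usual
newTemp/assign/unop/mkexpr IR builders, with Iop_16Uto64 widening the
I16 result for a 64-bit destination):

   IRTemp vSrc   = newTemp( Ity_V128 );
   IRTemp result = newTemp( Ity_I64 );
   /* ... vSrc gets the source VSR value ... */
   /* MSBs of the 16 byte lanes -> I16, then widen to I64 */
   assign( result,
           unop( Iop_16Uto64,
                 unop( Iop_GetMSBs8x16, mkexpr( vSrc ) ) ) );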

If Power has an instruction that can add all the lanes of a SIMD value
together, creating a single number, then there's a faster way to do this that
doesn't involve Iop_GetMSBs8x16.  Something like this:

   uint16_t getMSBs_8x16(vec128)
   {
      let hiHalf = vec128[127:64]  // LE numbering
      let loHalf = vec128[ 63:0]
      // In each byte lane, copy the MSB to all bit positions
      hiHalf = shift_right_signed_8x8(hiHalf, 7);
      loHalf = shift_right_signed_8x8(loHalf, 7);
      // Now each byte lane is either 0x00 or 0xFF
      // Make (eg) lane 7 contain either 0x00 or 0x80, lane 6 contain
      // either 0x00 or 0x40, etc
      hiHalf &= 0x8040201008040201;
      loHalf &= 0x8040201008040201;
      hi8msbs = add_across_lanes_8x8(hiHalf)
      lo8msbs = add_across_lanes_8x8(loHalf)
      return (hi8msbs << 8) | lo8msbs;
   }

There are variants, but you get the general idea from the above.

+   if (IFIELD(theInstr, 1, 5) == 0xA)    //mtvsrbmi
+      inst_select = 0x9999;

Add a comment to explain that this is a special-case hack for mtvsrbmi.
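
For example, something as simple as this would do (hypothetical wording):

+   if (IFIELD(theInstr, 1, 5) == 0xA)    // special-case hack for mtvsrbmi
+      inst_select = 0x9999;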

+   default:
+      vex_printf("dis_VSR_byte_mask (opc2)\n");
+      return False;
+   }

Don't print on failure paths; only return False.
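
That is, the default arm should reduce to just:

+   default:
+      return False;
+   }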


+   for(i = 0; i< max; i++) {
+      ones_lo[i+1] = newTemp( Ity_I64 );
+      test_lo[i] = newTemp( Ity_I64 );
+      ones_hi[i+1] = newTemp( Ity_I64 );
+      test_hi[i] = newTemp( Ity_I64 );
..

This seems really inefficient; is there no way to do this somewhat
in parallel using SIMD IROps?

==================
Comment 3 Julian Seward 2021-02-15 16:14:57 UTC
(In reply to Julian Seward from comment #2)

>    uint16_t getMSBs_8x16(vec128)
>    {
>       let hiHalf = vec128[127:64]  // LE numbering
>       let loHalf = vec128[ 63:0]
>       // In each byte lane, copy the MSB to all bit positions
>       hiHalf = shift_right_signed_8x8(hiHalf, 7);
>       loHalf = shift_right_signed_8x8(loHalf, 7);
>       // Now each byte lane is either 0x00 or 0xFF
>       // Make (eg) lane 7 contain either 0x00 or 0x80, lane 6 contain
>       // either 0x00 or 0x40, etc
>       hiHalf &= 0x8040201008040201;
>       loHalf &= 0x8040201008040201;
>       hi8msbs = add_across_lanes_8x8(hiHalf)
>       lo8msbs = add_across_lanes_8x8(loHalf)
>       return (hi8msbs << 8) | lo8msbs;
>    }

One more thought, regarding add_across_lanes_8x8().  In fact you can
do this semi-reasonably using standard 64-bit scalar code, because of the
nature of the values involved.  Specifically, we are adding together 8 bytes,
each of which is either zero or has a single 1 bit in a distinct location.
Hence there will never be any carry-bit propagation at all in the addition,
and so it can be implemented -- for this particular use case only -- as

   uint64_t add_across_lanes_8x8(uint64_t a) {
      a += (a >> 8);
      a += (a >> 16);
      a += (a >> 32);
      return a;
   }

(I *think*)
Comment 4 Julian Seward 2021-02-15 16:16:10 UTC
return a & 0xFF, I meant.
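
Putting the last two comments together, a self-contained scalar model of the
whole scheme (plain C, just for sanity-checking the bit-twiddling; the helper
names mirror the pseudocode above and are not from the patch):

   #include <stdint.h>

   /* Per-lane "x >>s 7" for 8 byte lanes packed in a uint64_t.  Not a
      general arithmetic shift: it only replicates each lane's MSB
      across the lane, which is all this algorithm needs. */
   static uint64_t shift_right_signed_8x8_by7(uint64_t x)
   {
      /* Isolate each lane's MSB (bit 8*i+7), move it to bit 8*i, then
         multiply by 0xFF to smear it across the lane; lanes cannot
         interfere, since 0x01 * 0xFF still fits in one byte. */
      return ((x & 0x8080808080808080ULL) >> 7) * 0xFF;
   }

   /* Valid only because each lane is 0x00 or a distinct power of 2, so
      the additions never carry between lanes (comment 3); the result
      is masked to the low byte (the comment 4 fix). */
   static uint64_t add_across_lanes_8x8(uint64_t a)
   {
      a += a >> 8;
      a += a >> 16;
      a += a >> 32;
      return a & 0xFF;
   }

   static uint16_t getMSBs_8x16(uint64_t hiHalf, uint64_t loHalf)
   {
      /* In each byte lane, copy the MSB to all bit positions. */
      hiHalf = shift_right_signed_8x8_by7(hiHalf);
      loHalf = shift_right_signed_8x8_by7(loHalf);
      /* Leave one distinct bit per lane: lane i keeps bit i. */
      hiHalf &= 0x8040201008040201ULL;
      loHalf &= 0x8040201008040201ULL;
      return (uint16_t)((add_across_lanes_8x8(hiHalf) << 8)
                        | add_across_lanes_8x8(loHalf));
   }

For example, getMSBs_8x16(0xFF00000000000000ULL, 0x80ULL) returns 0x8001:
bit 15 comes from the top lane of the high half, bit 0 from the bottom lane
of the low half.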
Comment 5 Carl Love 2021-02-25 18:18:52 UTC
Created attachment 136162 [details]
Functional support for ISA 3.1, VSX Mask manipulation operations

Updated the patch per comments.

I looked at the suggestions on how to make copy_MSB_bit_fields() more efficient.  PPC64 can do the basic algorithm that was outlined on a single V128 value, except for the sum across lanes, which would need to be done with a clean helper.  In studying the code, I realized that copy_MSB_bit_fields() is very similar to the ISA 3.1 instruction vgnb; the difference is that the result is stored in the low-order bits rather than the high-order bits.  I rewrote copy_MSB_bit_fields() using the vgnb implementation as a base, leveraging the existing clean helpers.  The new function is much more efficient.

Although I didn't use the suggested algorithm, it was very helpful to study.  I realized that a number of the other instructions in the patch could be implemented much more efficiently using an arithmetic right shift, without copy_MSB_bit_fields() or the "for(i = 0; i< max; i++)" loop that Julian had concerns about.

Reimplemented the vexpandbm, vexpanddm, vexpandhm, vexpandwm instructions,
eliminating the call to copy_MSB_bit_fields() and the need for the
for(i = 0; i< max; i++) loop.  The new implementation is much more efficient.
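
As an illustration of the arithmetic-right-shift idea at byte granularity,
a hypothetical IR fragment (not the actual patch; it assumes Iop_SarN8x16
and the usual binop/mkexpr/mkU8 builders):

   IRTemp vB  = newTemp( Ity_V128 );
   IRTemp res = newTemp( Ity_V128 );
   /* ... vB gets the source vector ... */
   /* An arithmetic shift right by 7 replicates each byte lane's MSB
      across the lane, giving 0x00 or 0xFF per lane, which is what
      vexpandbm needs. */
   assign( res, binop( Iop_SarN8x16, mkexpr( vB ), mkU8( 7 ) ) );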

Reimplemented the mtvsrbm, mtvsrhm, mtvsrwm, mtvsrdm instructions to remove
the need for the for(i = 0; i< max; i++) loop.  The new implementation is
much more efficient.
 
The for(i = 0; i< max; i++) loop has been removed.
Comment 6 Julian Seward 2021-03-04 17:38:35 UTC
Thanks for the fixes.  r+; please land.
Comment 7 Carl Love 2021-03-04 19:31:59 UTC
Patches committed.

commit 3c31707cd3d795bcdd3afc540b5aad75d1e50008 (HEAD -> master, origin/master, origin/HEAD)
Author: Carl Love <cel@us.ibm.com>
Date:   Mon Nov 16 19:53:22 2020 -0600

    VSX Mask Manipulation operation tests.

commit 82777ee4080a339fb16d970411b743598b00590c
Author: Carl Love <cel@us.ibm.com>
Date:   Fri May 1 23:49:33 2020 -0500

    ISA 3.1 VSX Mask Manipulation Operations
    
    Add support for:
    
    mtvsrbm     Move to VSR Byte Mask
    mtvsrbmi    Move To VSR Byte Mask Immediate
    mtvsrdm     Move to VSR Doubleword Mask
    mtvsrhm     Move to VSR Halfword Mask
    mtvsrqm     Move to VSR Quadword Mask
    mtvsrwm     Move to VSR Word Mask
    vcntmbb     Vector Count Mask Bits Byte
    vcntmbd     Vector Count Mask Bits Doubleword
    vcntmbh     Vector Count Mask Bits Halfword
    vcntmbw     Vector Count Mask Bits Word
    vexpandbm   Vector Expand Byte Mask
    vexpanddm   Vector Expand Doubleword Mask
    vexpandhm   Vector Expand Halfword Mask
    vexpandqm   Vector Expand Quadword Mask
    vexpandwm   Vector Expand Word Mask
    vextractbm  Vector Extract Byte Mask
    vextractdm  Vector Extract Doubleword Mask
    vextracthm  Vector Extract Halfword Mask
    vextractqm  Vector Extract Quadword Mask
    vextractwm  Vector Extract Word Mask
Comment 8 Carl Love 2021-03-04 19:33:31 UTC
closing