Created attachment 133461 [details]
Functional support for ISA 3.1, VSX Mask manipulation operations

Patches for ISA 3.1 Power PC, part 8
Created attachment 133462 [details]
Test cases for VSX Mask Manipulation operations

Add test case patch
==================
https://bugs.kde.org/attachment.cgi?id=133462
Test cases for VSX Mask Manipulation operations

OK to land

==================
https://bugs.kde.org/attachment.cgi?id=133461
Functional support for ISA 3.1, VSX Mask manipulation operations

I have some concerns about the verboseness/inefficiency of the generated IR.

+static IRExpr * copy_MSB_bit_fields ( IRExpr *src, UInt size )
+{
+   /* Input src is a V128 value.  Input size is the number of bits in each
+    * vector field.  The function copies the MSB of each vector field into
+    * the low order bits of the 64-bit result.
+    */

You can do it like this, but it looks very expensive.  Note that there is
an IROp which (if I understand this correctly) does what you want:

   /* MISC CONVERSION -- get high bits of each byte lane, a la
      x86/amd64 pmovmskb */
   Iop_GetMSBs8x16, /* V128 -> I16 */

You could use that instead (and I'd encourage you to do so), although it
does mean you'd have to handle this in the POWER backend, obviously.
Although now I think about it, I am not sure what the deal is with
endianness -- Iop_GetMSBs8x16 has been used so far only on little-endian
targets.

If Power has an instruction that can add all the lanes of a SIMD value
together, creating a single number, then there's a faster way to do this,
that doesn't involve Iop_GetMSBs8x16.  Something like this:

uint16_t getMSBs_8x16(vec128)
{
   let hiHalf = vec128[127:64]  // LE numbering
   let loHalf = vec128[ 63:0]

   // In each byte lane, copy the MSB to all bit positions
   hiHalf = shift_right_signed_8x8(hiHalf, 7);
   loHalf = shift_right_signed_8x8(loHalf, 7);

   // Now each byte lane is either 0x00 or 0xFF

   // Make (eg) lane 7 contain either 0x00 or 0x80, lane 6 contain
   // either 0x00 or 0x40, etc
   hiHalf &= 0x8040201008040201;
   loHalf &= 0x8040201008040201;

   hi8msbs = add_across_lanes_8x8(hiHalf)
   lo8msbs = add_across_lanes_8x8(loHalf)

   return (hi8msbs << 8) | lo8msbs;
}

There are variants, but you get the general idea from the above.

+   if (IFIELD(theInstr, 1, 5) == 0xA)   //mtvsrbmi
+      inst_select = 0x9999;

Add a comment to explain that this is a special-case hack for mtvsrbmi.

+   default:
+      vex_printf("dis_VSR_byte_mask (opc2)\n");
+      return False;
+   }

Don't print on failure paths; only return False.

+   for(i = 0; i< max; i++) {
+      ones_lo[i+1] = newTemp( Ity_I64 );
+      test_lo[i] = newTemp( Ity_I64 );
+      ones_hi[i+1] = newTemp( Ity_I64 );
+      test_hi[i] = newTemp( Ity_I64 );
..

This seems really inefficient; is there no way to do this somewhat in
parallel using SIMD IROps?

==================
(In reply to Julian Seward from comment #2)
> uint16_t getMSBs_8x16(vec128)
> {
>    let hiHalf = vec128[127:64]  // LE numbering
>    let loHalf = vec128[ 63:0]
>
>    // In each byte lane, copy the MSB to all bit positions
>    hiHalf = shift_right_signed_8x8(hiHalf, 7);
>    loHalf = shift_right_signed_8x8(loHalf, 7);
>
>    // Now each byte lane is either 0x00 or 0xFF
>
>    // Make (eg) lane 7 contain either 0x00 or 0x80, lane 6 contain
>    // either 0x00 or 0x40, etc
>    hiHalf &= 0x8040201008040201;
>    loHalf &= 0x8040201008040201;
>
>    hi8msbs = add_across_lanes_8x8(hiHalf)
>    lo8msbs = add_across_lanes_8x8(loHalf)
>
>    return (hi8msbs << 8) | lo8msbs;
> }

One more thought, regarding add_across_lanes_8x8().  In fact you can do
this semi-reasonably using standard 64-bit scalar code, because of the
nature of the values involved.  Specifically, we are adding together 8
bytes, each of which is either zero or it has a 1 bit in a different
location.  Hence there will never be any carry-bit propagation at all in
the addition, and so it can be implemented -- for this particular use
case only -- as

uint64_t add_across_lanes_8x8(uint64_t a)
{
   a += (a >> 8);
   a += (a >> 16);
   a += (a >> 32);
   return a;
}

(I *think*)
return a & 0xFF, I meant.
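For reference, the full scalar scheme sketched in the two comments above
works out to something like the following self-contained C model.  This is
an illustrative sketch only, not code from the patch; the function names
follow the pseudocode, and the test vector in main() is made up.

#include <stdint.h>
#include <stdio.h>

/* Carry-free lane sum: each byte of 'a' is either 0x00 or carries a
   single distinct bit (0x80, 0x40, ..., 0x01), so the additions never
   carry between lanes and the packed result fits in the low byte. */
static uint64_t add_across_lanes_8x8 ( uint64_t a )
{
   a += (a >> 8);
   a += (a >> 16);
   a += (a >> 32);
   return a & 0xFF;
}

/* Per-byte arithmetic shift right by 7: each byte becomes 0x00 or 0xFF
   according to its MSB.  Relies on '>>' of a negative value shifting
   arithmetically, which is implementation-defined in C but is what
   mainstream compilers do. */
static uint64_t shift_right_signed_8x8 ( uint64_t v )
{
   uint64_t out = 0;
   for (int lane = 0; lane < 8; lane++) {
      int8_t b = (int8_t)(v >> (8 * lane));
      out |= (uint64_t)(uint8_t)(b >> 7) << (8 * lane);
   }
   return out;
}

static uint16_t getMSBs_8x16 ( uint64_t hiHalf, uint64_t loHalf )
{
   hiHalf = shift_right_signed_8x8(hiHalf);
   loHalf = shift_right_signed_8x8(loHalf);
   /* Leave one distinct bit per lane so the lane sum packs the MSBs. */
   hiHalf &= 0x8040201008040201ULL;
   loHalf &= 0x8040201008040201ULL;
   return (uint16_t)( (add_across_lanes_8x8(hiHalf) << 8)
                      | add_across_lanes_8x8(loHalf) );
}

int main ( void )
{
   /* MSBs set in byte lanes 0 and 2 of the low half -> expect 0x0005. */
   printf("0x%04x\n", (unsigned)getMSBs_8x16(0, 0x0000000000800080ULL));
   return 0;
}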
Created attachment 136162 [details]
Functional support for ISA 3.1, VSX Mask manipulation operations

Updated the patch per comments.

I looked at the suggestions on how to make copy_MSB_bit_fields() more
efficient.  PPC64 can do the basic algorithm that was outlined on a single
V128 value, except for the sum across lanes, which would have to be done
using a clean helper.

In studying the code, I realized that copy_MSB_bit_fields() is very similar
to the ISA 3.1 instruction vgnb.  The difference is that the result is
stored in the low-order bits rather than the high-order bits.  I rewrote
copy_MSB_bit_fields() using the vgnb implementation as a base, leveraging
the existing clean helpers.  The new function is much more efficient.
Although I didn't use the suggested algorithm, it was very helpful to study.

I also realized that a number of the other instructions in the patch could
be implemented much more efficiently using an arithmetic right shift,
without copy_MSB_bit_fields() or the "for(i = 0; i< max; i++)" loop that
Julian had concerns about:

  - Reimplemented the vexpandbm, vexpanddm, vexpandhm and vexpandwm
    instructions, eliminating the call to copy_MSB_bit_fields() and the
    need for the loop.

  - Reimplemented the mtvsrbm, mtvsrhm, mtvsrwm and mtvsrdm instructions,
    removing the need for the loop.

The "for(i = 0; i< max; i++)" loop has been removed entirely.
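To illustrate the arithmetic-right-shift rework described above, here is a
hypothetical scalar model of the semantics for one of the expand
instructions -- not the VEX IR from the patch, just the effect it computes:

#include <stdint.h>
#include <stdio.h>

/* Scalar model of the vexpandwm idea, on one 64-bit half of the vector:
   arithmetic-shift each signed 32-bit lane right by 31, so a lane whose
   MSB is set expands to 0xFFFFFFFF and one whose MSB is clear becomes 0.
   The patch achieves the same effect with a single vector arithmetic
   shift in the IR instead of a per-lane loop. */
static uint64_t vexpandwm_half ( uint64_t half )
{
   uint64_t out = 0;
   for (int lane = 0; lane < 2; lane++) {
      int32_t w = (int32_t)(half >> (32 * lane));
      out |= (uint64_t)(uint32_t)(w >> 31) << (32 * lane);
   }
   return out;
}

int main ( void )
{
   /* Low lane MSB set, high lane clear -> 0x00000000ffffffff. */
   printf("0x%016llx\n",
          (unsigned long long)vexpandwm_half(0x0000000180000000ULL));
   return 0;
}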
Thanks for the fixes. r+; please land.
Patches committed.

commit 3c31707cd3d795bcdd3afc540b5aad75d1e50008 (HEAD -> master, origin/master, origin/HEAD)
Author: Carl Love <cel@us.ibm.com>
Date:   Mon Nov 16 19:53:22 2020 -0600

    VSX Mask Manipulation operation tests.

commit 82777ee4080a339fb16d970411b743598b00590c
Author: Carl Love <cel@us.ibm.com>
Date:   Fri May 1 23:49:33 2020 -0500

    ISA 3.1 VSX Mask Manipulation Operations

    Add support for:
      mtvsrbm     Move to VSR Byte Mask
      mtvsrbmi    Move To VSR Byte Mask Immediate
      mtvsrdm     Move to VSR Doubleword Mask
      mtvsrhm     Move to VSR Halfword Mask
      mtvsrqm     Move to VSR Quadword Mask
      mtvsrwm     Move to VSR Word Mask
      vcntmbb     Vector Count Mask Bits Byte
      vcntmbd     Vector Count Mask Bits Doubleword
      vcntmbh     Vector Count Mask Bits Halfword
      vcntmbw     Vector Count Mask Bits Word
      vexpandbm   Vector Expand Byte Mask
      vexpanddm   Vector Expand Doubleword Mask
      vexpandhm   Vector Expand Halfword Mask
      vexpandqm   Vector Expand Quadword Mask
      vexpandwm   Vector Expand Word Mask
      vextractbm  Vector Extract Byte Mask
      vextractdm  Vector Extract Doubleword Mask
      vextracthm  Vector Extract Halfword Mask
      vextractqm  Vector Extract Quadword Mask
      vextractwm  Vector Extract Word Mask
closing